marcbelmont / cnn-watermark-removal

Fully convolutional deep neural network to remove transparent overlays from images

Crashes on training #23

Open JohnBakery opened 5 years ago

JohnBakery commented 5 years ago

When I run

G:\Users\user\Desktop\cnn>C:\Users\user\AppData\Local\Programs\Python\Python36\python.exe watermarks.py --logdir=save/

The trainer crashes after exactly 17000 TFRecords with the following message

Traceback (most recent call last):
  File "watermarks.py", line 295, in <module>
    tf.app.run()
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "watermarks.py", line 290, in main
    train(sess, globals()[FLAGS.dataset])
  File "watermarks.py", line 188, in train
    min_opacity, max_opacity)
  File "G:\Users\user\Desktop\cnn\dataset.py", line 16, in batch_masks
    for _ in range(FLAGS.batch_size)], 0)
  File "G:\Users\user\Desktop\cnn\dataset.py", line 16, in <listcomp>
    for _ in range(FLAGS.batch_size)], 0)
  File "G:\Users\user\Desktop\cnn\dataset.py", line 39, in create_mask
    mask, tf.random_uniform([], -max_angle, max_angle, tf.float32))  # Costly
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\contrib\image\python\ops\image_ops.py", line 75, in rotate
    interpolation=interpolation)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\contrib\image\python\ops\image_ops.py", line 170, in transform
    images, transforms, interpolation=interpolation.upper())
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\contrib\image\ops\gen_image_ops.py", line 94, in image_projective_transform
    interpolation=interpolation, name=name)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 2632, in create_op
    set_shapes_for_outputs(ret)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 1911, in set_shapes_for_outputs
    shapes = shape_func(op)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 595, in call_cpp_shape_fn
    require_shape_fn)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 654, in _call_cpp_shape_fn_impl
    input_tensors_as_shapes, status)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\contextlib.py", line 88, in __exit__
    next(self.gen)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered 'ImageProjectiveTransform' in binary running on USER-PC. Make sure the Op and Kernel are registered in the binary running in this process.

Am I doing something wrong?

marcbelmont commented 5 years ago

Maybe it's an issue with the version of the packages. Can you do pip freeze?

JohnBakery commented 5 years ago

bleach==1.5.0
colorama==0.4.1
cycler==0.10.0
decorator==4.3.2
html5lib==0.9999999
ipython==6.0.0
ipython-genutils==0.2.0
jedi==0.13.2
Markdown==3.0.1
matplotlib==2.0.0
numpy==1.12.1
olefile==0.46
parso==0.3.4
pickleshare==0.7.5
Pillow==4.1.0
prompt-toolkit==1.0.15
protobuf==3.6.1
Pygments==2.3.1
pyparsing==2.3.1
python-dateutil==2.8.0
pytz==2018.9
simplegeneric==0.8.1
six==1.12.0
tensorflow==1.3.0
tensorflow-tensorboard==0.1.5
traitlets==4.3.2
wcwidth==0.1.7
Werkzeug==0.14.1
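
With a freeze list this long, a few lines of Python can pull out just the packages of interest. This is only an illustrative sketch: `parse_freeze` is a hypothetical helper, not part of the repository, and it assumes every relevant entry has the `name==version` form.

```python
def parse_freeze(text):
    """Parse `pip freeze` output into a {package_name: version} dict.

    Entries may be separated by spaces or newlines. Anything that is not of
    the form name==version (for example editable installs) is skipped.
    """
    packages = {}
    for entry in text.split():
        name, sep, version = entry.partition("==")
        if sep and name and version:
            packages[name.lower()] = version
    return packages


if __name__ == "__main__":
    # A few entries copied from the list above.
    freeze_output = (
        "numpy==1.12.1 protobuf==3.6.1 "
        "tensorflow==1.3.0 tensorflow-tensorboard==0.1.5"
    )
    installed = parse_freeze(freeze_output)
    # Show only the TensorFlow-related packages.
    for name, version in sorted(installed.items()):
        if name.startswith("tensorflow"):
            print(name, version)
```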

marcbelmont commented 5 years ago

Thanks. I don't see anything wrong. Is it consistently crashing after 17000 training steps?

JohnBakery commented 5 years ago

Yes, exactly at 17000, every single time. If I restart without deleting the tfrecords files, it will crash right away. If I delete them, it will run until 17000. Looking at the files voc-17000.tfrecords is 65MB, while all others are ~105MB. Not sure if that matters.

marcbelmont commented 5 years ago

This one is smaller because it is the last one (it contains fewer images). You can try removing it.
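
To check which record files are smaller than the rest, something like the following could be used. It is a rough sketch using only the standard library; the directory passed in, the `*.tfrecords` pattern, and the 0.8 threshold are assumptions, and both function names are made up for illustration.

```python
from pathlib import Path


def record_sizes_mb(directory, pattern="*.tfrecords"):
    """Return (file name, size in MB) pairs for matching files, sorted by name."""
    sizes = []
    for path in sorted(Path(directory).glob(pattern)):
        sizes.append((path.name, path.stat().st_size / (1024 * 1024)))
    return sizes


def smaller_than_typical(sizes, ratio=0.8):
    """Return names of files smaller than `ratio` times the largest file."""
    if not sizes:
        return []
    largest = max(size for _, size in sizes)
    return [name for name, size in sizes if size < ratio * largest]


if __name__ == "__main__":
    # "." is a placeholder; point this at the folder holding the record files.
    sizes = record_sizes_mb(".")
    for name, size in sizes:
        print(f"{name}: {size:.1f} MB")
    print("Noticeably smaller:", smaller_than_typical(sizes))
```

A single smaller file at the end is what the explanation above would predict; several small files might instead hint at an interrupted write.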

JohnBakery commented 5 years ago

I deleted voc-17000.tfrecords and restarted training, and it crashes right away with the same message.

marcbelmont commented 5 years ago

It looks like a Windows-specific issue: https://github.com/tensorflow/tensorflow/issues/9672. Try using tensorflow==1.4.0 instead.
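
A script could guard against running on an older release with a check along these lines. This is a minimal sketch, assuming 1.4.0 is the minimum as suggested above; `version_tuple` and `meets_minimum` are hypothetical helpers, and pre-release strings such as `0.4.0rc1` are deliberately not handled.

```python
def version_tuple(version):
    """Convert a release string such as '1.4.0' into a tuple of ints.

    Only plain dotted numeric versions are handled; pre-release suffixes
    such as 'rc1' are not supported by this sketch.
    """
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed, minimum="1.4.0"):
    """Return True if `installed` is at least `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    # With TensorFlow installed, the running version is tf.__version__.
    for candidate in ("1.3.0", "1.4.0"):
        print(candidate, meets_minimum(candidate))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating "1.10.0" as older than "1.4.0".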

JohnBakery commented 5 years ago

I updated to 1.4.0 and it solved the crashing issue. However, it gets stuck after saying Shuffle buffer filled.

WARNING:tensorflow:From G:\Users\user\Desktop\cnn\dataset.py:110: TFRecordDataset.__init__ (from tensorflow.contrib.data.python.ops.readers) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.TFRecordDataset`.
2019-02-19 16:30:40.833944: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 159 of 10000
2019-02-19 16:30:50.826511: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 320 of 10000
2019-02-19 16:30:57.231075: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:121] Shuffle buffer filled.

I thought this could be due to tensorflow-tensorboard being incompatible, since when I switched to 1.4.0 I got the following message:

tensorflow 1.4.0 has requirement tensorflow-tensorboard<0.5.0,>=0.4.0rc1, but you'll have tensorflow-tensorboard 0.1.5 which is incompatible.

so I updated to 0.4.0rc1, but it still hangs at Shuffle buffer filled.

So I let it run and apparently it is doing something

2019-02-19 16:51:05.815504: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 147 of 10000
2019-02-19 16:51:15.779815: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 248 of 10000
2019-02-19 16:51:25.787975: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 379 of 10000
2019-02-19 16:51:29.268296: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:121] Shuffle buffer filled.
2019-02-19 18:37:48.078317: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 153 of 10000
2019-02-19 18:37:58.045792: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 307 of 10000
2019-02-19 18:38:05.494099: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:121] Shuffle buffer filled.
2019-02-19 20:24:20.950043: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 157 of 10000
2019-02-19 20:24:30.912401: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 317 of 10000
2019-02-19 20:24:38.561894: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:121] Shuffle buffer filled.
2019-02-19 22:10:38.268626: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 155 of 10000
2019-02-19 22:10:48.274232: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 314 of 10000
2019-02-19 22:10:55.083987: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:121] Shuffle buffer filled.

For how long should I let it run?
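
One way to tell a hang from slow progress is to measure the time between successive "Shuffle buffer filled." lines. In the log above they recur roughly every 1 hour 46 minutes, which suggests the process is cycling rather than stuck, although the log alone does not show what happens in between. The sketch below is a diagnostic aid only; the function names are made up, and the timestamp format is assumed to match the lines shown.

```python
import re
from datetime import datetime

# Matches the timestamp at the start of a TensorFlow log line.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)")


def filled_times(log_lines, marker="Shuffle buffer filled."):
    """Return the timestamps of all lines that contain `marker`."""
    times = []
    for line in log_lines:
        match = TIMESTAMP.match(line)
        if match and marker in line:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    return times


def gaps(times):
    """Return the timedelta between each consecutive pair of timestamps."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    # Timestamps copied from the log above; the source file paths are shortened.
    log = [
        "2019-02-19 16:51:29.268296: I shuffle_dataset_op.cc:121] Shuffle buffer filled.",
        "2019-02-19 18:38:05.494099: I shuffle_dataset_op.cc:121] Shuffle buffer filled.",
        "2019-02-19 20:24:38.561894: I shuffle_dataset_op.cc:121] Shuffle buffer filled.",
        "2019-02-19 22:10:55.083987: I shuffle_dataset_op.cc:121] Shuffle buffer filled.",
    ]
    for gap in gaps(filled_times(log)):
        print(gap)
```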

toyssamurai commented 5 years ago

I want to chime in, too. I also got to "Shuffle buffer filled", and the last one I got was about an hour ago. Did yours eventually finish?