Errors while running on the Oxford-IIIT Pets Dataset on Google Cloud

arptejan95 commented 6 years ago

We have been trying to run the steps mentioned here; the only difference is that we used ssd_mobilenet_v2_coco.

With runtime-version 1.6, training stops after some steps with the error below:

The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 370, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error

We also tried runtime-version 1.2, but got the error below:

  in build
    functools.partial(tf.data.TFRecordDataset, buffer_size=8 * 1000 * 1000),
AttributeError: 'module' object has no attribute 'data'

We then replaced every tf.data with tf.contrib.data, but got the error below:

  in read_dataset
    records_dataset = filename_dataset.apply(
AttributeError: 'RepeatDataset' object has no attribute 'apply'
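Both AttributeErrors above point at dataset features that the installed TensorFlow does not expose. As a rough sketch (not part of the Object Detection API), a pre-flight check along these lines could report what is missing before a job is submitted. The function name `missing_dataset_features` and the `FakeOldRuntime` stand-in are hypothetical, and the check only looks at the two attributes named in the error messages.

```python
def missing_dataset_features(tf_module):
    """Return the names of dataset features absent from `tf_module`.

    Only the two attributes named in the errors above are checked:
    `tf.data.TFRecordDataset` and the `apply` method on dataset objects.
    """
    data = getattr(tf_module, "data", None)
    if data is None:
        # Corresponds to: 'module' object has no attribute 'data'
        return ["tf.data"]

    missing = []
    if not hasattr(data, "TFRecordDataset"):
        missing.append("tf.data.TFRecordDataset")

    dataset_cls = getattr(data, "Dataset", None)
    if dataset_cls is None or not hasattr(dataset_cls, "apply"):
        # Corresponds to: 'RepeatDataset' object has no attribute 'apply'
        missing.append("Dataset.apply")
    return missing


class FakeOldRuntime(object):
    """Hypothetical stand-in for a TensorFlow module without `tf.data`."""


print(missing_dataset_features(FakeOldRuntime))  # prints ['tf.data']
```

With TensorFlow installed, the same function can be called as `missing_dataset_features(tf)` after `import tensorflow as tf`; an empty list means both attributes are present.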

Any help would be appreciated! Thanks!

jashshopin commented 6 years ago

I'm facing a similar error as well.

yhliang2018 commented 6 years ago

Adding the code owner for more input. @pkulzc, feel free to add more people to the thread.

pkulzc commented 6 years ago

Currently, training on cloud with TF 1.4+ is not working, as mentioned here:
https://github.com/tensorflow/models/issues/3071
https://github.com/tensorflow/models/issues/3788

This is a known issue and we're investigating. We're also doing some consolidation, so this issue will go away anyway once the consolidation is done. If you really need to train on cloud, you can use an earlier version of my repo.

jashshopin commented 6 years ago

@pkulzc would this work with ssd_mobilenet_v2_coco?

KingsonSingh commented 6 years ago

Hi, I am getting an error when I run the command:

python train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2_pets.config

  in build
    functools.partial(tf.data.TFRecordDataset, buffer_size=8 * 1000 * 1000),
AttributeError: 'module' object has no attribute 'data'

I then replaced every tf.data with tf.contrib.data, but got the error below:

  in read_dataset
    records_dataset = filename_dataset.apply(
AttributeError: 'RepeatDataset' object has no attribute 'apply'

Any help would be appreciated! Thanks!

pkulzc commented 6 years ago

@KingsonSingh Sorry for the late response. Your issue is different from this one. If the problem still happens, please open a separate issue and provide more details (following the instructions here).