tensorflow / models

Models and examples built with TensorFlow

Errors while running on the Oxford-IIIT Pets Dataset on Google Cloud #3937

Closed: arptejan95 closed this issue 6 years ago

arptejan95 commented 6 years ago

We have been trying to run the steps mentioned here; the only difference is that we used ssd_mobilenet_v2_coco.

With runtime-version 1.6, training stops after some steps with the following error:

The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 370, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error

We also tried runtime-version 1.2, but hit the following error:

in build
    functools.partial(tf.data.TFRecordDataset, buffer_size=8 * 1000 * 1000),
AttributeError: 'module' object has no attribute 'data'

We then replaced every tf.data with tf.contrib.data, but hit this error instead:

in read_dataset
    records_dataset = filename_dataset.apply(
AttributeError: 'RepeatDataset' object has no attribute 'apply'

Any help would be appreciated! Thanks!

jashshopin commented 6 years ago

I'm facing a similar error as well.

yhliang2018 commented 6 years ago

Adding the code owner for more input. @pkulzc, feel free to add more people to the thread.

pkulzc commented 6 years ago

Training on cloud with TF 1.4+ currently does not work, as mentioned here:
https://github.com/tensorflow/models/issues/3071
https://github.com/tensorflow/models/issues/3788

This is a known issue and we're investigating. We're also doing some consolidation, so this issue will go away once the consolidation is done. If you really need to train on cloud, you can use an earlier version of my repo.

jashshopin commented 6 years ago

@pkulzc would this work with ssd_mobilenet_v2_coco?

KingsonSingh commented 6 years ago

Hi, I am getting an error when I run this command:
python train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2_pets.config

in build
    functools.partial(tf.data.TFRecordDataset, buffer_size=8 * 1000 * 1000),
AttributeError: 'module' object has no attribute 'data'

I then replaced every tf.data with tf.contrib.data, but hit this error instead:

in read_dataset
    records_dataset = filename_dataset.apply(
AttributeError: 'RepeatDataset' object has no attribute 'apply'

Any help would be appreciated! Thanks!

pkulzc commented 6 years ago

@KingsonSingh Sorry for the late response. Your issue is different from this one; please open a separate issue if the problem still happens, and provide more details (following the instructions here).

xinyuabcd commented 6 years ago

@KingsonSingh I'm getting a similar error as well. Have you solved it?

KingsonSingh commented 6 years ago

@xinyuabcd

Yes! You can solve this error by updating the tensorflow package.
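
The AttributeError above can be caught before training starts. The following is only a minimal sketch, assuming that tf.data first shipped in core TensorFlow with release 1.4 (the version mentioned later in this thread); the helper names parse_version and has_tf_data are hypothetical and not part of TensorFlow or the Object Detection API.

```python
import re

# Assumption: tf.data became part of core TensorFlow in release 1.4.
MIN_TF_DATA_VERSION = (1, 4)


def parse_version(version):
    """Return (major, minor) from a string such as '1.6.0' or '1.4.0-rc0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return int(match.group(1)), int(match.group(2))


def has_tf_data(version):
    """True if this TensorFlow version is expected to expose tf.data."""
    return parse_version(version) >= MIN_TF_DATA_VERSION
```

With TensorFlow installed, calling has_tf_data(tf.__version__) after import tensorflow as tf reports whether the installed package is recent enough. On Google Cloud, the equivalent knob is the runtime-version setting discussed at the top of this thread.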

xinyuabcd commented 6 years ago

@KingsonSingh
OK! Thanks!

shuizhilinxin commented 6 years ago

@KingsonSingh Besides upgrading TensorFlow to 1.4, is there any other way, for example adapting the code to TF 1.2? Thank you.

jvrhenen commented 6 years ago

@pkulzc Any update on support of TensorFlow 1.4+ on CloudML?

Abduoit commented 6 years ago

How come training on cloud with TF 1.4+ is not working, when Google Cloud (here) supports different environments, including TensorFlow 1.8 and Python 3.5?

thewozz commented 6 years ago

Any update concerning this problem?

pkulzc commented 6 years ago

This issue is obsolete; see our blog post for the latest tutorials.