tensorflow / models

Models and examples built with TensorFlow

google cloud ImportError: No module named v1 #8683

Open power76 opened 4 years ago

power76 commented 4 years ago

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/...

2. Describe the bug

When I run the model training on Google Cloud as instructed, I get this error:

The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/model_main.py", line 23, in <module>
    import tensorflow.compat.v1 as tf
ImportError: No module named v1

I have followed all the steps in the repository, but the only way I found to get past this error is to change the Google Cloud runtime version to 1.15, and that introduces new errors.
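For anyone hitting the same ImportError, a quick way to confirm what the environment actually provides is to probe the import directly before submitting a job. The following is only a minimal sketch: the helper names `module_available` and `describe_tf_compat` are made up for illustration and are not part of the Object Detection API.

```python
import importlib


def module_available(name):
    """Return True if `name` can be imported in the current environment."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


def describe_tf_compat():
    """Report whether the import used by model_main.py would succeed here."""
    if not module_available("tensorflow"):
        return "tensorflow is not installed"
    if not module_available("tensorflow.compat.v1"):
        return "tensorflow is installed, but tensorflow.compat.v1 is missing"
    return "tensorflow.compat.v1 is available"


if __name__ == "__main__":
    print(describe_tf_compat())
```

Running this locally only describes the local environment. The cloud job uses whatever TensorFlow build the selected runtime version ships with.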

pkulzc commented 4 years ago

TensorFlow 1.15 is now required. What are the other new errors?
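The 1.15 requirement could be checked up front with something like the sketch below. It assumes the requirement means "any 1.15.x release", which is an interpretation of the comment above, and the function names are hypothetical.

```python
REQUIRED_MAJOR_MINOR = (1, 15)  # taken from the comment above


def parse_major_minor(version):
    """Turn a version string such as '1.15.2' into the tuple (1, 15)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_required_version(version, required=REQUIRED_MAJOR_MINOR):
    """Return True when `version` is in the required major.minor series."""
    return parse_major_minor(version) == required


if __name__ == "__main__":
    for candidate in ("1.12.0", "1.15.2", "2.1.0"):
        print(candidate, is_required_version(candidate))
```

In a real environment the string to check would come from `tf.__version__`.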

power76 commented 4 years ago

> tensorflow version is required to be 1.15 now. What are the other new errors?

Hi, thanks for your reply. When I use runtime version 1.15 on GCP and follow the pet detector training tutorial, I get this error:

The replica worker 0 ran out-of-memory and exited with a non-zero status of 9(SIGKILL).

So I revised cloud.yml to set masterType: large_model_v100, but got the same error. Thinking that other models might work, I tried faster_rcnn_inception_v2 and ssd_mobilenet_v1_coco. Unfortunately, new errors appeared:

The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  [...]
    save_relative_paths=True)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/training/saver.py", line 828, in __init__
    self.build()
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/training/saver.py", line 840, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/training/saver.py", line 878, in _build
    build_restore=build_restore)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/training/saver.py", line 502, in _build_internal
    restore_sequentially, reshape)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/training/saver.py", line 381, in _AddShardedRestoreOps
    name="restore_shard"))
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

I am totally confused.
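On the cloud.yml change described in this comment: the out-of-memory message names "worker 0", while the edit changed only masterType. The sketch below renders a `trainingInput` fragment that sets the worker machine type as well. It is an illustration under assumptions, not the tutorial's actual file: `render_training_input` is a hypothetical helper, the worker count is a placeholder, other fields the tutorial's cloud.yml may contain (such as parameter server settings) are omitted, and nothing in this thread confirms that a larger worker resolves the error.

```python
def render_training_input(master_type, worker_type, worker_count,
                          runtime_version="1.15", python_version="2.7"):
    """Render a minimal AI Platform trainingInput fragment as YAML text.

    The defaults mirror this thread: runtime 1.15, and Python 2.7 as seen
    in the traceback paths. All other values are supplied by the caller.
    """
    lines = [
        "trainingInput:",
        '  runtimeVersion: "{}"'.format(runtime_version),
        '  pythonVersion: "{}"'.format(python_version),
        "  scaleTier: CUSTOM",
        "  masterType: {}".format(master_type),
        "  workerType: {}".format(worker_type),
        "  workerCount: {}".format(worker_count),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values: the same machine type for master and workers.
    print(render_training_input("large_model_v100", "large_model_v100", 1))
```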

power76 commented 4 years ago

I have successfully run training for the "faster_rcnn_resnet101_pets" model on GCP. I think the errors above probably came from running on Windows: after I followed the instructions on a Linux platform, they all went away. But when I try the "ssd_mobilenet_v1_coco" model, the errors remain, like this:

master-replica-0 Command '['python', '-m', u'object_detection.model_main', u'--model_dir=gs://mymodel_bucket/model_dir', u'--pipeline_config_path=gs://mymodel_bucket/data/ssd_mobilenet_v1_pets.config', '--job-dir', u'gs://mymodel_bucket/model_dir']' returned non-zero exit status 1.
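The log line above has the shape of the message Python's `subprocess.CalledProcessError` produces, which reports only the exit status and not the underlying cause. When reproducing such a failure locally, capturing the child's stderr usually reveals the actual traceback. A minimal sketch follows: `run_and_capture` is a hypothetical helper, and the command in the usage section is a stand-in, not the real training command.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run `cmd` and return (exit_code, stderr_text) instead of raising."""
    try:
        subprocess.run(
            cmd,
            check=True,                # raise CalledProcessError on failure
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,   # decode output as text
        )
    except subprocess.CalledProcessError as err:
        # str(err) only says "returned non-zero exit status N";
        # err.stderr carries what the child process actually printed.
        return err.returncode, err.stderr
    return 0, ""


if __name__ == "__main__":
    # Stand-in for a failing job: a child process that fails an import.
    code, stderr_text = run_and_capture(
        [sys.executable, "-c", "import no_such_module_xyz"]
    )
    print("exit status:", code)
    print(stderr_text)
```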

csingh27 commented 4 years ago

I am still struggling with this error. Any solutions?