power76 commented 4 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

[ ] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
[x] I am reporting the issue to the correct repository. (Model Garden official or research directory)
[x] I checked to make sure that this issue has not already been filed.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/...

2. Describe the bug

When I run the model training on Google Cloud as described in the instructions, I get this error:

The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.

    Traceback (most recent call last):
      File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
        "__main__", fname, loader, pkg_name)
      File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
        exec code in run_globals
      File "/root/.local/lib/python2.7/site-packages/object_detection/model_main.py", line 23, in <module>
        import tensorflow.compat.v1 as tf
    ImportError: No module named v1

I have followed all the steps in the repository, but the only thing that gets past this error is changing the Google Cloud runtime version to 1.15, and that introduces new errors.

3. Steps to reproduce

Steps to reproduce the behavior.

4. Expected behavior

A clear and concise description of what you expected to happen.

5. Additional context

Include any logs that would be helpful to diagnose the problem.

6. System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Cloud
Mobile device name if the issue happens on a mobile device:
TensorFlow installed from (source or binary):
TensorFlow version (use command below):
Python version: 2.7
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version:
GPU model and memory:

pkulzc commented 4 years ago

TensorFlow version is required to be 1.15 now. What are the other new errors?
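To check which TensorFlow version the job environment actually has, a small script like the one below could be run first. This is only a sketch: the helper names `version_tuple` and `is_required_version` are made up for illustration, and the 1.15 requirement is taken from the comment above, not from any other source.

```python
import re

# Requirement as stated in this thread (major, minor).
REQUIRED = (1, 15)


def version_tuple(version):
    """Return (major, minor) parsed from a version string such as '1.15.2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % (version,))
    return int(match.group(1)), int(match.group(2))


def is_required_version(version, required=REQUIRED):
    """True when the major.minor part of `version` equals `required`."""
    return version_tuple(version) == required


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this environment")
    else:
        status = "ok" if is_required_version(tf.__version__) else "mismatch"
        print("TensorFlow %s: %s" % (tf.__version__, status))
```

If this reports a mismatch on the training replicas, the runtime version passed when submitting the job is the first thing to look at.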

power76 commented 4 years ago

> TensorFlow version is required to be 1.15 now. What are the other new errors?

Hi, thanks for your reply. When I use GCP runtime version 1.15 to follow the pet detector training tutorial, I get the following error:

The replica worker 0 ran out-of-memory and exited with a non-zero status of 9(SIGKILL).

So I changed cloud.yml to use masterType: large_model_v100, but got the same error. I thought other models might work, so I tried faster_rcnn_inception_v2 and ssd_mobilenet_v1_coco. Unfortunately, that produced new errors like the one below:

The replica master 0 exited with a non-zero status of 1. Termination reason: Error.

    Traceback (most recent call last):
      [...]
        save_relative_paths=True)
      File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/training/saver.py", line 828, in __init__
        self.build()
      File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/training/saver.py", line 840, in build
        self._build(self._filename, build_save=True, build_restore=True)
      File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/training/saver.py", line 878, in _build
        build_restore=build_restore)
      File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/training/saver.py", line 502, in _build_internal
        restore_sequentially, reshape)
      File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/training/saver.py", line 381, in _AddShardedRestoreOps
        name="restore_shard"))
      File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
        restore_sequentially)
      File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
        return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
      File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
        name=name)
      File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
        op_def=op_def)
      File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
        return func(*args, **kwargs)
      File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
        attrs, op_def, compute_device)
      File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
        op_def=op_def)
      File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
        self._traceback = tf_stack.extract_stack()

I am totally confused.
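One thing that may be worth checking for the out-of-memory error: the message names the worker replica, while the change described in this comment only touched masterType. The sketch below builds a job config in which the master and the workers share one machine type. It assumes the `trainingInput` layout of an AI Platform Training job config; the helper `build_job_config` and the worker count of 1 are hypothetical, any parameter-server settings are left out, and whether a larger worker actually removes the error is untested.

```python
import json


def build_job_config(machine_type, runtime_version="1.15", python_version="2.7"):
    """Build a job config dict in which master and workers share one machine type."""
    return {
        "trainingInput": {
            "runtimeVersion": runtime_version,
            "pythonVersion": python_version,
            "scaleTier": "CUSTOM",
            "masterType": machine_type,
            # Assumption: workers get the same machine type as the master.
            "workerType": machine_type,
            # Hypothetical value; use the count from your own cloud.yml.
            "workerCount": 1,
        }
    }


config = build_job_config("large_model_v100")
print(json.dumps(config, indent=2))
```

The same keys would go under `trainingInput` in cloud.yml; the printed JSON is only there to show the structure.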

power76 commented 4 years ago

I have successfully run training for the model "faster_rcnn_resnet101_pets" on GCP. I think the errors above probably came from submitting the job from Windows: after I followed the instructions on a Linux platform, all of those errors went away. But when I try the model "ssd_mobilenet_v1_coco", the errors remain as above:

master-replica-0 Command '['python', '-m', u'object_detection.model_main', u'--model_dir=gs://mymodel_bucket/model_dir', u'--pipeline_config_path=gs://mymodel_bucket/data/ssd_mobilenet_v1_pets.config', '--job-dir', u'gs://mymodel_bucket/model_dir']' returned non-zero exit status 1.

csingh27 commented 4 years ago

I am still struggling with this error. Any solutions?

tensorflow / models

google cloud ImportError: No module named v1 #8683
