tensorflow / models

Models and examples built with TensorFlow

Not able to retrain SSD MobileNet V2 FPNLite on GCP AI Platform #9845

Open acidassassin opened 3 years ago

acidassassin commented 3 years ago


I am currently trying to migrate my former TF 1.15 Object Detection API scenario to TF2. I train all my models on GCP AI Platform, so what I have done now is try the whole thing with TF2: I downloaded the pretrained SSD MobileNet V2 FPNLite 320x320 model from the TF2 model zoo to use it for transfer learning. As instructed in the documentation, I cloned the repository and followed the steps to install the Object Detection API. Then I edited the model config file to fit my needs (I only changed num_classes, fine_tune_checkpoint and the path variables to point to my TFRecords). All my artifacts, such as the TFRecords, the config and the output dir, are stored in a GCP bucket. After this I tried to start the training on GCP AI Platform with the following command from inside the research folder:
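For reference, the edits I made amount to something like the following pipeline.config fragment (the bucket paths and num_classes below are placeholders for illustration, not my real values):

```
# Sketch of the edited pipeline.config; all paths and the class count
# are placeholders.
model {
  ssd {
    num_classes: 4  # set to the number of classes in your own dataset
    # ... rest of the shipped SSD MobileNet V2 FPNLite 320x320 config ...
  }
}
train_config {
  fine_tune_checkpoint: "gs://my-bucket/pretrained/ssd_mobilenet_v2_fpnlite_320x320/checkpoint/ckpt-0"
  # ... batch size, optimizer, etc. left as shipped ...
}
train_input_reader {
  label_map_path: "gs://my-bucket/data/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "gs://my-bucket/data/train.record"
  }
}
```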

gcloud ai-platform jobs submit training object_detection_`date +%m_%d_%Y_%H_%M_%S` \
    --runtime-version 2.4 \
    --python-version 3.7 \
    --job-dir=gs://${MODEL_DIR} \
    --package-path ./object_detection \
    --module-name object_detection.model_main_tf2 \
    --region us-central1 \
    --config ${PATH_TO_LOCAL_YAML_FILE} \
    -- \
    --model_dir=gs://${MODEL_DIR} \
    --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
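The ${PATH_TO_LOCAL_YAML_FILE} passed via --config is the AI Platform training config; for a job like this it typically looks something like the following (machine type and accelerator are assumptions on my side, pick whatever your quota allows):

```yaml
# Hypothetical config.yaml for --config; scaleTier, masterType and the
# accelerator are placeholders, not values from this job.
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-standard-8
  masterConfig:
    acceleratorConfig:
      count: 1
      type: NVIDIA_TESLA_V100
```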

I should also mention that I run this command inside a conda environment, just as I did for the setup. After executing the gcloud command, the training job is queued. It then takes a while for the first logs to show up, and suddenly the training stops with a bunch of error messages that are always the same:

    "jsonPayload": {
      "message": "NameError: name 'open' is not defined",
      "created": 1616860467.038573,
      "lineno": 328,
      "pathname": "/runcloudml.py",
      "levelname": "ERROR"
    },

    "jsonPayload": {
      "pathname": "/runcloudml.py",
      "levelname": "ERROR",
      "lineno": 328,
      "created": 1616860467.037546,
      "message": "W0327 15:54:26.939787 140658443433792 util.py:161] Unresolved object in checkpoint: (root).model._feature_extractor._fpn_features_generator.conv_layers.1.1.moving_mean"
    },

These kinds of errors repeat over and over until the training job gets canceled. As you can see from the gcloud command, I am using the latest TF2 Object Detection API with TF 2.4 and Python 3.7.

Does anyone know how to fix this, or is anyone facing the same problem?

Srikeshram commented 3 years ago

Could you please upload the code of runcloudml.py?

acidassassin commented 3 years ago

Unfortunately I cannot find it... I don't know whether this script is part of the Object Detection library or of TensorFlow itself.

acidassassin commented 3 years ago

But now, after changing fine_tune_checkpoint_type to "detection", I get the following errors (why are there so many problems with TF2 Object Detection?): The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.

    Traceback (most recent call last):
    [...]
      File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/checkpoint_management.py", line 807, in save
        save_path = self._checkpoint.write(prefix)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 2011, in write
        output = self._saver.save(file_prefix=file_prefix, options=options)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 1217, in save
        file_prefix_tensor, object_graph_tensor, options)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 1162, in _save_cached_when_graph_building
        save_op = saver.save(file_prefix, options=options)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/saving/functional_saver.py", line 300, in save
        return save_fn()
      File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/saving/functional_saver.py", line 287, in save_fn
        sharded_prefixes, file_prefix, delete_old_dirs=True)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 504, in merge_v2_checkpoints
        delete_old_dirs=delete_old_dirs, name=name, ctx=_ctx)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 528, in merge_v2_checkpoints_eager_fallback
        attrs=_attrs, ctx=ctx, name=name)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
        inputs, attrs, num_outputs)
    tensorflow.python.framework.errors_impl.NotFoundError: Error executing an HTTP request: HTTP response code 404 with body '{ "error": { "code": 404, "message": "Not Found", "errors": [ { "message": "Not Found", "domain": "global", "reason": "notFound" } ] } }' when deleting gs://tom-master-od-bucket/models/tf2_cocossdoid_output_240321_out/ckpt-3_temp/part-00000-of-00001.data-00000-of-00001 [Op:MergeV2Checkpoints]
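Since the NotFoundError is raised while MergeV2Checkpoints deletes a temporary checkpoint shard in the bucket, one thing worth checking is whether the job's service account can actually list, write and delete objects under the output directory. A rough sketch with gsutil (the `_write_probe` object name is made up; the bucket path is taken from the error message, adjust for your own project):

```shell
# Sanity checks for the GCS output directory; adjust BUCKET/MODEL_DIR
# for your own project.
BUCKET="gs://tom-master-od-bucket"
MODEL_DIR="$BUCKET/models/tf2_cocossdoid_output_240321_out"

# The 404 happened on a delete, so verify list, write and delete access.
if command -v gsutil >/dev/null 2>&1; then
  gsutil ls "$MODEL_DIR/" || echo "cannot list $MODEL_DIR"
  echo probe | gsutil cp - "$MODEL_DIR/_write_probe" &&
    gsutil rm "$MODEL_DIR/_write_probe"
else
  echo "gsutil not installed; run this from a Cloud SDK shell"
fi
```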