tensorflow / models

Models and examples built with TensorFlow

Model training hangs #9581

Open SerQuicky opened 3 years ago

SerQuicky commented 3 years ago

I am currently trying to train an object detection model using the TensorFlow Object Detection API. (I used this page for some guidance.)

System information

Using TensorFlow 2.3.0 on Google Colab.

The Colab can be found under https://colab.research.google.com/drive/1-c4ZvcZAN5dmtTNXAXleiefVBEEVRB8p?usp=sharing

Describe the current behavior

After I start training my model through model_main_tf2.py, the process "hangs": after some initial logs it no longer writes anything to the console, and no training steps or other output are printed.
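
(For reference, model_main_tf2.py is typically launched like this in Colab; the paths below are placeholders, not the exact ones from the linked notebook:)

!python model_main_tf2.py \
    --model_dir=/content/training_demo/models/my_model \
    --pipeline_config_path=/content/training_demo/models/my_model/pipeline.config \
    --alsologtostderr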

Describe the expected behavior

Training should start and logs should be written showing the current training progress.

Standalone code to reproduce the issue https://colab.research.google.com/drive/1-c4ZvcZAN5dmtTNXAXleiefVBEEVRB8p?usp=sharing

This is the pipeline.config that was used.

Other info / logs

These are the logs printed before the process "hangs":

2020-11-24 11:36:12.860350: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-11-24 11:36:16.666837: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-24 11:36:16.739962: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-11-24 11:36:16.740071: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (e5f060d04e25): /proc/driver/nvidia/version does not exist
2020-11-24 11:36:16.780569: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2300000000 Hz
2020-11-24 11:36:16.780823: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1d89100 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-24 11:36:16.780861: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
W1124 11:36:16.788454 139718268630912 cross_device_ops.py:1202] There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
I1124 11:36:16.788762 139718268630912 mirrored_strategy.py:341] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I1124 11:36:16.794680 139718268630912 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I1124 11:36:16.794888 139718268630912 config_util.py:552] Maybe overwriting use_bfloat16: False
INFO:tensorflow:Reading unweighted datasets: ['/root/models/shoe-dataset/train.record']
I1124 11:36:16.882893 139718268630912 dataset_builder.py:148] Reading unweighted datasets: ['/root/models/shoe-dataset/train.record']
INFO:tensorflow:Reading record datasets for input file: ['/root/models/shoe-dataset/train.record']
I1124 11:36:16.884125 139718268630912 dataset_builder.py:77] Reading record datasets for input file: ['/root/models/shoe-dataset/train.record']
INFO:tensorflow:Number of filenames to read: 1
I1124 11:36:16.884304 139718268630912 dataset_builder.py:78] Number of filenames to read: 1
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W1124 11:36:16.884416 139718268630912 dataset_builder.py:86] num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /root/models/research/object_detection/builders/dataset_builder.py:103: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_deterministic.
W1124 11:36:16.899756 139718268630912 deprecation.py:323] From /root/models/research/object_detection/builders/dataset_builder.py:103: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_deterministic.
WARNING:tensorflow:From /root/models/research/object_detection/builders/dataset_builder.py:222: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.map()
W1124 11:36:16.966850 139718268630912 deprecation.py:323] From /root/models/research/object_detection/builders/dataset_builder.py:222: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.map()
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
W1124 11:36:24.396016 139718268630912 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
seed2 arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
W1124 11:36:27.515866 139718268630912 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
seed2 arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
WARNING:tensorflow:From /root/models/research/object_detection/inputs.py:281: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W1124 11:36:29.340265 139718268630912 deprecation.py:323] From /root/models/research/object_detection/inputs.py:281: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
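
(Side note on the logs above: the cuInit / CUDA_ERROR_NO_DEVICE lines mean the Colab runtime has no GPU attached, so training falls back to CPU. A quick way to confirm, sketched with the standard TF 2.x API:)

import tensorflow as tf

# An empty list here means no GPU is visible and training will run on CPU only.
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices('GPU'))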

Saduf2019 commented 3 years ago

@ymodak I am able to replicate the reported issue; please find the gist here.

npapapietro commented 3 years ago

Experiencing the very same thing.

Windows 10, TF 2.4, Python 3.9, NVIDIA 2080

Is there any update on this?

gborn commented 3 years ago

Experiencing the same issue on both a Windows 10 PC and Google Colab, TensorFlow 2.4.

MaxPrimeAERY commented 3 years ago

Same issue!

gborn commented 3 years ago

It's solved now. The problem was that my generated record files were empty, which in my case was because the script couldn't find the training images. It was resolved after making sure the train folder path (and the test folder path) is correct.
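
(A quick way to catch this before training, sketched with standard TensorFlow APIs; the record path is an example, adjust it to your setup:)

import os
import tensorflow as tf

record_path = "/content/training_demo/annotations/train.record"  # example path

# A 0-byte file is the red flag described above: training will appear to hang.
print("File size (bytes):", os.path.getsize(record_path))

# Count the serialized examples actually stored in the TFRecord.
num_examples = sum(1 for _ in tf.data.TFRecordDataset(record_path))
print("Examples in record:", num_examples)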

npapapietro commented 3 years ago

Wait, is the location of the images hardcoded and not dependent on the location in the tfrecord?

example = tf1.train.Example(features=tf1.train.Features(feature={
              'image/height': int64_feature(height),
              'image/width': int64_feature(width),
              'image/filename': bytes_feature(local_filePath.encode('utf8')),
              'image/source_id': bytes_feature(local_filePath.encode('utf8')),
              #other stuff
            }))
nickls commented 3 years ago

I had a similar issue: model_main_tf2.py would hang on my local system and would stop without any message on Colab.

It ended up being a memory issue: my process spiked to 12.5 GB of memory on Colab and was silently killed.

I reduced the batch_size from 96 to 16 and the issue went away.
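
(If you hit the same thing, the batch size lives under train_config in pipeline.config. A minimal sketch using the Object Detection API's config_util helpers; the paths are examples:)

from object_detection.utils import config_util

pipeline_path = "/content/training_demo/models/my_model/pipeline.config"  # example path

configs = config_util.get_configs_from_pipeline_file(pipeline_path)
configs['train_config'].batch_size = 16  # smaller batches keep peak memory in check

# Write the updated config back to the model directory.
pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, "/content/training_demo/models/my_model")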

gborn commented 3 years ago

Wait, is the location of the images hardcoded and not dependent on the location in the tfrecord?

example = tf1.train.Example(features=tf1.train.Features(feature={
              'image/height': int64_feature(height),
              'image/width': int64_feature(width),
              'image/filename': bytes_feature(local_filePath.encode('utf8')),
              'image/source_id': bytes_feature(local_filePath.encode('utf8')),
              #other stuff
            }))

The issue was that my GenerateRecords script couldn't find any images and silently created an empty record file (0 bytes), so training never kicked off.
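
(To the question above: in the Object Detection API the training input is decoded from the record itself rather than re-read from disk, so the filename/source_id fields are metadata. The usual record-generation examples also store the encoded image bytes; a hedged sketch, assuming the dataset_util helpers and a JPEG source, with a hypothetical local_filepath:)

import tensorflow.compat.v1 as tf1
from object_detection.utils import dataset_util

local_filepath = "images/train/shoe_0001.jpg"  # hypothetical example path

# Read the raw JPEG bytes; these go into 'image/encoded', which the input
# pipeline decodes at training time. If no examples are written, training stalls.
with tf1.gfile.GFile(local_filepath, 'rb') as f:
    encoded_jpg = f.read()

example = tf1.train.Example(features=tf1.train.Features(feature={
    'image/encoded': dataset_util.bytes_feature(encoded_jpg),
    'image/format': dataset_util.bytes_feature(b'jpeg'),
    # ... plus height/width/filename/source_id and box/class features as in the snippet quoted above
}))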

amanrock005 commented 3 years ago

Wait, is the location of the images hardcoded and not dependent on the location in the tfrecord?

example = tf1.train.Example(features=tf1.train.Features(feature={
              'image/height': int64_feature(height),
              'image/width': int64_feature(width),
              'image/filename': bytes_feature(local_filePath.encode('utf8')),
              'image/source_id': bytes_feature(local_filePath.encode('utf8')),
              #other stuff
            }))

The issue was that my GenerateRecords script couldn't find any images and silently created an empty record file (0 bytes), so training never kicked off.

Do I need to configure some path in the generate_tfrecord.py file? I'm passing arguments to generate_tfrecord.py like this:

Create train data:

!python generate_tfrecord.py -x /content/training_demo/images/train -l /content/training_demo/annotations/label_map.pbtxt -o /content/training_demo/annotations/train.record

Create test data:

!python generate_tfrecord.py -x /content/training_demo/images/test -l /content/training_demo/annotations/label_map.pbtxt -o /content/training_demo/annotations/test.record

lizozom commented 2 years ago

@gborn thanks! This was my problem. I was following the tf blog tutorial, and it seems like the record generation snippet has the wrong image folder path (it needs to end with train / test).
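
(In other words, point the record-generation script at the split subfolder itself, not at the parent images folder. A small illustration with example paths:)

import glob
import os

images_dir = "/content/training_demo/images"  # example base folder

# The generator needs the train/ and test/ subfolders; pointing it at images_dir
# directly finds no files and yields empty (0-byte) record files.
train_images = glob.glob(os.path.join(images_dir, "train", "*.jpg"))
test_images = glob.glob(os.path.join(images_dir, "test", "*.jpg"))
print(len(train_images), "train images,", len(test_images), "test images")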