SerQuicky opened this issue 3 years ago (status: Open)
@ymodak I am able to replicate the issue reported, please find the gist here.
I'm experiencing the very same thing.
Windows 10, TF 2.4, Python 3.9, Nvidia 2080
Is there any update on this?
Experiencing the same issue on both a Windows 10 PC and Google Colab, with TensorFlow 2.4.
Same issue!
It's solved now. The problem was that my generated record files were empty, which in my case was because the script couldn't find the train images. It was resolved after making sure the train folder path (and also the test folder path) was correct.
Wait, is the location of the images hardcoded and not dependent on the location in the tfrecord?
example = tf1.train.Example(features=tf1.train.Features(feature={
    'image/height': int64_feature(height),
    'image/width': int64_feature(width),
    'image/filename': bytes_feature(local_filePath.encode('utf8')),
    'image/source_id': bytes_feature(local_filePath.encode('utf8')),
    # other stuff
}))
I had a similar issue: model_main_tf2.py would hang on my local system and would stop without any message on Colab.
It ended up being a memory issue, with my process spiking to 12.5 GB of memory on Colab and getting silently killed.
I reduced the batch_size from 96 to 16 and the issue has gone away.
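For anyone who wants to script that change, here is a minimal sketch that lowers the training batch size in a pipeline.config by editing the text directly. It assumes the usual layout in which `batch_size` appears inside a `train_config { ... }` block; the helper name `set_train_batch_size` and the sample config are made up for illustration and are not part of the Object Detection API.

```python
import re


def set_train_batch_size(config_text: str, new_size: int) -> str:
    """Replace the first batch_size that follows `train_config` (hypothetical helper)."""
    start = config_text.find("train_config")
    if start == -1:
        raise ValueError("no train_config block found")
    head, tail = config_text[:start], config_text[start:]
    # Only touch the first match so a batch_size in eval_config is left alone.
    tail, replaced = re.subn(
        r"batch_size\s*:\s*\d+", f"batch_size: {new_size}", tail, count=1
    )
    if replaced == 0:
        raise ValueError("no batch_size field found after train_config")
    return head + tail


# Made-up sample standing in for a real pipeline.config.
sample = """train_config {
  batch_size: 96
  num_steps: 1000
}
eval_config {
  batch_size: 1
}
"""

updated = set_train_batch_size(sample, 16)
print(updated)
```

Editing the file by hand works just as well; the point is only that the value to lower is the one in the training block.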
Wait, is the location of the images hardcoded and not dependent on the location in the tfrecord?
The issue was that my GenerateRecords script couldn't find any images and silently created an empty record file (0 bytes), so training never kicked off.
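A quick way to catch this before starting training is to count the records in the generated file. The sketch below uses only the standard library and relies on the TFRecord framing (an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload). It assumes an uncompressed file and does not verify the checksums; `count_tfrecords` is a hypothetical helper name, not a TensorFlow function.

```python
import os
import struct


def count_tfrecords(path: str) -> int:
    """Count records in an uncompressed TFRecord file without verifying CRCs."""
    if os.path.getsize(path) == 0:
        return 0  # the 0-byte case described above
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # uint64 length + uint32 CRC of the length
            if not header:
                break  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            body = f.read(length + 4)  # payload + uint32 CRC of the payload
            if len(body) < length + 4:
                raise ValueError("truncated record payload")
            count += 1
    return count


# Usage (path taken from the logs in this thread; adjust to your setup):
# print(count_tfrecords("/root/models/shoe-dataset/train.record"))
```

If this returns 0, the record generation step is the place to look, not model_main_tf2.py.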
Do I need to configure some path in the generate_tfrecord.py file? I'm already passing arguments to generate_tfrecord.py like this:
!python generate_tfrecord.py -x /content/training_demo/images/train -l /content/training_demo/annotations/label_map.pbtxt -o /content/training_demo/annotations/train.record
!python generate_tfrecord.py -x /content/training_demo/images/test -l /content/training_demo/annotations/label_map.pbtxt -o /content/training_demo/annotations/test.record
@gborn thanks! This was my problem.
I was following the TF blog tutorial, and it seems the record generation snippet has the wrong image folder path (it needs to have train or test at the end).
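Since a wrong folder path fails silently here, a small pre-flight check before generating records can save time. The sketch below assumes the images and their Pascal VOC style .xml annotations sit together in the folder passed via -x, as in the commands quoted in this thread. The helper names and the list of image extensions are assumptions for illustration.

```python
from pathlib import Path

# Assumption: adjust to whatever image formats your dataset uses.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def summarize_dataset_dir(folder: str) -> dict:
    """Count image and .xml annotation files directly inside `folder`."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"not a directory: {root}")
    files = [p for p in root.iterdir() if p.is_file()]
    return {
        "images": sum(p.suffix.lower() in IMAGE_SUFFIXES for p in files),
        "annotations": sum(p.suffix.lower() == ".xml" for p in files),
    }


def require_non_empty(folder: str) -> dict:
    """Fail loudly instead of letting an empty record file be written."""
    summary = summarize_dataset_dir(folder)
    if summary["images"] == 0 or summary["annotations"] == 0:
        raise ValueError(f"{folder} looks empty or wrong: {summary}")
    return summary


# Usage (path from the commands above; adjust to your setup):
# print(require_non_empty("/content/training_demo/images/train"))
```

Running this against both the train and test folders before calling generate_tfrecord.py would surface a missing train or test suffix right away.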
I am currently trying to train an object detection model that uses the TensorFlow object detection API. (I used this page for some guidance)
System information
Using TensorFlow version 2.3.0 under Google Colab.
The Colab can be found under https://colab.research.google.com/drive/1-c4ZvcZAN5dmtTNXAXleiefVBEEVRB8p?usp=sharing
Describe the current behavior
After I start training my model through model_main_tf2.py, the process "hangs". After some initial logs, it does not log or write anything else to the console. It prints no steps or anything after this.
Describe the expected behavior
The training of the model should start and some logs should be written to display the current progress of the trained model.
Standalone code to reproduce the issue https://colab.research.google.com/drive/1-c4ZvcZAN5dmtTNXAXleiefVBEEVRB8p?usp=sharing
This is the pipeline.config I used.
Other info / logs These are the logs before the process "hangs".
2020-11-24 11:36:12.860350: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-11-24 11:36:16.666837: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-24 11:36:16.739962: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-11-24 11:36:16.740071: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (e5f060d04e25): /proc/driver/nvidia/version does not exist
2020-11-24 11:36:16.780569: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2300000000 Hz
2020-11-24 11:36:16.780823: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1d89100 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-24 11:36:16.780861: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
W1124 11:36:16.788454 139718268630912 cross_device_ops.py:1202] There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
I1124 11:36:16.788762 139718268630912 mirrored_strategy.py:341] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I1124 11:36:16.794680 139718268630912 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I1124 11:36:16.794888 139718268630912 config_util.py:552] Maybe overwriting use_bfloat16: False
INFO:tensorflow:Reading unweighted datasets: ['/root/models/shoe-dataset/train.record']
I1124 11:36:16.882893 139718268630912 dataset_builder.py:148] Reading unweighted datasets: ['/root/models/shoe-dataset/train.record']
INFO:tensorflow:Reading record datasets for input file: ['/root/models/shoe-dataset/train.record']
I1124 11:36:16.884125 139718268630912 dataset_builder.py:77] Reading record datasets for input file: ['/root/models/shoe-dataset/train.record']
INFO:tensorflow:Number of filenames to read: 1
I1124 11:36:16.884304 139718268630912 dataset_builder.py:78] Number of filenames to read: 1
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W1124 11:36:16.884416 139718268630912 dataset_builder.py:86] num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /root/models/research/object_detection/builders/dataset_builder.py:103: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_deterministic.
W1124 11:36:16.899756 139718268630912 deprecation.py:323] From /root/models/research/object_detection/builders/dataset_builder.py:103: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_deterministic.
WARNING:tensorflow:From /root/models/research/object_detection/builders/dataset_builder.py:222: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.map()
W1124 11:36:16.966850 139718268630912 deprecation.py:323] From /root/models/research/object_detection/builders/dataset_builder.py:222: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.map()
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
W1124 11:36:24.396016 139718268630912 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version. Instructions for updating: seed2 arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
W1124 11:36:27.515866 139718268630912 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version. Instructions for updating: seed2 arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
WARNING:tensorflow:From /root/models/research/object_detection/inputs.py:281: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead.
W1124 11:36:29.340265 139718268630912 deprecation.py:323] From /root/models/research/object_detection/inputs.py:281: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead.