tensorflow / models

Models and examples built with TensorFlow

TF2 Object Detection API training script model_main_tf2 not working - Stuck on Waiting for new checkpoint - Timed-out waiting for a checkpoint #8883

Closed IvanBrasilico closed 4 years ago

IvanBrasilico commented 4 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/object_detection/model_main_tf2.py

2. Describe the bug

After running for a while, model_main_tf2 gets stuck on "Waiting for new checkpoint". It then ends with the error "Timed-out waiting for a checkpoint".

3. Steps to reproduce

https://github.com/IvanBrasilico/ajna_bbox The TF2 installation steps are in the project README. Basically, I followed the steps described in the documentation: generate TFRecords for training, download a model definition and checkpoint, edit pipeline.config with the paths of the TFRecords, and run model_main_tf2.

4. Expected behavior

I expected the training procedure to run, or at least an error message to be shown.

5. Additional context

The complete model_main_tf2.py console output is at the end of this report.

6. System information

It is important to note that the example Colab notebook eager_few_shot_od_training_tf2.ipynb runs and trains the same model, in the same virtualenv on the same machine.

Complete environment information:

https://github.com/IvanBrasilico/ajna_bbox/blob/master/tf_env.txt

Complete model_main_tf2 output:

(venv) ivan@ivan-G7-7588:~/PycharmProjects/ajna_bbox$ python models/research/object_detection/model_main_tf2.py --model_dir=/home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/ --checkpoint_dir=/home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint --alsologtostderr --pipeline_config_path=bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/pipeline.config --use-tpu=true
WARNING:tensorflow:Forced number of epochs for all eval validations to be 1.
W0715 23:32:23.856509 140079432734464 model_lib_v2.py:905] Forced number of epochs for all eval validations to be 1.
INFO:tensorflow:Maybe overwriting sample_1_of_n_eval_examples: None
I0715 23:32:23.856632 140079432734464 config_util.py:552] Maybe overwriting sample_1_of_n_eval_examples: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0715 23:32:23.856686 140079432734464 config_util.py:552] Maybe overwriting use_bfloat16: False
INFO:tensorflow:Maybe overwriting eval_num_epochs: 1
I0715 23:32:23.856735 140079432734464 config_util.py:552] Maybe overwriting eval_num_epochs: 1
WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered eval_on_train_input_config.num_epochs = 0. Overwriting num_epochs to 1.
W0715 23:32:23.856801 140079432734464 model_lib_v2.py:920] Expected number of evaluation epochs is 1, but instead encountered eval_on_train_input_config.num_epochs = 0. Overwriting num_epochs to 1.
2020-07-15 23:32:23.881471: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1 2020-07-15 23:32:23.923686: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-07-15 23:32:23.924041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: pciBusID: 0000:01:00.0 name: GeForce GTX 1050 Ti computeCapability: 6.1 coreClock: 1.62GHz coreCount: 6 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 104.43GiB/s 2020-07-15 23:32:23.924195: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-9.0/lib64:/usr/local/cuda/extras/CUPTI/lib64 2020-07-15 23:32:23.924305: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-9.0/lib64:/usr/local/cuda/extras/CUPTI/lib64 2020-07-15 23:32:23.925568: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10 2020-07-15 23:32:23.925901: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10 2020-07-15 23:32:23.928778: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10 2020-07-15 23:32:23.928903: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: 
:/usr/local/cuda-9.0/lib64:/usr/local/cuda/extras/CUPTI/lib64 2020-07-15 23:32:23.932572: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2020-07-15 23:32:23.932610: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1598] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices... 2020-07-15 23:32:23.932881: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2020-07-15 23:32:23.939320: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2299965000 Hz 2020-07-15 23:32:23.939775: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x657f610 initialized for platform Host (this does not guarantee that XLA will be used). Devices: 2020-07-15 23:32:23.939791: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2020-07-15 23:32:23.941028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-07-15 23:32:23.941041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards. W0715 23:32:23.947229 140079432734464 dataset_builder.py:83] num_readers has been reduced to 1 to match input file shards. WARNING:tensorflow:From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_deterministic. W0715 23:32:23.949348 140079432734464 deprecation.py:323] From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_deterministic. WARNING:tensorflow:From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/builders/dataset_builder.py:175: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.map() W0715 23:32:23.965300 140079432734464 deprecation.py:323] From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/builders/dataset_builder.py:175: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. 
Instructions for updating: Use tf.data.Dataset.map()
WARNING:tensorflow:From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/inputs.py:79: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
W0715 23:32:29.178085 140079432734464 deprecation.py:323] From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/inputs.py:79: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
WARNING:tensorflow:From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/inputs.py:259: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead.
W0715 23:32:30.630500 140079432734464 deprecation.py:323] From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/inputs.py:259: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead.
INFO:tensorflow:Waiting for new checkpoint at /home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint
I0715 23:32:33.767113 140079432734464 checkpoint_utils.py:125] Waiting for new checkpoint at /home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint
INFO:tensorflow:Found new checkpoint at /home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0
I0715 23:32:33.767870 140079432734464 checkpoint_utils.py:134] Found new checkpoint at /home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0
WARNING:tensorflow:From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/eval_util.py:854: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead.
W0715 23:33:02.120177 140079432734464 deprecation.py:323] From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/eval_util.py:854: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead.
INFO:tensorflow:Finished eval step 0
I0715 23:33:11.245014 140079432734464 model_lib_v2.py:782] Finished eval step 0
WARNING:tensorflow:From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/utils/visualization_utils.py:618: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version. Instructions for updating: tf.py_func is deprecated in TF V2. Instead, there are two options available in V2.

W0715 23:33:11.261951 140079432734464 deprecation.py:323] From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/utils/visualization_utils.py:618: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version. Instructions for updating: tf.py_func is deprecated in TF V2. Instead, there are two options available in V2.

INFO:tensorflow:Performing evaluation on 21 images. I0715 23:33:30.778897 140079432734464 coco_evaluation.py:237] Performing evaluation on 21 images. creating index... index created! INFO:tensorflow:Loading and preparing annotation results... I0715 23:33:30.779220 140079432734464 coco_tools.py:116] Loading and preparing annotation results... INFO:tensorflow:DONE (t=0.00s) I0715 23:33:30.780228 140079432734464 coco_tools.py:138] DONE (t=0.00s) creating index... index created! Running per image evaluation... Evaluate annotation type bbox DONE (t=0.03s). Accumulating evaluation results... DONE (t=0.00s). Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.000 Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.000 Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.000 Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000 Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.000 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.024 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.024 Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000 Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.024 INFO:tensorflow:Eval metrics at step 0 I0715 23:33:30.815683 140079432734464 model_lib_v2.py:836] Eval metrics at step 0 INFO:tensorflow: + DetectionBoxes_Precision/mAP: 0.000143 I0715 23:33:30.818211 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Precision/mAP: 0.000143 INFO:tensorflow: + DetectionBoxes_Precision/mAP@.50IOU: 0.000286 I0715 23:33:30.818874 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Precision/mAP@.50IOU: 0.000286 INFO:tensorflow: + 
DetectionBoxes_Precision/mAP@.75IOU: 0.000000 I0715 23:33:30.819247 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Precision/mAP@.75IOU: 0.000000 INFO:tensorflow: + DetectionBoxes_Precision/mAP (small): -1.000000 I0715 23:33:30.819588 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Precision/mAP (small): -1.000000 INFO:tensorflow: + DetectionBoxes_Precision/mAP (medium): -1.000000 I0715 23:33:30.819919 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Precision/mAP (medium): -1.000000 INFO:tensorflow: + DetectionBoxes_Precision/mAP (large): 0.000215 I0715 23:33:30.820254 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Precision/mAP (large): 0.000215 INFO:tensorflow: + DetectionBoxes_Recall/AR@1: 0.000000 I0715 23:33:30.820581 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Recall/AR@1: 0.000000 INFO:tensorflow: + DetectionBoxes_Recall/AR@10: 0.023810 I0715 23:33:30.820914 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Recall/AR@10: 0.023810 INFO:tensorflow: + DetectionBoxes_Recall/AR@100: 0.023810 I0715 23:33:30.821241 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Recall/AR@100: 0.023810 INFO:tensorflow: + DetectionBoxes_Recall/AR@100 (small): -1.000000 I0715 23:33:30.821578 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Recall/AR@100 (small): -1.000000 INFO:tensorflow: + DetectionBoxes_Recall/AR@100 (medium): -1.000000 I0715 23:33:30.821907 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Recall/AR@100 (medium): -1.000000 INFO:tensorflow: + DetectionBoxes_Recall/AR@100 (large): 0.023810 I0715 23:33:30.822265 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Recall/AR@100 (large): 0.023810 INFO:tensorflow: + Loss/localization_loss: 0.189787 I0715 23:33:30.822557 140079432734464 model_lib_v2.py:839] + Loss/localization_loss: 0.189787 INFO:tensorflow: + Loss/classification_loss: 1.298645 I0715 23:33:30.822857 140079432734464 model_lib_v2.py:839] + Loss/classification_loss: 1.298645 INFO:tensorflow: + 
Loss/regularization_loss: 0.176113
I0715 23:33:30.823152 140079432734464 model_lib_v2.py:839] + Loss/regularization_loss: 0.176113
INFO:tensorflow: + Loss/total_loss: 1.664544
I0715 23:33:30.823446 140079432734464 model_lib_v2.py:839] + Loss/total_loss: 1.664544
INFO:tensorflow:Waiting for new checkpoint at /home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint
I0715 23:37:33.829480 140079432734464 checkpoint_utils.py:125] Waiting for new checkpoint at /home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint
INFO:tensorflow:Timed-out waiting for a checkpoint.
I0716 00:37:33.181403 140079432734464 checkpoint_utils.py:188] Timed-out waiting for a checkpoint.

Jacobsolawetz commented 4 years ago

@IvanBrasilico I hope this can help you!

I wrote a tutorial to train EfficientDet in Google Colab with the TensorFlow 2 Object Detection API.

You can run this tutorial by changing just one line for your custom dataset import. I hope this tutorial allows newcomers to the repository to quickly get up and running with TensorFlow 2 for object detection!

In the tutorial, I write how to:

1. Acquire Labeled Object Detection Data
2. Install TensorFlow 2 Object Detection Dependencies
3. Download Custom TensorFlow 2 Object Detection Dataset
4. Write Custom TensorFlow 2 Object Detection Training Configuration
5. Train Custom TensorFlow 2 Object Detection Model
6. Export Custom TensorFlow 2 Object Detection Weights
7. Use Trained TensorFlow 2 Object Detection For Inference on Test Images

sambhusuryamohan commented 4 years ago

@IvanBrasilico Please run the same command without the --checkpoint_dir option. Adding that option switches the script to evaluation-only mode, so no training happens. Hope that solves it.

The command to start training is:

python models/research/object_detection/model_main_tf2.py --model_dir=/home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/ --alsologtostderr --pipeline_config_path=bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/pipeline.config --use-tpu=true
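
The mode switch described in this comment can be pictured with a small sketch. This is not the source of model_main_tf2.py; it is a simplified stand-in that assumes only what the comment states, namely that passing --checkpoint_dir selects evaluation and omitting it selects training. The function name select_mode is hypothetical.

```python
import argparse


def select_mode(argv):
    """Return "eval" when --checkpoint_dir is given, otherwise "train".

    Hypothetical helper: it mirrors the behaviour described in this thread,
    not the real implementation of model_main_tf2.py.
    """
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--model_dir", required=True)
    parser.add_argument("--pipeline_config_path", required=True)
    parser.add_argument("--checkpoint_dir", default=None)
    # Any other flags (for example --alsologtostderr) are ignored here.
    args, _unknown = parser.parse_known_args(argv)
    return "eval" if args.checkpoint_dir else "train"


if __name__ == "__main__":
    base = [
        "--model_dir=/tmp/model",
        "--pipeline_config_path=/tmp/pipeline.config",
    ]
    print(select_mode(base))
    print(select_mode(base + ["--checkpoint_dir=/tmp/model/checkpoint"]))
```

Under this reading, the original command never trained anything: it evaluated the downloaded ckpt-0 and then waited for checkpoints that no training job was producing.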

IvanBrasilico commented 4 years ago

Thanks very much!! It was my fault. I am closing the issue.

As a comment, the TensorFlow ecosystem is great, but the Object Detection API needs better documentation, and even the code needs some cleaning. Because the documentation is poor, I went through a lot of trial and error with pipeline.config and the command line. The error messages are weird, so I tried to read the code. The code is complex and makes some assumptions: filenames have to follow certain patterns (like ckpt-0), and directories have to follow certain patterns, otherwise the code breaks.

Now I am training the net, but there are no messages showing what is really going on (like "found NN examples", "training XXX", loss, etc.). We only get messages when evaluation runs.

I am not a beginner. I have fine-tuned a lot of Keras networks for computer vision tasks before, and even built some of my own from scratch. I took all the courses from deeplearning.ai and others, and I am still having a very bad time simply trying to use the Object Detection API.

The Keras/TF2 API is great and very well documented. It would be great if this package reached the same quality. At the least, the Object Detection API needs a complete working example (generate the train/test sets, train and evaluate, save and export the model to production). The Colab is incomplete (no model saving or evaluation) and does not use the same patterns as the scripts (TFRecords, TensorFlow Serving export, etc.). The scripts and the pipeline config file take a lot of trial and error to learn. I will try a little more, because I am using TensorFlow model serving in production for other models and would like to stay with it, but if I fail more times I will end up moving to Matterport Mask-RCNN or even PyTorch/FastAI.

tazu786 commented 4 years ago

Hi @IvanBrasilico, are you able to run a proper evaluation job anyway, as described here and as you were trying to do by setting checkpoint_dir? I have trained my models and am now trying to get their performance in terms of mAP, but I get the error "Timed-out waiting for a new checkpoint".

nilskk commented 4 years ago

I am having the same problem. I trained my network, but I can't evaluate afterwards. It just says "Waiting for a new checkpoint". There has to be a way to first run training and then run evaluation afterwards based on a saved checkpoint???
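
The log in the original report shows the evaluator finding ckpt-0, evaluating it, waiting again, and timing out an hour later. A loop of the following shape would produce that sequence. This is a simplified stand-in, not the library's code: wait_for_new_checkpoint and list_checkpoints are hypothetical names, and the polling interval is an arbitrary assumption.

```python
import time


def wait_for_new_checkpoint(list_checkpoints, last_seen, timeout,
                            poll_interval=1.0,
                            clock=time.monotonic, sleep=time.sleep):
    """Poll until a checkpoint other than `last_seen` appears, or give up.

    `list_checkpoints` is a hypothetical callable returning checkpoint
    names sorted oldest to newest. Returns the newest name, or None when
    `timeout` seconds pass without a new checkpoint.
    """
    deadline = clock() + timeout
    while True:
        names = list_checkpoints()
        if names and names[-1] != last_seen:
            return names[-1]  # like "Found new checkpoint at ..."
        if clock() >= deadline:
            return None  # like "Timed-out waiting for a checkpoint."
        sleep(poll_interval)
```

If evaluation works this way, a job started after training has finished would evaluate the last saved checkpoint once and then time out, because nothing newer ever appears. In that case the timeout message would follow a completed evaluation rather than signal a failed one.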

radhikam01 commented 4 years ago

Hello @IvanBrasilico, were you able to find a solution for this? I am facing the same 'INFO:tensorflow:Timed-out waiting for a checkpoint.' error while trying to evaluate my model. I am attaching a screenshot of my config file as reference if that helps.

config file screenshot

Any help would be appreciated, thank you!

gilmotta commented 4 years ago

Hello everybody! I am having the same issue. Has anyone found a solution? @IvanBrasilico were you able to solve this problem? Could you please help me?

ojasvisancheti commented 3 years ago

I am facing the same problem. Did anyone find a solution?

joker9605 commented 3 years ago

May I know whether there is any solution?

ojasvisancheti commented 3 years ago

I found something on Stack Overflow: https://stackoverflow.com/questions/64510791/tf2-object-detection-api-model-main-tf2-py-validation-loss. But it is not helping me: if I run evaluation in another terminal, that process stops my training from generating checkpoints.

Please help if anyone finds something.

maroua-yam commented 2 years ago

For those who still have the same problem: in my case I solved it by changing the 'eval -time' value (presumably the eval_timeout flag) in model_main_tf2.py. By default it was 3600 seconds, so I increased it, after calculating the time between two checkpoints.
I hope it helps somebody.
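
The arithmetic behind that fix can be sketched as follows: take the longest gap between two consecutive checkpoint saves and pick a timeout comfortably above it. The function name suggest_eval_timeout and the 1.5 safety factor are assumptions made for illustration and are not part of the Object Detection API.

```python
def suggest_eval_timeout(checkpoint_times, safety_factor=1.5):
    """Suggest a timeout in seconds from checkpoint save times.

    `checkpoint_times` are save times in seconds (for example file
    modification times). The safety factor is an arbitrary assumption.
    """
    times = sorted(checkpoint_times)
    if len(times) < 2:
        raise ValueError("need at least two checkpoint times")
    # Longest wait the evaluator would have faced between two saves.
    longest_gap = max(later - earlier
                      for earlier, later in zip(times, times[1:]))
    # Scale by the safety factor and round up past the result.
    return int(longest_gap * safety_factor) + 1


if __name__ == "__main__":
    # Saves at 0 s, 4000 s and 9000 s: the longest gap is 5000 s,
    # which is already longer than the 3600 s default mentioned above.
    print(suggest_eval_timeout([0, 4000, 9000]))
```

The save times could be read from the modification times of the saved checkpoint files. Which files to inspect depends on the setup and is left open here.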