Open BrownZong opened 5 years ago
In your logs:
No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/data/cuda/cuda-10.0/cuda/lib64:/data/cuda/cuda-10.0/cuda/lib64
So you are training on the CPU, and TensorFlow is not compiled for your CPU:
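A quick way to see which CUDA libraries are absent from the directories listed in that log line is to walk `LD_LIBRARY_PATH` by hand. This is only a minimal sketch: the library names in `ASSUMED_CUDA_LIBS` are an assumption about what a CUDA 10.0 build of TensorFlow 1.14 loads, and `find_library` and `missing_libraries` are hypothetical helpers, not part of TensorFlow.

```python
import os

# Assumption: shared libraries a CUDA 10.0 build of TensorFlow 1.14 tries to load.
# Adjust the names to whatever the "No such file or directory" message mentions.
ASSUMED_CUDA_LIBS = ["libcudart.so.10.0", "libcublas.so.10.0", "libcudnn.so.7"]


def find_library(lib_name, search_path):
    """Return the directories of a path-separated search path that contain lib_name."""
    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory or directory in hits:
            continue  # skip empty entries and repeated directories
        if os.path.isfile(os.path.join(directory, lib_name)):
            hits.append(directory)
    return hits


def missing_libraries(lib_names, search_path):
    """Return the names from lib_names that are found in no directory of search_path."""
    return [name for name in lib_names if not find_library(name, search_path)]


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    for name in missing_libraries(ASSUMED_CUDA_LIBS, path):
        print("not found on LD_LIBRARY_PATH:", name)
```

The dynamic loader also searches its cache and the system default directories, so a miss here is a hint, not proof, that the library is unavailable.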
Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.
So it is not stuck; it is just very slow. By default, a log line appears only every 100 steps.
If you have issues with your CUDA installation, just use the Docker image provided by TensorFlow.
What is the top-level directory of the model you are using: models/research/object_detection/
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
TensorFlow installed from (source or binary): pip3 install tensorflow-gpu
TensorFlow version (use command below): 1.14.0
Bazel version (if compiling from source): N/A
CUDA/cuDNN version: CUDA 10.0, cuDNN 7.3.1
GPU model and memory: GeForce RTX 2080Ti 11G
Exact command to reproduce:
python3 object_detection/model_main.py \
    --pipeline_config_path="/data/code/vision_ori/models/research/object_detection/samples/configs/faster_rcnn_resnet50_coco.config" \
    --model_dir="/data/code/vision_ori/my_checkpoints" \
    --num_train_steps=200000 \
    --sample_1_of_n_eval_examples=1 \
    --alsologtostderr
Describe the problem
I want to retrain Faster R-CNN on the MSCOCO dataset from scratch with model_main.py. First I generated the tfrecord files from COCO2017 Detection using create_coco_tf_record.py, which gave me train/val files named like this: coco_train.record-00000-of-00100. After that, I ran model_main.py, and the command window printed many warning logs. Then it got stuck at "Saving checkpoints for 0 into /data/code/vision_ori/my_checkpoints/model.ckpt". I checked carefully and found that the process is stuck while building a new MonitoredSession object.
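Before starting a long training run it can be worth confirming that every tfrecord shard was actually written. The following is a minimal sketch, assuming the shards follow the `prefix-XXXXX-of-YYYYY` naming shown above and all share the same total; `missing_shards` is a hypothetical helper, not part of the Object Detection API.

```python
import os
import re


def missing_shards(directory, prefix):
    """Return the shard indices missing from directory.

    Expects files named "<prefix>-XXXXX-of-YYYYY", for example
    "coco_train.record-00000-of-00100".
    """
    pattern = re.compile(re.escape(prefix) + r"-(\d{5})-of-(\d{5})$")
    found = set()
    total = None
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            found.add(int(match.group(1)))
            total = int(match.group(2))  # assumed identical across all shards
    if total is None:
        raise FileNotFoundError("no shards matching %r in %r" % (prefix, directory))
    return [index for index in range(total) if index not in found]
```

For example, `missing_shards("/path/to/records", "coco_train.record")` returns an empty list when all shards are present; the directory here is a placeholder.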
Source code / logs
logs:
It gets stuck here for days and just won't go on.
building tf-record:
trying to train:
config file:
I would like to train from scratch so I removed 2 lines of code: