Closed: Bidski closed this issue 4 years ago
Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.

- What is the top-level directory of the model you are using
- Have I written custom code
- OS Platform and Distribution
- TensorFlow installed from
- TensorFlow version
- Bazel version
- CUDA/cuDNN version
- GPU model and memory
- Exact command to reproduce
Updated to make requested details more obvious
I've hit this issue too, although I was using faster_rcnn_resnet101_pets to train on my own dataset. Someone told me it might be caused by very small pictures in the training set, such as 15x30 pixels. I will try removing these samples from my dataset and training again. If there is any update, I will post it here.
UPDATE: After some investigation, I found that small samples are not the root cause of the crash. (I even trained with very small samples, such as 5x5 pixels, and at least in the first 200 steps no crash happened.) What I actually found were annotations with coordinates in the wrong order. The annotations mark the coordinates x1, y1, x2 and y2, where x1 should be less than x2, and likewise y1 less than y2. In my case, some of the annotated samples had x1 > x2 or y1 > y2, which caused the crash. After I corrected the order of the coordinates, the crash was gone. Hope this information helps someone.
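For anyone who wants to run the same check, here is a minimal sketch that scans Pascal VOC-style XML annotations for swapped coordinates. The directory layout and tag names (`annotations/*.xml`, `bndbox`, `xmin`, etc.) are assumptions; adjust them to your own annotation format.

```python
import glob
import xml.etree.ElementTree as ET

# Flag every bounding box whose coordinates are in the wrong order
# (xmin >= xmax or ymin >= ymax), which can lead to a NaN loss.
for xml_path in glob.glob("annotations/*.xml"):
    root = ET.parse(xml_path).getroot()
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        xmin = float(box.find("xmin").text)
        ymin = float(box.find("ymin").text)
        xmax = float(box.find("xmax").text)
        ymax = float(box.find("ymax").text)
        if xmin >= xmax or ymin >= ymax:
            print("Bad box in {}: ({}, {}, {}, {})".format(
                xml_path, xmin, ymin, xmax, ymax))
```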
I just re-checked my dataset. I have no entries where `x1 > x2` or `y1 > y2`. However, this error still persists for me.
@Bidski, when did you encounter this crash during training? At the very beginning, or only after running for a while? I suspect that for whoever meets this crash, it is somehow related to the dataset (perhaps the wrong order of the coordinates, or otherwise invalid coordinate values). Maybe you can split the dataset into pieces and train on each one to see which part of the data makes the crash happen again; the sketch below shows one way to do that. That way you can at least narrow down the scope of the crash-related data, which should make it easier to find the real cause of your crash.
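One way to do that bisection is to split the training TFRecord into shards and train on each shard separately. A minimal sketch, assuming a single input file named `train.record` (file names are hypothetical); it uses the TF 1.x `tf.python_io` API, which matches the TensorFlow versions mentioned in this thread:

```python
import tensorflow as tf

def split_tfrecord(input_path, output_prefix, num_shards):
    """Round-robin the records of one TFRecord file into num_shards files."""
    writers = [
        tf.python_io.TFRecordWriter(
            "{}-{:05d}-of-{:05d}".format(output_prefix, i, num_shards))
        for i in range(num_shards)
    ]
    for i, record in enumerate(tf.python_io.tf_record_iterator(input_path)):
        writers[i % num_shards].write(record)
    for writer in writers:
        writer.close()

# Train on each shard in turn; the shard that reproduces the NaN loss
# contains the problematic examples.
split_tfrecord("train.record", "train_shard.record", 4)
```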
I have the same problem. I tried:

- ssd_mobilenet_v1_coco
- ssd_mobilenet_v1_ppn
- ssd_mobilenet_v2_coco
- ssdlite_mobilenet_v2_coco

None of them work, but faster_rcnn_inception_v2_coco works well.
I have a similar problem. Maybe it's related to #4881.
Exact same problem here, trying @myuanz's set of networks with TensorFlow r1.8 with GPU support, compiled on 64-bit Windows 10.
I had this problem. It was solved when I checked:
```python
if object_area <= (width * height) / (16 * 16):
    raise Exception('object too small Error')
```
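For context, a sketch of how such a check could be applied while building the TFRecord, so tiny boxes are skipped instead of aborting the conversion. The function name and pixel-coordinate variables are assumptions; only the 1/(16*16) area threshold comes from the snippet above.

```python
def box_is_large_enough(xmin, ymin, xmax, ymax, width, height):
    """Return True if the box covers more than 1/(16*16) of the image area.

    Box coordinates are in pixels; width and height are the image dimensions.
    """
    object_area = (xmax - xmin) * (ymax - ymin)
    return object_area > (width * height) / (16 * 16)
```

Boxes that fail this test can simply be dropped from the example instead of raising an exception.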
Hi there, we are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing. If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.
@carlosfab you are absolutely right. You saved the day. Thank you
I am having issues similar to #1881 and #1907.
Using the Google object_detection API and the latest TensorFlow master repo built with CUDA 9.1 on Linux Mint 18.2 (based on Ubuntu Xenial).
- Have I written custom code: No, but custom dataset
- OS Platform and Distribution: Linux Mint 18.2 (based on Ubuntu 16.04)
- TensorFlow installed from: built and installed from GitHub master
- TensorFlow version: 1.8.0-rc0-cp35-cp35m-linux_x86_64
- Bazel version: 0.12.0
- CUDA/cuDNN version: CUDA 9.1, cuDNN 7.1
- GPU model and memory: GTX 1080 Ti, 11 GB
- Exact command to reproduce: `cd tensorflow/models/research && python3 object_detection/train.py --logtostderr --pipeline_config_path=/path/to/pipeline_config.pbtxt --train_dir=/path/to/train/folder`
Describe the problem
I am trying to fine-tune the ssd mobilenet v1 coco model using my own dataset. I am using the default config file that is provided in the object detection repository.
From the very first global step of training I receive the "LossTensor is inf or nan. : Tensor had NaN values" error. Things I have tried:
From what I can tell, these are all of the things that were suggested in #1881 and #1907; none of them have worked for me.
Source code / logs