rodrigo2019 / keras_yolo2


Error when training with a small dataset #11

Closed: ikerodl96 closed this issue 5 years ago

ikerodl96 commented 5 years ago

Hello @rodrigo2019,

First of all, thank you for sharing this nice work. This is the most usable, polished, and complete YOLO v2 implementation in Keras that I have found. In my case, I am having a problem when training with my own small dataset (fewer than 30 samples). I know that there is nothing to learn from such a small number of samples, but I will get more as soon as they are labeled; for now, the purpose was just to check whether all the code works for my particular configuration and dataset. I think the problem is related to the data generator. Just by looking at the traceback, can you guess what the problem is? I would appreciate it.

Here is the traceback:


Using TensorFlow backend.
2019-06-06 22:27:45.839343: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-06-06 22:27:45.839619: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x24571e0 executing computations on platform Host. Devices:
2019-06-06 22:27:45.839672: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2019-06-06 22:27:46.014073: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-06 22:27:46.014607: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x2456dc0 executing computations on platform CUDA. Devices:
2019-06-06 22:27:46.014662: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2019-06-06 22:27:46.015072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59 pciBusID: 0000:00:04.0 totalMemory: 14.73GiB freeMemory: 14.60GiB
2019-06-06 22:27:46.015105: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-06-06 22:27:46.490548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-06 22:27:46.490609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2019-06-06 22:27:46.490620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2019-06-06 22:27:46.491015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14115 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
100% 13/13 [00:00<00:00, 1985.87it/s]
Seen labels: {'ref209': 13, 'ref209_1': 13, 'ref209_3': 13, 'ref209_4': 13, 'ref209_5': 13, 'ref209_6': 13, 'ref209_7': 13, 'ref210_1': 13, 'tool209_1': 13, 'tool209_2': 13}
Given labels: ['ref209', 'ref209_1', 'ref209_3', 'ref209_4', 'ref209_5', 'ref209_6', 'ref209_7', 'ref210_1', 'tool209_1', 'tool209_2']
Overlap labels: {'ref209_3', 'ref209_5', 'ref209_6', 'ref210_1', 'ref209_1', 'tool209_2', 'ref209', 'tool209_1', 'ref209_7', 'ref209_4'}
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
Loading pretrained weights: ./backend_weights/full_yolo_backend.h5
(13, 13)


_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 416, 416, 3)       0
_________________________________________________________________
Full_YOLO_backend (Model)    (None, 13, 13, 1024)      50547936
_________________________________________________________________
Detection_layer (Conv2D)     (None, 13, 13, 75)        76875
_________________________________________________________________
YOLO_output (Reshape)        (None, 13, 13, 5, 15)     0
=================================================================
Total params: 50,624,811
Trainable params: 50,604,139
Non-trainable params: 20,672


WARNING:tensorflow:From /content/keras_yolov2_proyecto/keras_yolov2/yolo_loss.py:73: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Epoch 1/8
Traceback (most recent call last):
  File "train.py", line 127, in <module>
    main()
  File "train.py", line 123, in main
    score_threshold=config['valid']['score_threshold'])
  File "/content/keras_yolov2_proyecto/keras_yolov2/frontend.py", line 210, in train
    max_queue_size=max_queue_size)
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 181, in fit_generator
    generator_output = next(output_generator)
  File "/usr/local/lib/python3.6/dist-packages/keras/utils/data_utils.py", line 601, in get
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/keras/utils/data_utils.py", line 595, in get
    inputs = self.queue.get(block=True).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 670, in get
    raise self._value
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.6/dist-packages/keras/utils/data_utils.py", line 401, in get_index
    return _SHARED_SEQUENCES[uid][i]
  File "/content/keras_yolov2_proyecto/keras_yolov2/preprocessing.py", line 250, in __getitem__
    img, all_objs = self.aug_image(train_instance, jitter=self._jitter)
  File "/content/keras_yolov2_proyecto/keras_yolov2/preprocessing.py", line 359, in aug_image
    obj[attr] = int(obj[attr] * scale - offx)
KeyError: 'xmin'

rodrigo2019 commented 5 years ago

Which kind of dataset are you using, CSV or XML? Does this error occur during validation? Is your dataset split between training and validation?

ikerodl96 commented 5 years ago

Hello @rodrigo2019, thank you so much for your fast reply.

The annotations of my dataset are specified in XML (VOC format), my dataset is split into training and validation sets (80%-20%; I only provide the images and annotations for the training set), and the error happens during training. I attach an example:


Epoch 1/150

1/5 [=====>........................] - ETA: 25s - loss: 242.3871
2/5 [===========>..................] - ETA: 15s - loss: 240.8282
3/5 [=================>............] - ETA: 9s - loss: 231.9124
4/5 [=======================>......] - ETA: 4s - loss: 228.4962
Traceback (most recent call last):
  File "C:/Users/iotxoa/Desktop/keras_yolov2_proyecto/train.py", line 127, in <module>
    main()
  File "C:/Users/iotxoa/Desktop/keras_yolov2_proyecto/train.py", line 123, in main
    score_threshold=config['valid']['score_threshold'])
  File "C:\Users\iotxoa\Desktop\keras_yolov2_proyecto\keras_yolov2\frontend.py", line 210, in train
    max_queue_size=max_queue_size)
  File "C:\Users\iotxoa\AppData\Local\Programs\Python\Python36\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\iotxoa\AppData\Local\Programs\Python\Python36\lib\site-packages\keras\engine\training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "C:\Users\iotxoa\AppData\Local\Programs\Python\Python36\lib\site-packages\keras\engine\training_generator.py", line 181, in fit_generator
    generator_output = next(output_generator)
  File "C:\Users\iotxoa\AppData\Local\Programs\Python\Python36\lib\site-packages\keras\utils\data_utils.py", line 601, in get
    six.reraise(*sys.exc_info())
  File "C:\Users\iotxoa\AppData\Local\Programs\Python\Python36\lib\site-packages\six.py", line 693, in reraise
    raise value
  File "C:\Users\iotxoa\AppData\Local\Programs\Python\Python36\lib\site-packages\keras\utils\data_utils.py", line 595, in get
    inputs = self.queue.get(block=True).get()
  File "C:\Users\iotxoa\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 644, in get
    raise self._value
  File "C:\Users\iotxoa\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\iotxoa\AppData\Local\Programs\Python\Python36\lib\site-packages\keras\utils\data_utils.py", line 401, in get_index
    return _SHARED_SEQUENCES[uid][i]
  File "C:\Users\iotxoa\Desktop\keras_yolov2_proyecto\keras_yolov2\preprocessing.py", line 250, in __getitem__
    img, all_objs = self.aug_image(train_instance, jitter=self._jitter)
  File "C:\Users\iotxoa\Desktop\keras_yolov2_proyecto\keras_yolov2\preprocessing.py", line 359, in aug_image
    obj[attr] = int(obj[attr] * scale - offx)
KeyError: 'xmin'
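For reference, each of my annotations follows the standard VOC layout, where every object carries a <bndbox> element with the four pixel coordinates that the loader reads (file and label names below are just illustrative):

<annotation>
  <filename>ref209_001.jpg</filename>
  <object>
    <name>ref209</name>
    <bndbox>
      <xmin>48</xmin>
      <ymin>240</ymin>
      <xmax>195</xmax>
      <ymax>371</ymax>
    </bndbox>
  </object>
</annotation>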


Searching the web, I have found that it may be related to the data generator: some people say that there should be an infinite loop (while True: or while 1:) inside the corresponding function in the preprocessing.py file (see, for example, the related Stack Overflow discussion).
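As I understand it, that advice refers to the classic Python-generator pattern for fit_generator; a minimal sketch (array names are made up) would be:

import numpy as np

def batch_generator(images, labels, batch_size):
    # fit_generator keeps pulling batches forever, so the loop must never end
    while True:
        idx = np.random.randint(0, len(images), size=batch_size)
        yield images[idx], labels[idx]

However, judging from the traceback, this repository feeds data through a keras.utils.Sequence-style class (the __getitem__ in preprocessing.py is indexed via _SHARED_SEQUENCES), which Keras indexes directly and never exhausts, so that advice may not apply here.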

I would appreciate any kind of help, because I would like to use this YOLO implementation for a university project and I am in a hurry.

Many thanks in advance.

rodrigo2019 commented 5 years ago

Looks like you have some bad samples. Put print(self._images[i]['filename']) before this line; after that, check the annotation corresponding to that file.
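Something like this (a sketch; the surrounding lines of __getitem__ are taken from your traceback, and flush=True matters because the generator runs in worker processes that may buffer stdout):

# inside __getitem__ in keras_yolov2/preprocessing.py (sketch)
train_instance = self._images[i]                # however the instance is fetched
print(train_instance['filename'], flush=True)   # identify the sample being processed
img, all_objs = self.aug_image(train_instance, jitter=self._jitter)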

rodrigo2019 commented 5 years ago

Probably some annotation is missing the xmin value.

ikerodl96 commented 5 years ago

Hello @rodrigo2019

I have just tried what you suggested, but no filename is printed to the console. The usual traceback appears... but nothing more.

Anyway, I have manually revised all the XML annotation files for each of the images, and all the bounding-box tags (<xmin>, <ymin>, <xmax>, <ymax>) have their corresponding value... The thing is that many of them are 0 (which is correct). Could the problem be related to that? Additionally, I am annotating all the images using Labelbox, and I have implemented some checks in that tool to verify that all the annotations are correctly registered.

Many thanks for the fast answers and checks that you are giving to me. I really appreciate your help.

rodrigo2019 commented 5 years ago

I have just tried what you suggested, but no filename is printed to the console. The usual traceback appears... but nothing more.

Try writing it to a file instead: wrap the call in a try/except, and in the except clause write the file name.
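For example (a sketch around the failing call in __getitem__; the log path is just an example):

try:
    img, all_objs = self.aug_image(train_instance, jitter=self._jitter)
except KeyError:
    with open('bad_samples.log', 'a') as log:
        log.write(self._images[i]['filename'] + '\n')
    raise  # re-raise so training still stops on the bad sample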

ikerodl96 commented 5 years ago

Hello again @rodrigo2019,

I have done what you mentioned, but no file is generated inside the except clause. This is very strange and time-consuming...

ikerodl96 commented 5 years ago

Hi @rodrigo2019,

I finally found the error. Among the images and annotations there was one that accidentally had a label in polygon format, which is wrong since all of them should be rectangles (bounding boxes), as I specified when creating the labeling template in Labelbox.
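In case it helps someone else, a small script along these lines (the annotation directory path is illustrative) can flag any object whose <bndbox> is missing one of the four corner tags, which is exactly what a polygon label looks like to the loader:

import os
import xml.etree.ElementTree as ET

ANN_DIR = 'annotations/'  # illustrative path to the VOC XML files
REQUIRED = ('xmin', 'ymin', 'xmax', 'ymax')

for fname in sorted(os.listdir(ANN_DIR)):
    if not fname.endswith('.xml'):
        continue
    root = ET.parse(os.path.join(ANN_DIR, fname)).getroot()
    for obj in root.iter('object'):
        bndbox = obj.find('bndbox')
        # a polygon-style label has no <bndbox>, or lacks one of the corner tags
        if bndbox is None or any(bndbox.find(tag) is None for tag in REQUIRED):
            print('bad object in', fname)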

I have manually fixed that, and now the training process seems to work. We will see the performance... I hope it is not too bad...

Thank you @rodrigo2019 for all your rapid replies and for sharing this nice work.

Best regards