Weights file not being read properly during training

mon28 commented 4 years ago

Hi,

Thanks for this Python implementation. I just tried it out on my custom data files with the initial weights file you provide. I also tried it with the weights file yolov4.conv.137, which is distributed with darknet.

With your file, I get the error below:

  File "train.py", line 10, in <module>
    pre_trained_weights="yolov4.weights",
  File "/data/mtare/darknet-py/venv/lib/python3.6/site-packages/yolov4/tf/__init__.py", line 152, in train
    self.load_weights(pre_trained_weights)
  File "/data/mtare/darknet-py/venv/lib/python3.6/site-packages/yolov4/tf/__init__.py", line 322, in load_weights
    utils.load_weights(self.model, weights_path)
  File "/data/mtare/darknet-py/venv/lib/python3.6/site-packages/yolov4/core/utils.py", line 135, in load_weights
    assert len(wf.read()) == 0, "failed to read all data"
AssertionError: failed to read all data

With yolov4.conv.137, I get the following error:

  File "train.py", line 10, in <module>
    pre_trained_weights="yolov4.conv.137",
  File "/data/mtare/darknet-py/venv/lib/python3.6/site-packages/yolov4/tf/__init__.py", line 152, in train
    self.load_weights(pre_trained_weights)
  File "/data/mtare/darknet-py/venv/lib/python3.6/site-packages/yolov4/tf/__init__.py", line 324, in load_weights
    self.model.load_weights(weights_path).expect_partial()
  File "/data/mtare/darknet-py/venv/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 250, in load_weights
    return super(Model, self).load_weights(filepath, by_name, skip_mismatch)
  File "/data/mtare/darknet-py/venv/lib/python3.6/site-packages/tensorflow/python/keras/engine/network.py", line 1259, in load_weights
    with h5py.File(filepath, 'r') as f:
  File "/data/mtare/darknet-py/venv/lib/python3.6/site-packages/h5py/_hl/files.py", line 408, in __init__
    swmr=swmr)
  File "/data/mtare/darknet-py/venv/lib/python3.6/site-packages/h5py/_hl/files.py", line 173, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

Technically, both should work, as they do with the C implementation of darknet yolov4. Please help.

hhk7734 commented 4 years ago

Hi @mon28

yolov4.weights contains weights trained with data/classes/coco.names. The number of classes may be different from yours, so remove pre_trained_weights or set it to None.

This module checks whether the file is named xxx.weights or not, so I renamed yolov4.conv.137 to yolov4.conv.137.weights, but loading still failed. Perhaps yolov4.conv.137 is not in the .weights format.
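One way to check what a weights file actually contains is to read its header directly. The sketch below assumes the usual Darknet layout (three int32 version fields, then a `seen` counter that is 64-bit when major * 10 + minor >= 2, then raw float32 values). `inspect_darknet_weights` is a hypothetical helper, not part of this module, and the demo file is synthetic.

```python
import os
import struct
import tempfile


def inspect_darknet_weights(path):
    """Return (major, minor, revision, seen, n_floats) for a Darknet-style file.

    Assumed layout: 3 x int32 version, then `seen` (uint64 if
    major * 10 + minor >= 2, else uint32), then float32 weights.
    """
    with open(path, "rb") as wf:
        major, minor, revision = struct.unpack("<3i", wf.read(12))
        if major * 10 + minor >= 2:
            (seen,) = struct.unpack("<Q", wf.read(8))
        else:
            (seen,) = struct.unpack("<I", wf.read(4))
        header_size = wf.tell()
    # Everything after the header is treated as 4-byte floats.
    n_floats = (os.path.getsize(path) - header_size) // 4
    return major, minor, revision, seen, n_floats


# Synthetic demo file: version 0.2.5, seen=0, ten float32 values.
with tempfile.NamedTemporaryFile(suffix=".weights", delete=False) as tmp:
    tmp.write(struct.pack("<3iQ", 0, 2, 5, 0))
    tmp.write(struct.pack("<10f", *[float(i) for i in range(10)]))
info = inspect_darknet_weights(tmp.name)
os.remove(tmp.name)
print(info)
```

A file holding more floats than the model consumes leaves bytes unread, which is the condition the `failed to read all data` assertion in the traceback checks.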

hhk7734 commented 4 years ago

P.S. This module is not yet fully tested, so there may be bugs.

mon28 commented 4 years ago

> yolov4.weights contains weights trained with data/classes/coco.names. The number of classes may be different from yours, so remove pre_trained_weights or set it to None.

Oh, I see! That makes sense for prediction, but training should support a different number of classes too. That is how transfer learning would work. I hope this can be fixed soon.

Thanks for the quick response and the temporary workaround. It does work, but the training finished very quickly with no stats on the console. Is there any way to make it print something? And what about adjusting hyperparameters such as epochs, learning rate, and input width and height?

mon28 commented 4 years ago

Actually my training exited right after creating the very first checkpoint with no stats or anything in the logs. @hhk7734 could you help with this?

hhk7734 commented 4 years ago

hahah... there are concepts that I don't know because I only studied machine learning for a month.

When training, it seems that YOLO can load the weights for the rest of the network, except for the part that changes with the number of classes, but I am not sure.

https://github.com/hhk7734/tensorflow-yolov4/blob/19ae0babc461b1a607bd79a4cc9fd9fb50d27046/py_src/yolov4/tf/__init__.py#L119-L148

You can set the parameters in the part linked above. On your system the file may be at ~/.local/lib/python3.6/site-packages/yolov4/tf/__init__.py.

This will be changed later so that they can be passed as parameters.

mon28 commented 4 years ago

Alright, I can change them there, but the training apparently is not starting. It won't print a single epoch.

hhk7734 commented 4 years ago

Can you share your training data?

mon28 commented 4 years ago

@hhk7734 I can't share the image files as they are confidential, but I can share the other files. Where would it be convenient to share them privately?

hhk7734 commented 4 years ago

Send them to hhk7734@gmail.com. Attach the files or add a download link.

mon28 commented 4 years ago

Have sent them to you. Please help resolve this.

hhk7734 commented 4 years ago

The yolov4.core.dataset.Dataset class uses a format like:

/home/hhk7734/Desktop/coco/images/val2017/000000356094.jpg 185,35,342,384,0 405,117,471,335,0 304,84,459,124,34 402,228,424,253,35

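but a YOLO dataset looks like this. train.txt lists only the image paths:

/home/hhk7734/Desktop/coco/images/val2017/000000356094.jpg

Each image has a label file with the same name next to it:

/home/hhk7734/Desktop/coco/images/val2017/000000356094.jpg /home/hhk7734/Desktop/coco/images/val2017/000000356094.txt

and the label file holds the boxes:

0 0.502445652174 0.51884057971 0.0244565217391 0.00579710144928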

hhk7734 commented 4 years ago

To fix this, either the Dataset class should be modified to recognize the YOLO data format, or you can change your data to the COCO data format.
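To make the difference concrete, here is a minimal sketch of a parser for the single-file format shown above. `parse_annotation_line` is a hypothetical helper, not the module's own code, and it assumes that paths contain no spaces and that box values are integers.

```python
def parse_annotation_line(line):
    """Split '<path> x1,y1,x2,y2,cls ...' into (path, [(x1, y1, x2, y2, cls), ...])."""
    parts = line.split()
    if not parts:
        raise ValueError("empty annotation line")
    path, boxes = parts[0], []
    for field in parts[1:]:
        values = field.split(",")
        if len(values) != 5:
            raise ValueError(f"expected 5 comma-separated values, got {field!r}")
        boxes.append(tuple(int(v) for v in values))
    return path, boxes


coco_line = (
    "/home/hhk7734/Desktop/coco/images/val2017/000000356094.jpg "
    "185,35,342,384,0 405,117,471,335,0 304,84,459,124,34 402,228,424,253,35"
)
yolo_line = "/home/hhk7734/Desktop/coco/images/val2017/000000356094.jpg"

path, boxes = parse_annotation_line(coco_line)
_, no_boxes = parse_annotation_line(yolo_line)
print(len(boxes), len(no_boxes))
```

A line that contains only a path parses to an empty box list, so a YOLO-style train.txt gives this kind of reader no labels at all.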

mon28 commented 4 years ago

Could you explain the yolov4 data structure again? I don't get where the label id for each bounding box should be written. Also what is the overall structure? Mine is:

./data
    |
    | ------ obj.names (names of all classes)
    | ------ obj.data
    | ------ train.txt
    | ------ test.txt
    | ------ /obj
        | ----- all jpg files
        | ----- all txt files

in the format for each image where all of these are relative to width and height of the image

I can change it to yolov4 format if you could clarify it like above.

hhk7734 commented 4 years ago

Oh.. sorry, I accidentally deleted a part of your comment. Anyway

./data
    | ------ obj.names (names of all classes)
    | ------ train.txt
    | ------ /obj
        | ----- all jpg files

train.txt

<file path> <left top x pixel>,<left top y pixel>,<right bottom x pixel>,<right bottom y pixel>,<#class> <left top x pixel>,<left top y pixel>,<right bottom x pixel>,<right bottom y pixel>,<#class>  ...

train.txt + obj/xx.txt == coco_train.txt

mon28 commented 4 years ago

Are x, y, width, and height here relative to the image width/height, or absolute?

hhk7734 commented 4 years ago

https://github.com/hhk7734/tensorflow-yolov4/blob/19ae0babc461b1a607bd79a4cc9fd9fb50d27046/scripts/coco_annotation.py#L73-L81 Using scripts/coco_convert.py and scripts/coco_annotation.py, you can convert coco_data.json to coco_train.txt.

I edited my comment above to make it clearer.

hhk7734 commented 4 years ago

Example:

For a 640x480 image, the top-left corner is (0, 0) and the bottom-right corner is (640, 480). If the ROI is (left, right, top, bottom) = (400, 450, 140, 250) and the class number for person is 0, the line is:

person.jpg 400,140,450,250,0
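Going the other way, a YOLO label row can be turned into this pixel-corner format with a little arithmetic. The sketch below assumes Darknet's usual label layout (class, x center, y center, width, height, all relative to the image size). The function names are hypothetical and the image size is passed in by hand rather than read from the file.

```python
def yolo_to_corner_box(cx, cy, w, h, img_w, img_h):
    """Convert a normalized YOLO box (center x/y, width, height) to pixel corners."""
    left = round((cx - w / 2) * img_w)
    top = round((cy - h / 2) * img_h)
    right = round((cx + w / 2) * img_w)
    bottom = round((cy + h / 2) * img_h)
    return left, top, right, bottom


def annotation_line(image_path, yolo_rows, img_w, img_h):
    """Build '<path> left,top,right,bottom,class ...' from (class, cx, cy, w, h) rows."""
    fields = [image_path]
    for cls, cx, cy, w, h in yolo_rows:
        box = yolo_to_corner_box(cx, cy, w, h, img_w, img_h)
        fields.append(",".join(str(v) for v in (*box, cls)))
    return " ".join(fields)


# The person example: ROI spanning x 400..450 and y 140..250 in a 640x480 image.
row = (0, 425 / 640, 195 / 480, 50 / 640, 110 / 480)
line = annotation_line("person.jpg", [row], 640, 480)
print(line)  # person.jpg 400,140,450,250,0
```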

mon28 commented 4 years ago

Alright, got it. Thanks :)

hhk7734 commented 4 years ago

Add Yolo train dataset format support: 7bf3527

hhk7734 commented 4 years ago

core: utils: implement a way to partially load weights: c4f092a
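The commit itself is not reproduced here. Purely as an illustration of the idea of partial loading, the sketch below fills layers in order from a flat list of values and stops as soon as the data runs out, leaving later layers untouched. `load_partial`, the layer names, and the sizes are all hypothetical and do not reflect the module's actual implementation.

```python
def load_partial(flat_weights, layer_sizes):
    """Fill layers in order from a flat weight list; stop when the data runs out.

    layer_sizes: list of (layer_name, number_of_values) in file order.
    Returns (loaded, n_consumed), where `loaded` maps layer_name -> values.
    """
    loaded, offset = {}, 0
    for name, size in layer_sizes:
        if offset + size > len(flat_weights):
            break  # not enough data left for this layer: keep its initial weights
        loaded[name] = flat_weights[offset:offset + size]
        offset += size
    return loaded, offset


# Hypothetical toy network: two backbone layers and a class-dependent head.
layers = [("conv_0", 4), ("conv_1", 6), ("head", 8)]
backbone_only = [0.1] * 10  # stands in for a file that holds only the first layers
loaded, used = load_partial(backbone_only, layers)
print(sorted(loaded), used)
```

With this approach the head keeps its freshly initialized weights, which is what transfer learning to a different number of classes needs.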
