luxonis / depthai-ml-training

Some Example Neural Models that we've trained along with the training scripts
MIT License

YOLOv6 notebook not working due to new release #48

Closed · blaz-r closed this issue 1 year ago

blaz-r commented 1 year ago

Hello,

I've been trying to use the YOLOv6 notebook and everything was fine until I got to training, where I get the error Can't get attribute 'SimConv' on <module 'yolov6.layers.common'

When I checked the YOLOv6 repo, I found the following issue: https://github.com/meituan/YOLOv6/issues/799. It turns out a new version, v0.4.0, was released yesterday, so the weights downloaded in the notebook are no longer correct.

I did try changing the link to download the 0.4.0 weights. The model then trained, but its output can't be converted in blobconverter: [screenshot of the conversion error]

So I figured that something added in 0.4.0 is not supported, which is why I'm reporting it here. In the meantime, if anyone else is facing the same issue, just check out tag 0.3.0 in the notebook and everything should work as before until release 0.4.0 is supported.
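For reference, a rough sketch of that workaround outside the Colab (the repo URL and the 0.3.0 weight link are the ones that appear later in this thread; paths are examples and may need adjusting):

git clone https://github.com/meituan/YOLOv6.git
cd YOLOv6
git checkout 0.3.0
pip install -r requirements.txt
mkdir -p weights
wget https://github.com/meituan/YOLOv6/releases/download/0.3.0/yolov6n.pt -O ./weights/yolov6n.pt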

tersekmatija commented 1 year ago

Thanks for reporting @blaz-r. Seems like YoloV6 made a new 0.4.0 release recently @HonzaCuhel.

"Code reconstruction and normalization of convolution operators" might also cause some import issues. We'll investigate and try to add support for it.

HonzaCuhel commented 1 year ago

@blaz-r Thank you very much for reporting! For now we have updated the YoloV6 Colab training notebook so that it works. In the meantime, we are actively working on supporting YoloV6 R4 weights. We will keep you updated.

KFStrata commented 1 year ago

I have switched over to the 0.3.0 release and am getting the same message with custom data. If I use the default weights, the conversion tool works. Attached is the error message I receive.

[screenshot of the error message]

tersekmatija commented 1 year ago

@KFStrata is this model also trained with 0.3.0? If the model was trained and exported using 0.3.0, conversion should work; we validate it with this notebook. CC @HonzaCuhel for more info.

KFStrata commented 1 year ago

Yes. I hit this issue today using the 0.4.0 release. I found this thread, rolled back to the 0.3.0 release, and retrained a new model for only 10 epochs to confirm a solution, but hit the same issue. I can try 0.2.0 and see if that works, but conversion succeeds with the weights from the site and fails with custom models.

I ran a setup on my machine similar to the instructions in your Google Colab.

tersekmatija commented 1 year ago

@KFStrata I just trained a model using the latest updated Colab which checks out the 0.3.0 release. Can you try doing the same and exporting that model?

As a side note: we have investigated the 0.4.0 version and will be integrating it into the tools, but no full ETA yet.

KFStrata commented 1 year ago

I don't use Colab due to the nature of the datasets I train on. Can you confirm that a local machine can output correctly? I will follow the instructions from your Colab on my local machine and try to reproduce your results; however, I don't use a VOC dataset but rather a modified COCO-style setup. Will that cause any issues?

KFStrata commented 1 year ago

For anyone else who ends up here with this conversion issue:

I followed the instructions in the Colab and also hit the AttributeError about np.int. I went to yolov6/data/datasets.py, replaced np.int with np.int_, and was then able to train and convert successfully.
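If you'd rather script that edit than do it by hand, something like the following should be equivalent (GNU sed assumed; the word boundary keeps np.int_, np.int32, etc. untouched):

sed -i 's/np\.int\b/np.int_/g' yolov6/data/datasets.py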

tersekmatija commented 1 year ago

Hey, glad to see you resolved the issue!

And yeah, as long as the datasets are in the right format, it shouldn't matter. Also, you can always just download the notebooks and run them locally using something like Jupyter Lab.

As for the np.int_/np.int issue, it's likely related to the NumPy version; upgrading or downgrading usually helps.
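An alternative to editing datasets.py is pinning NumPy below 1.24, the release that removed the np.int alias:

pip install "numpy<1.24"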

KFStrata commented 1 year ago

I can no longer convert using the same details as last time. I am working back through it in case of user error, but I have not changed anything since my last update. When I go to add the latest trained model, the portal either has me try converting with the 'Yolov6 (R1)' option, or, if I use the same model with the R2/R3 options, it tells me to use R1.

tersekmatija commented 1 year ago

CC @HonzaCuhel in case the latest deploy broke something. Would it be possible to share the .pt weights, @KFStrata? You can also send them to us by email if there are privacy concerns; it would be the easiest way for us to debug.

KFStrata commented 1 year ago

Yes I can. Which email would you like me to send it to?

tersekmatija commented 1 year ago

Please send to matija@luxonis.com and jan.cuhel@luxonis.com and we will investigate.

KFStrata commented 1 year ago

I have sent you a copy of the .pt weights.

HonzaCuhel commented 1 year ago

Hi @KFStrata,

thank you for sending us the weights. I have looked at them and found that the conversion is failing because the detection head (the Detect class [link]) is missing the cls_preds attribute. So my question to you is: have you modified the model in any way?

Best, Jan

KFStrata commented 1 year ago

Hi Jan,

I have not. I will send you another version today to confirm whether you see this issue in the latest model I am running. I am not familiar with the attribute you mentioned.

tersekmatija commented 1 year ago

Could you also share the training code with us @KFStrata ?

HonzaCuhel commented 1 year ago

Hi @KFStrata,

thank you for sending us the weights again. I've investigated the weights and found that your model uses the effidehead_fuseab head [link], whereas our converter expects effidehead [link]. I compared the two detection heads: instead of cls_preds and reg_preds, effidehead_fuseab uses cls_preds_af, reg_preds_af, cls_preds_ab, and reg_preds_ab, and only cls_preds_af and reg_preds_af are used during inference. I therefore edited the conversion locally and managed to generate a .blob file successfully. Here is the generated .blob file. Could you please try it and confirm whether it works correctly?

Best, Jan
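For anyone else hitting the same mismatch, here is a rough way to check which head variant a checkpoint uses. This is only a sketch: it assumes the YOLOv6 repo is on the import path so the checkpoint can be unpickled, that the model is stored under the 'model' key, and that the head lives at model.detect; custom_yolov6n.pt is a placeholder filename.

python -c "
import torch
model = torch.load('custom_yolov6n.pt', map_location='cpu')['model']
head = model.detect
print('effidehead_fuseab (cls_preds_af / reg_preds_af)' if hasattr(head, 'cls_preds_af') else 'effidehead (cls_preds / reg_preds)')
"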

KFStrata commented 1 year ago

I have passed this on to be tested.

Do you know how this change came about? I am currently setting up a new computer to train on and am more than happy to share the steps I follow to get it working, as this will cause some headaches for me if I don't get it solved. Steps followed below:

- Install Ubuntu 22.04
- Install CUDA 12.1
- Install cuDNN for CUDA 12.x
- Create a Python venv
- Follow these install instructions, modified to suit my system: https://techzizou.com/install-cuda-and-cudnn-on-windows-and-linux/#linux (I can't install gcc +6 this time, so I am going for a workaround - different hardware, different rules, etc.)
- Follow my solution above to get training working with YOLOv6 rev 0.3.0. I am not editing any of the main code apart from the one script I mentioned above.

HonzaCuhel commented 1 year ago

Thank you! Please let us know once you know the results!

Do you know how this change came to be?

Well, the YoloV6 repo offers several different model variants, so I suppose you chose the one that uses effidehead_fuseab. Could you please share with us the specific name of the model that you used for training?

Best, Jan

KFStrata commented 1 year ago

Yolov6n

HonzaCuhel commented 1 year ago

Ah, I see. Could you please also share the details of the training with us, e.g. the exact command you used to train the network?

Thanks, Jan

KFStrata commented 1 year ago

python3 tools/train.py --batch 32 --conf configs/yolov6n_finetune.py --data data/wof.yml --fuse_ab --device 0 --epochs 10000

HonzaCuhel commented 1 year ago

Thank you very much! I can see that you used the --fuse_ab flag, which changes the detection head. As soon as you know the results of the tests, please let us know so that we can update our tools!

Thanks, Jan
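For anyone else hitting this before the tools were updated: since the --fuse_ab flag is what switches the model to the effidehead_fuseab head, retraining without that flag should in principle produce the standard effidehead that the converter already handles (not verified in this thread), e.g.:

python3 tools/train.py --batch 32 --conf configs/yolov6n_finetune.py --data data/wof.yml --device 0 --epochs 10000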

HonzaCuhel commented 1 year ago

Hi @blaz-r @KFStrata ,

I apologize for the delay; we have just released a new version of the tools that supports conversion of YoloV6 R4 models.

Best, Jan

KFStrata commented 10 months ago

Hi all,

I am hitting this issue again on a new machine. I am now running multiple GPUs. The commands below are what I used to start training, using the yolov6n.pt file on 0.4.0.

python -m torch.distributed.run --nproc_per_node 8 tools/train.py --batch 128 --conf configs/yolov6n_finetune.py --data data/Foundations.yaml --fuse_ab --device 0,1,2,3,4,5,6,7

python -m torch.distributed.run --nproc_per_node 8 tools/train.py --resume

[three screenshots of the errors, taken 2024-01-05]

tersekmatija commented 10 months ago

Hi @KFStrata ,

Can you please share the weights with @HonzaCuhel so we can take a look? It should be an easy fix; my guess is the weights are stored slightly differently, which makes the tools fail.

tersekmatija commented 10 months ago

If possible, it would also be good to open a new issue, since this is related to multi-GPU training rather than the new release. Thanks!

KFStrata commented 10 months ago

I think I found the issue.

wget https://github.com/meituan/YOLOv6/releases/download/0.3.0/yolov6n.pt -O ./weights/yolov6n.pt

This came from one of the YOLOv6 tutorials in the repo itself, but I cannot find it now. I will try using the latest weights and see what happens.

KFStrata commented 10 months ago

So using the 0.4.0 weights threw the old SimConv error. There is another support ticket for that and it's not fixed yet. I am back to using 0.3.0 and it's running. @HonzaCuhel, can I email you my .pt privately?

HonzaCuhel commented 10 months ago

Hi,

yes, please send me the .pt weights so that I can have a look. Thank you!

Best, Jan

KFStrata commented 9 months ago

Can I get your email?

HonzaCuhel commented 9 months ago

Yes, sure. My email is jan.cuhel@luxonis.com.