huanidz / scaled-alpr-unconstrained

(PyTorch) Alpr Unconstrained - Warp Planar Object Detection (WPODNet)

Loss goes to nan #1

Closed SMDIndomitable closed 2 weeks ago

SMDIndomitable commented 1 month ago

Hi, I tried training a model with the training script, but I either end up with loss: nan or a very bad F1-score. Do you know what the reason could be? Also, are there any pretrained weights I can use for testing?

huanidz commented 1 month ago

@SMDIndomitable Can you provide your training command? Also, what does your dataset look like (number of samples, diversity, etc.)?

One possible cause of the NaN issue is the precision type. My code is expected to work on NVIDIA Ampere GPUs and newer. If your GPU architecture is older than that, please reply with the details and I will guide you through updating the code.

Sorry, I can't provide pretrained weights right now due to privacy-related reasons. For more efficient training, you could try using a pretrained backbone from timm and adapting it to this code's architecture.
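As a rough illustration (not the exact backbone or integration used for my model), pulling a pretrained feature extractor from timm could look like this; the model name, feature level, and input size are placeholder assumptions:

# Minimal sketch: load a pretrained timm backbone as a feature extractor.
# The model name ("resnet18") and the chosen feature level are assumptions,
# not the backbone actually used for this repo's final model.
import timm
import torch

backbone = timm.create_model(
    "resnet18",
    pretrained=True,
    features_only=True,   # return intermediate feature maps instead of classification logits
    out_indices=(2,),     # stride-8 feature map; pick whichever stride your detection head expects
)

x = torch.randn(1, 3, 256, 256)
feats = backbone(x)[0]    # (1, 128, 32, 32) for resnet18 at stride 8
print(feats.shape)

The WPOD-style detection head (object probability plus affine parameters per cell) would then be attached on top of feats in place of the default backbone.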

SMDIndomitable commented 1 month ago

Hello, thank you for the quick reply.

I am just using this command: python train.py --data path/to/dataset_folder/

My dataset: roughly 9.5k training images and roughly 3.5k eval images.

GPU: H100

I managed to get it training today without the NaN issue. However, the F1 score is quite low. Is this supposed to be normal?

After training for 22 epochs,

[TRAIN] IoU: 0.0240, F1_Score: 0.0408
[TEST ] IoU: 0.0295, F1_Score: 0.0501

Also, sorry but what is timm?

huanidz commented 1 month ago

@SMDIndomitable It sounds like you're training from scratch with the default architecture from the paper. In the paper, they used a small dataset. If I remember correctly, the author expects the loss to be around 2.0-3.0 to be considered close to convergence (the loss for other scales may vary a little).

timm is a repository of pretrained image models trained on many large-scale datasets.

My training procedure and final model used a backbone from timm. For this repo, I did train a working model from scratch without any modification, but it took a long time to converge.

If you want to stick with this repo as-is, please consider letting it train a bit longer or using a different scale/input size.

SMDIndomitable commented 1 month ago

I see, do you still remember how many epochs you trained it for before it converged?

If I were to use a pretrained image model, would it work with this repo?

Lastly, what would be a good input size? Should I follow YOLO's 640x640?

Thank you so much, really appreciate your insights on this!

SMDIndomitable commented 1 month ago

Epoch 2/200

Batch 011, Loss: 38.4052, Time: 1501.4329 ms, LR: 0.0003
Batch 021, Loss: 35.8887, Time: 1365.1630 ms, LR: 0.0003
Batch 031, Loss: 32.6055, Time: 1346.5857 ms, LR: 0.0003
Batch 041, Loss: 34.3496, Time: 1372.9401 ms, LR: 0.0003
Batch 051, Loss: 34.1344, Time: 1379.9658 ms, LR: 0.0003
Batch 061, Loss: 30.9654, Time: 1339.9464 ms, LR: 0.0003
Batch 071, Loss: 35.8543, Time: 1368.0040 ms, LR: 0.0003
Batch 081, Loss: 32.0710, Time: 1375.1762 ms, LR: 0.0003
Batch 091, Loss: 34.5599, Time: 1363.6372 ms, LR: 0.0003
Batch 101, Loss: 33.0427, Time: 1351.2963 ms, LR: 0.0003
Batch 111, Loss: 31.3551, Time: 1442.6491 ms, LR: 0.0003
Batch 121, Loss: 33.5702, Time: 1375.8085 ms, LR: 0.0003
Batch 131, Loss: 32.3437, Time: 1341.2880 ms, LR: 0.0003
Batch 141, Loss: nan, Time: 1373.5526 ms, LR: 0.0003
Batch 151, Loss: nan, Time: 1360.2784 ms, LR: 0.0003
Batch 161, Loss: nan, Time: 1363.4261 ms, LR: 0.0003
Batch 171, Loss: nan, Time: 1344.6392 ms, LR: 0.0003
Batch 181, Loss: nan, Time: 1374.4329 ms, LR: 0.0003
Batch 191, Loss: nan, Time: 1435.9650 ms, LR: 0.0003
Batch 201, Loss: nan, Time: 1364.9449 ms, LR: 0.0003
Batch 211, Loss: nan, Time: 1391.6659 ms, LR: 0.0003
Batch 216, Loss: nan, Time: 680.1543 ms, LR: 0.0003

Unfortunately, it went to NaN with these parameters: python train.py --data /workspace/volume//License --lr 0.001 --bs 32 --size 640 --scale base

huanidz commented 1 month ago

@SMDIndomitable You can choose any size, but I recommend lowering it since this model is designed to be lightweight. I often used 256 or 384; that should be enough.

The input image should be the bounding-box crop around the vehicle (not the full image with lots of background and a mix of multiple vehicles, objects, etc.). Ideally this should be the output of an object detection model.
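For illustration, a minimal sketch of preparing such a crop from a detector's output (the file names, box coordinates, and target size are placeholder values, not anything from this repo):

# Minimal sketch: crop the vehicle box predicted by an object detector out of
# the full frame, then resize it for the plate detector.
import cv2

frame = cv2.imread("frame.jpg")              # full scene image (placeholder path)
x1, y1, x2, y2 = 420, 310, 980, 760          # vehicle box from the detector, in pixels (placeholder)

vehicle = frame[y1:y2, x1:x2]                # keep only the vehicle region
vehicle = cv2.resize(vehicle, (256, 256))    # e.g. 256x256, as suggested above
cv2.imwrite("vehicle_crop.jpg", vehicle)

If the plate polygon was annotated on the full frame, it also has to be shifted by (x1, y1) and re-normalized to the crop's width and height so the labels still line up with the cropped image.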

As for your error, I suspect it is related to these lines: https://github.com/huanidz/scaled-alpr-unconstrained/blob/cb99d5b10fe6f3b969e4a3c52d944ee72cb45d5b/train.py#L104-L110

What you can try:

SMDIndomitable commented 1 month ago

Alright, thank you once again :D. I decided to finish the 200 epochs with these settings (lr=0.001, bs=16, epochs=200, eval_after=1, size=384, scale='base') before I experiment further. I will let you know how it turns out.

Edit: I read somewhere that training from scratch requires 20k epochs, is that true?

huanidz commented 1 month ago

@SMDIndomitable No, you don't need 20k epochs; that is a ridiculous amount. The number of epochs depends on the dataset size and the model itself, but it's nowhere near that many :smiley:

It's hard to estimate for your case, but I suggest somewhere between 200 and 300 as a good starting range. If you have a good dataset, maybe 50-100 is enough.

SMDIndomitable commented 1 month ago

I see, thanks for the advice. Anyway, here is the result after running 261 epochs on the small model with lr=0.0003, bs=32, epochs=2000, eval_after=1, size=256, scale='small', resume_from=None:

[TEST ] IoU: 0.0352, F1_Score: 0.0593
Higher F1 score found. Saving model...
Epoch 261/2000

I am going to continue training and see. Maybe the dataset I am using is not suitable; do you think reducing the number of images will help? My images are mostly cropped vehicles with the license plate annotated as a 4-point polygon.

huanidz commented 1 month ago

@SMDIndomitable If you can, please let me see some of your training samples (image and polygon coordinates).

SMDIndomitable commented 1 month ago

(attached: training sample images)

These are some of the training images; the green boxes indicate the polygon.

SMDIndomitable commented 1 month ago

(attached: evaluation sample images)

These are some of the evaluation images; the green boxes indicate the polygon. I had to erase the license plates for privacy reasons.

SMDIndomitable commented 1 month ago

I used these images to train a YOLOv8 segmentation model, and I do realize there is a lot of background environment, as you mentioned. Should I recreate a dataset that only contains cropped images of the vehicles?

huanidz commented 1 month ago

@SMDIndomitable Yes, you should. And please verify the coordinates are in this format (all the x values first, then all the y values):

# The order of 1-->4 is (x1 - y1: top left, x2 - y2: top right, x3 - y3: bottom right, x4 - y4: bottom left)
# x1, x2, x3, x4, y1, y2, y3, y4
0.497917, 0.677083, 0.670833, 0.489583, 0.734737, 0.747368, 0.844211, 0.831579

SMDIndomitable commented 1 month ago

Yep, the coordinates should be correct; I drew the boxes in the x1,x2,x3,x4,y1,y2,y3,y4 format. The numbers are in normalized form, just like YOLO, right?

huanidz commented 1 month ago

@SMDIndomitable Different YOLO versions may use different formats, such as (x-center, y-center, w, h), so it depends. Basically you just need to make sure your labels are in this repo's format; otherwise you can write a basic script to convert them, like the sketch below.
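For example, a small conversion sketch (the helper name, pixel polygon, and image size below are illustrative assumptions, not code from this repo):

# Minimal sketch: convert a 4-point plate polygon in pixel coordinates,
# ordered top-left, top-right, bottom-right, bottom-left, into this repo's
# normalized "x1, x2, x3, x4, y1, y2, y3, y4" label line.
def polygon_to_repo_line(points, img_w, img_h):
    xs = [x / img_w for x, _ in points]
    ys = [y / img_h for _, y in points]
    return ", ".join(f"{v:.6f}" for v in xs + ys)

# Illustrative values, chosen so the output reproduces the sample line above
# (the 480x950 crop size is an assumption, not a real image from the dataset):
print(polygon_to_repo_line([(239, 698), (325, 710), (322, 802), (235, 790)], 480, 950))
# 0.497917, 0.677083, 0.670833, 0.489583, 0.734737, 0.747368, 0.844211, 0.831579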

By the way, feel free to add me on other chat platforms like Discord, Skype, etc. (huanidz - huannguyena2@gmail.com) if you still have questions.

SMDIndomitable commented 1 month ago

Oh, what I mean is that the values are decimals. So I assume the values are x / image_width and y / image_height to obtain the normalized values. Sure, I will add you on Discord if you're fine with that.