Inconsistent metrics about LSTR reproduce on CULane dataset

sunpeng981712364 commented 2 years ago

I see that the LSTR metrics are reported without mixed precision. Do you know the reason? @kalkun @voldemortX @cedricgsh @LittleJohnKhan

voldemortX commented 2 years ago

@sunpeng981712364 I have not experienced such a significant performance drop before, so you might want to check whether the same drop happens for other methods. However, I do have some clues about why mixed precision fails. LSTR uses a projection-aware 3rd-order polynomial; combined with L1 loss and Hungarian matching, the precision requirement is very high, and sometimes I even fail to complete training due to gradient explosion.

voldemortX commented 2 years ago

So by intuition (and given that mixed precision does not really accelerate training with this small network), we always use full precision for these kinds of tricky methods, sometimes including RESA (recently, @LittleJohnKhan experienced gradient explosion when using mixed precision for RESA-ERFNet).
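To give a rough feel for the precision concern, here is a minimal sketch that evaluates a generic cubic polynomial once in ordinary Python floats and once with every intermediate value rounded to IEEE 754 half precision. It is only an illustration: the coefficients and the input `Y` are made-up values, the polynomial is not LSTR's actual projection-aware curve, and real mixed-precision training keeps some operations in full precision.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def cubic(coeffs, y, cast=float):
    """Evaluate a*y^3 + b*y^2 + c*y + d with Horner's rule.

    `cast` is applied to every input and intermediate result, so passing
    `to_fp16` simulates doing the whole evaluation in half precision.
    """
    acc = cast(0.0)
    for c in coeffs:
        acc = cast(cast(acc * cast(y)) + cast(c))
    return acc


# Hypothetical coefficients (a, b, c, d) and input, chosen only for illustration.
COEFFS = (0.8, -1.3, 0.45, 0.2)
Y = 0.731

full = cubic(COEFFS, Y)
half = cubic(COEFFS, Y, cast=to_fp16)
print(f"full precision: {full:.8f}")
print(f"half precision: {half:.8f}")
print(f"absolute error: {abs(full - half):.2e}")
```

Half precision keeps roughly three significant decimal digits, so small rounding errors of this kind can matter when an L1 loss is computed directly on curve coordinates.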

sunpeng981712364 commented 2 years ago

Thanks for your reply. The documented training setup for the LSTR model is: "Trained on a 1080 Ti cluster, with CUDA 9.0 PyTorch 1.3, training time is estimated as: single 2080 Ti, mixed precision." I trained one model without mixed precision, but the performance still has a gap. My training setup is a V100 with PyTorch 1.6. I don't know whether that is the root cause. @voldemortX

voldemortX commented 2 years ago

@sunpeng981712364 which config file did you use and what was your shell script?

sunpeng981712364 commented 2 years ago

@voldemortX https://github.com/voldemortX/pytorch-auto-drive/blob/master/tools/shells/resnet34_lstr-aug_culane.sh

python main_landet.py --train --config=configs/lane_detection/lstr/resnet34_culane_aug.py
# Predicting lane points for testing
python main_landet.py --test --config=configs/lane_detection/lstr/resnet34_culane_aug.py
# Testing with official scripts
./autotest_culane.sh resnet34_lstr_culane-aug test checkpoints

voldemortX commented 2 years ago

@sunpeng981712364 And what were your results?

sunpeng981712364 commented 2 years ago

@voldemortX

experiment name: resnet34_culane-aug-f32-reproduce
status: test
save dir: checkpoints
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test0_normal.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 24602 fp: 5758 fn: 8175
finished process file
precision: 0.810343
recall: 0.750587
Fmeasure: 0.779321
----------------------------------
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test1_crowd.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 15357 fp: 7135 fn: 12646
finished process file
precision: 0.682776
recall: 0.548406
Fmeasure: 0.608258
----------------------------------
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test2_hlight.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 770 fp: 439 fn: 915
finished process file
precision: 0.63689
recall: 0.456973
Fmeasure: 0.532135
----------------------------------
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test3_shadow.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 1398 fp: 891 fn: 1478
finished process file
precision: 0.610747
recall: 0.486092
Fmeasure: 0.541336
----------------------------------
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test4_noline.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 3467 fp: 2444 fn: 10554
finished process file
precision: 0.586534
recall: 0.247272
Fmeasure: 0.347883
----------------------------------
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test5_arrow.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 2037 fp: 594 fn: 1145
finished process file
precision: 0.77423
recall: 0.640163
Fmeasure: 0.700843
----------------------------------
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test6_curve.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 563 fp: 370 fn: 749
finished process file
precision: 0.60343
recall: 0.429116
Fmeasure: 0.501559
----------------------------------
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test7_cross.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 0 fp: 647 fn: 0
no ground truth positive
finished process file
precision: 0
recall: -1
Fmeasure: 0
----------------------------------
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test8_night.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 9807 fp: 3782 fn: 11223
finished process file
precision: 0.721687
recall: 0.466334
Fmeasure: 0.566567
----------------------------------

F1 score: 62.716843187970284
Precision: 72.44600991743796
Recall: 55.29908662738592
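For readers checking the per-category numbers above, the following sketch recomputes precision, recall, and F-measure from the `tp`, `fp`, and `fn` counts. The function name `lane_metrics` is hypothetical, and the handling of the empty-ground-truth case (recall reported as -1) simply mirrors what the log prints for the cross split; it is not taken from the evaluation script's source.

```python
def lane_metrics(tp: int, fp: int, fn: int):
    """Return (precision, recall, f_measure) from matched-lane counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    # The log prints recall as -1 when a split has no ground-truth lanes.
    recall = tp / (tp + fn) if tp + fn > 0 else -1.0
    if precision + recall > 0:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Counts copied from the "normal" and "cross" splits in the log above.
for name, counts in {"normal": (24602, 5758, 8175), "cross": (0, 647, 0)}.items():
    p, r, f = lane_metrics(*counts)
    print(f"{name}: precision={p:.6f} recall={r:.6f} Fmeasure={f:.6f}")
```

With the counts of the normal split this gives the same precision, recall, and F-measure as the log to six decimal places.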

sunpeng981712364 commented 2 years ago

> @sunpeng981712364 And what were your results?

Sorry for that.

voldemortX commented 2 years ago

That does seem weird. Could you check your config file against the current master branch, then try downloading the pre-trained model and testing it?

sunpeng981712364 commented 2 years ago

@voldemortX Thanks very much. I downloaded the provided trained model and got the same metrics as reported.

sunpeng981712364 commented 2 years ago

By the way, your LSTR implementation achieves higher metrics on the CULane dataset than the ones the authors provide. Great work!

voldemortX commented 2 years ago

@sunpeng981712364 This means the testing process has no bugs. I've also checked the training process; it is exactly the same as before the BC-breaking changes. It could be that the original commit recorded the wrong training hyperparameters. Let me check some old logs.

voldemortX commented 2 years ago

@sunpeng981712364 I have some guesses, but training LSTR takes a really long time and I currently don't have a GPU available. Could you try replacing its step lr_scheduler with this one?

lr_scheduler = dict(
    name='poly_scheduler_with_warmup',
    epochs=150,
    power=0.9,
    warmup_steps=200
)

Nevertheless, there is no doubt the original 3 random runs achieved that accuracy.
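For context, a polynomial schedule with warmup is commonly defined as a linear ramp followed by a `(1 - step / total_steps) ** power` decay. The sketch below shows that common formulation as a learning-rate multiplier. It is an assumption about what `poly_scheduler_with_warmup` does, not a copy of the repo's code: the function name `poly_warmup_factor`, the piecewise form, and the choice to count `warmup_steps` in iterations are all illustrative.

```python
def poly_warmup_factor(step: int, total_steps: int,
                       power: float = 0.9, warmup_steps: int = 200) -> float:
    """Multiplier applied to the base learning rate at a given iteration."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the full learning rate.
        return step / warmup_steps
    # Polynomial decay towards 0 at the end of training (clamped at 0).
    remaining = max(0.0, 1.0 - step / total_steps)
    return remaining ** power


# Hypothetical run length, only to show the shape of the curve.
TOTAL_STEPS = 1000
for step in (0, 100, 200, 500, 1000):
    print(step, round(poly_warmup_factor(step, TOTAL_STEPS), 4))
```

A function of this shape can be passed as `lr_lambda` to PyTorch's `torch.optim.lr_scheduler.LambdaLR` when the scheduler is stepped once per iteration.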

sunpeng981712364 commented 2 years ago

@voldemortX It needs 150 epochs? Why did I train only 12 epochs?

voldemortX commented 2 years ago

> @voldemortX It needs 150 epochs? Why did I train only 12 epochs?

Really? I thought you used the master branch code which has 150 epochs? https://github.com/voldemortX/pytorch-auto-drive/blob/4f6527660ef3e285e9bb92f374f495f33e32216a/configs/lane_detection/lstr/resnet34_culane_aug.py#L9

If that is the case, please train for 150 epochs with the step scheduler and see if it still can't be reproduced.

sunpeng981712364 commented 2 years ago

> @voldemortX It needs 150 epochs? Why did I train only 12 epochs?
>
> Really? I thought you used the master branch code which has 150 epochs?
>
> https://github.com/voldemortX/pytorch-auto-drive/blob/4f6527660ef3e285e9bb92f374f495f33e32216a/configs/lane_detection/lstr/resnet34_culane_aug.py#L9

https://github.com/voldemortX/pytorch-auto-drive/blob/4f6527660ef3e285e9bb92f374f495f33e32216a/configs/lane_detection/lstr/resnet34_culane_aug.py#L28

"1650640554.710604","[12, 3951] loss curve aux0: 0.1923"
"1650640554.710606","[12, 3951] loss upper aux0: 0.0207"
"1650640554.710608","[12, 3951] loss lower aux0: 0.0169"
"1650640554.71061","[12, 4395] training loss: 2.6497"
"1650640554.710612","[12, 4395] loss label: 0.1068"
"1650640554.710615","[12, 4395] loss curve: 0.1858"
"1650640554.710617","[12, 4395] loss upper: 0.0198"
"1650640554.710619","[12, 4395] loss lower: 0.0159"
"1650640554.710621","[12, 4395] training loss aux0: 1.3288"
"1650640554.710623","[12, 4395] loss label aux0: 0.1008"
"1650640554.710625","[12, 4395] loss curve aux0: 0.1907"
"1650640554.710627","[12, 4395] loss upper aux0: 0.0201"
"1650640554.710629","[12, 4395] loss lower aux0: 0.0164"
"1650640554.710631","Epoch time: 1336.53s"
"1650640554.710633","Files saved at: ./checkpoints/resnet34_lstr_culane-aug."
"1650640554.710635","Tensorboard log at: ./checkpoints/tb_logs/resnet34_lstr_culane-aug"

From the log, 12 epochs were trained.
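A quick way to confirm how many epochs a run actually covered is to scan the training log for the largest epoch index. The sketch below assumes the record layout seen in the excerpt above, that is, a quoted timestamp followed by a quoted message of the form `[<epoch>, <step>] <name>: <value>`. The helper name `last_epoch` and the regular expression are illustrative and not part of the repo.

```python
import csv
import io
import re

# Assumed message layout: "[<epoch>, <step>] <metric name>: <value>"
RECORD = re.compile(r"\[(\d+),\s*(\d+)\]\s+(.+?):\s+([-\d.]+)$")


def last_epoch(log_text: str) -> int:
    """Return the largest epoch index found in the log, or 0 if none is found."""
    epochs = [0]
    for row in csv.reader(io.StringIO(log_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        match = RECORD.match(row[1])
        if match:
            epochs.append(int(match.group(1)))
    return max(epochs)


# Two records copied from the log excerpt above.
sample = '''"1650640554.710612","[12, 4395] loss label: 0.1068"
"1650640554.710631","Epoch time: 1336.53s"
'''
print(last_epoch(sample))
```

Comparing this value with the epoch count in the config file shows at a glance whether a run stopped earlier than intended.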

voldemortX commented 2 years ago

@sunpeng981712364 Thanks for spotting this bug! It was caused by the refactoring in v3.0. Sorry for the trouble. Could you change that 12 to 150 and see if you can reproduce the results? I'll update the master branch in a minute.

voldemortX commented 2 years ago

> By the way, your LSTR implementation achieves higher metrics on the CULane dataset than the ones the authors provide. Great work!

Thanks. That is mainly because LSTR is really under-tuned on CULane, while in contrast, most other methods are significantly over-tuned.

sunpeng981712364 commented 2 years ago

You are welcome. I really appreciate your timely help. I will post my results after training and then discuss them with you.

sunpeng981712364 commented 2 years ago

@voldemortX The careful design of this repo is evident. One piece of advice about the repo: the visualization is not sufficient, for example for the learning rate, metrics, and predictions and labels on images.

voldemortX commented 2 years ago

> One piece of advice about the repo: the visualization is not sufficient, for example for the learning rate, metrics, and predictions and labels on images.

We'll improve visualization in the future, although it might be a lower priority right now.

sunpeng981712364 commented 2 years ago

experiment name: resnet34_culane-aug-32-e150
status: test
save dir: checkpoints
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test0_normal.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 28221 fp: 2786 fn: 4556
finished process file
precision: 0.910149
recall: 0.861
Fmeasure: 0.884893
----------------------------------
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test1_crowd.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 18007 fp: 5986 fn: 9996
finished process file
precision: 0.750511
recall: 0.643038
Fmeasure: 0.69263
----------------------------------
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test2_hlight.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 941 fp: 428 fn: 744
finished process file
precision: 0.687363
recall: 0.558457
Fmeasure: 0.616241
----------------------------------
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test3_shadow.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 1530 fp: 795 fn: 1346
finished process file
precision: 0.658065
recall: 0.531989
Fmeasure: 0.588348
----------------------------------
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test4_noline.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 4621 fp: 3001 fn: 9400
finished process file
precision: 0.606271
recall: 0.329577
Fmeasure: 0.42702
----------------------------------
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test5_arrow.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 2514 fp: 353 fn: 668
finished process file
precision: 0.876875
recall: 0.790069
Fmeasure: 0.831212
----------------------------------
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test6_curve.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 671 fp: 279 fn: 641
finished process file
precision: 0.706316
recall: 0.511433
Fmeasure: 0.59328
----------------------------------
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test7_cross.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 0 fp: 970 fn: 0
no ground truth positive
finished process file
precision: 0
recall: -1
Fmeasure: 0
----------------------------------
------------Configuration---------
anno_dir: /mnt/training/pengsun/datasets/culane/
detect_dir: ../../output/
im_dir: /mnt/training/pengsun/datasets/culane/
list_im_file: /mnt/training/pengsun/datasets/culane/list/test_split/test8_night.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
-----------------------------------
Evaluating the results...
tp: 12113 fp: 3498 fn: 8917
finished process file
precision: 0.775927
recall: 0.575987
Fmeasure: 0.661172
----------------------------------

F1 score: 71.62135012080434
Precision: 79.13139746753696
Recall: 65.42150525332265

@voldemortX The F1 is 0.86% lower than your number, which I think is acceptable. Thanks for your help; I will close the issue.

voldemortX / pytorch-auto-drive

Inconsistent metrics about LSTR reproduce on CULane dataset #77