swc-17 / SparseDrive

SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation
MIT License

Classification loss equals 0 during training. #30

Closed: PeterJaq closed this issue 2 months ago

PeterJaq commented 2 months ago

Hi, I am fine-tuning the model on nuScenes, starting from the released stage 1 weights, but every classification (cls) loss is 0:

mmdet - INFO - Iter [51/87900] lr: 8.000e-05, eta: 2 days, 5:49:16, time: 2.206, data_time: 0.223, memory: 27280, det_loss_cls_0: 0.0000, det_loss_box_0: 0.9162, det_loss_cns_0: 0.6314, det_loss_yns_0: 0.0632, det_loss_cls_1: 0.0000, det_loss_box_1: 0.5736, det_loss_cns_1: 0.5961, det_loss_yns_1: 0.0322, det_loss_cls_2: 0.0000, det_loss_box_2: 0.5553, det_loss_cns_2: 0.5914, det_loss_yns_2: 0.0306, det_loss_cls_3: 0.0000, det_loss_box_3: 0.5486, det_loss_cns_3: 0.5904, det_loss_yns_3: 0.0299, det_loss_cls_4: 0.0000, det_loss_box_4: 0.5399, det_loss_cns_4: 0.5885, det_loss_yns_4: 0.0281, det_loss_cls_5: 0.0000, det_loss_box_5: 0.5376, det_loss_cns_5: 0.5885, det_loss_yns_5: 0.0285, map_loss_cls_0: 0.0000, map_loss_line_0: 0.6851, map_loss_cls_1: 0.0000, map_loss_line_1: 0.7943, map_loss_cls_2: 0.0000, map_loss_line_2: 0.7175, map_loss_cls_3: 0.0000, map_loss_line_3: 0.9981, map_loss_cls_4: 0.0000, map_loss_line_4: 0.9802, map_loss_cls_5: 0.0000, map_loss_line_5: 0.9968, loss_dense_depth: 0.4719, loss: 13.1137, grad_norm: 12.8054

Have you run into this before?
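To see at a glance which terms in a log line like the one above are stuck at zero, a small script can help. This is a minimal sketch: the helper name `zero_losses` and the regular expression are illustrative assumptions, not part of SparseDrive or mmdet, and the sample string is a shortened copy of the log line above.

```python
import re

# Matches "<name containing 'loss'>: <number>" pairs in an mmdet-style log line.
# The pattern is an assumption based on the log format shown in this thread.
LOSS_FIELD = re.compile(r"(\w*loss\w*): (\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)")


def zero_losses(log_line):
    """Return the names of all loss terms whose logged value is exactly 0."""
    return [name for name, value in LOSS_FIELD.findall(log_line)
            if float(value) == 0.0]


# Shortened copy of the log line from this issue.
sample = ("det_loss_cls_0: 0.0000, det_loss_box_0: 0.9162, "
          "map_loss_cls_0: 0.0000, loss_dense_depth: 0.4719, "
          "loss: 13.1137, grad_norm: 12.8054")

print(zero_losses(sample))  # ['det_loss_cls_0', 'map_loss_cls_0']
```

On the full log line, this would list only the `det_loss_cls_*` and `map_loss_cls_*` terms, which narrows the problem to the classification loss rather than the whole pipeline.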

swc-17 commented 2 months ago

We did not encounter this problem. Did you modify the config or the code?

PeterJaq commented 2 months ago


Solved! The training container was built, and mmcv was installed in it, on an RTX 3080. After we moved this container to an A100, the training loss was wrong. We checked mmcv's focal loss API and found that sigmoid focal loss returned 0 for any input. Reinstalling mmcv on the A100 solved the problem. Thank you.
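The check described above can be sketched as follows. This is a hedged example, not the exact check that was run: it compares `mmcv.ops.sigmoid_focal_loss` (called positionally as input, target, gamma, alpha, weight, reduction, which is assumed to match the installed mmcv version) against a pure-Python reference. The helper names, test logits, and tolerance are illustrative assumptions. One plausible cause, not confirmed in this thread, is that mmcv's CUDA ops were compiled only for the RTX 3080's architecture (compute capability 8.6) and therefore did not run correctly on the A100 (compute capability 8.0).

```python
import math


def focal_loss_ref(logit, is_positive, gamma=2.0, alpha=0.25):
    """Reference sigmoid focal loss for a single logit."""
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if is_positive else 1.0 - p
    alpha_t = alpha if is_positive else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def expected_sum(logits, labels, gamma=2.0, alpha=0.25):
    """Sum the reference loss over every (sample, class) pair."""
    return sum(focal_loss_ref(logit, cls == label, gamma, alpha)
               for row, label in zip(logits, labels)
               for cls, logit in enumerate(row))


def check_mmcv_focal_loss(logits, labels, tol=1e-4):
    """Return True/False if mmcv agrees with the reference, None if untestable."""
    try:
        import torch
        from mmcv.ops import sigmoid_focal_loss
    except ImportError:
        return None  # torch or mmcv is not installed in this environment
    if not torch.cuda.is_available():
        return None  # this sketch only exercises the CUDA op
    x = torch.tensor(logits, dtype=torch.float32, device="cuda")
    y = torch.tensor(labels, dtype=torch.long, device="cuda")
    # Positional arguments: input, target, gamma, alpha, weight, reduction.
    got = sigmoid_focal_loss(x, y, 2.0, 0.25, None, "sum").item()
    return abs(got - expected_sum(logits, labels)) < tol


# Illustrative inputs: 2 samples, 3 classes.
LOGITS = [[0.0, 2.0, -1.0], [1.5, -0.5, 0.3]]
LABELS = [1, 0]

if __name__ == "__main__":
    print("reference sum:", expected_sum(LOGITS, LABELS))
    print("mmcv matches reference:", check_mmcv_focal_loss(LOGITS, LABELS))
```

The reference sum is strictly positive for these inputs, so a result of `False` from `check_mmcv_focal_loss` (for example, because mmcv returns 0) would point at the compiled op rather than at the model or the data.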

yk112233 commented 2 months ago

@PeterJaq, I encountered the same problem. Did you solve it by reinstalling mmcv_full with a different version in the training container on the A100?