open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

DETR training problem #10398

Closed lkydsb21 closed 1 year ago

lkydsb21 commented 1 year ago

When I trained DETR on my own dataset, I got "ValueError: matrix contains invalid numeric entries". Following the previous answer, I used a pre-trained model, but the problem still occurs. I then changed the inf value in the cost to -1e8; the model can now start training, but the loss value is NaN. How can I fix it? Log: 20230526_134111.log

lkydsb21 commented 1 year ago

In mmdetection/mmdet/models/task_modules/assigners/hungarian_assigner.py, I changed the inf value in the cost to -1e8. [Screenshot from 2023-05-26 14-03-54]
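A possible explanation for why training starts but the loss is NaN, sketched under simplifying assumptions: if the predictions themselves are already NaN, overwriting the invalid cost entries only hides the symptom from the matcher. The toy example below is not MMDetection code. It uses 1-D "boxes", an L1 cost, and a greedy pick as a stand-in for Hungarian matching; `l1_cost` and `sanitize` are hypothetical helpers.

```python
import math


def l1_cost(preds, gts):
    """Cost matrix with one row per prediction and one column per ground truth."""
    return [[abs(p - g) for g in gts] for p in preds]


def sanitize(cost, fill=-1e8):
    """Overwrite every NaN/inf entry with a finite value (the kind of edit described above)."""
    return [[v if math.isfinite(v) else fill for v in row] for row in cost]


# Hypothetical diverged model: the first prediction is NaN.
preds = [float("nan"), 0.4]
gts = [0.5]

cost = l1_cost(preds, gts)   # contains NaN, so a strict matcher would reject it
clean = sanitize(cost)       # all entries finite, so matching can proceed

# Greedy stand-in for Hungarian matching: pick the cheapest prediction for the single target.
best = min(range(len(preds)), key=lambda i: clean[i][0])

# The loss is computed from the raw prediction, which is still NaN.
loss = abs(preds[best] - gts[0])
print(best, loss)
```

Note that a very low fill value such as -1e8 makes the invalid entry the cheapest one, so the matcher prefers the NaN prediction. The fix has to happen upstream, where the NaN is produced.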

lkydsb21 commented 1 year ago

I don't think there's a problem with my dataset, because I can train other models on it, such as Faster R-CNN and SSD.

pingguokiller commented 1 year ago

The potential reason is that the model does not converge.

I met the same problem when using DAB-DETR with configs/dab_detr/dab-detr_r50_8xb2-50e_coco.py.

The error log:

    File "/home/zhangjw/research/research/mmdetection2023/mmdetection/mmdet/models/dense_heads/detr_head.py", line 406, in _get_targets_single
      assign_result = self.assigner.assign(
    File "/home/zhangjw/research/research/mmdetection2023/mmdetection/mmdet/models/task_modules/assigners/hungarian_assigner.py", line 135, in assign
      matched_row_inds, matched_col_inds = linear_sum_assignment(cost)
    File "/root/anaconda3/lib/python3.8/site-packages/scipy/optimize/_lsap.py", line 93, in linear_sum_assignment
      raise ValueError("matrix contains invalid numeric entries")
    ValueError: matrix contains invalid numeric entries

My dataset can be trained with many other models, such as Mask R-CNN, AutoAssign, and RTMDet.

In the end, I found that this may be caused by two issues:

  1. I changed the optimizer from the original AdamW to SGD.
  2. I utilized the wrong mean and std of the dataset.

These issues may cause the model not to converge.
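To check the second point, the per-channel mean and std can be recomputed from the training images. The sketch below is a minimal pure-Python version on toy data. `channel_stats` is a hypothetical helper, and the `data_preprocessor` keys are written as I recall them from the MMDetection 3.x DETR configs, so verify them against your installed version.

```python
import math


def channel_stats(images):
    """Per-channel mean and std over all pixels.

    images: iterable of images, each a list of (r, g, b) tuples in the 0..255 range.
    """
    n = 0
    sums = [0.0, 0.0, 0.0]
    sq_sums = [0.0, 0.0, 0.0]
    for img in images:
        for px in img:
            n += 1
            for c in range(3):
                sums[c] += px[c]
                sq_sums[c] += px[c] * px[c]
    mean = [s / n for s in sums]
    # Population variance E[x^2] - E[x]^2, clamped at 0 against rounding error.
    std = [math.sqrt(max(q / n - m * m, 0.0)) for q, m in zip(sq_sums, mean)]
    return mean, std


# Toy "dataset": one image with a black pixel and a white pixel.
images = [[(0, 0, 0), (255, 255, 255)]]
mean, std = channel_stats(images)
print(mean, std)

# Assumed config fragment (key names recalled from MMDetection 3.x, not verified here).
data_preprocessor = dict(
    type='DetDataPreprocessor',
    mean=mean,
    std=std,
    bgr_to_rgb=True,
)
```

For the first point, the simplest check is to switch the optimizer back to the AdamW settings of the original config and see whether the error disappears.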

lkydsb21 commented 1 year ago

I resolved this bug by re-installing mmdetection. It's worth noting that you should not modify the source code lightly: I had run into other errors before because of earlier changes I made to the source. Look for possible issues in your own configuration file and dataset first.
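One way to look for dataset issues is to scan the annotations for degenerate boxes. The sketch below assumes COCO-style annotations with `bbox` given as `[x, y, width, height]`; `find_bad_boxes` is a hypothetical helper, and nothing in this thread establishes that such boxes caused this particular error.

```python
import math


def find_bad_boxes(coco):
    """Return (annotation id, reason) for boxes that look invalid.

    coco: dict with COCO-style "images" and "annotations" lists.
    """
    sizes = {img["id"]: (img["width"], img["height"]) for img in coco["images"]}
    bad = []
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]
        img_w, img_h = sizes[ann["image_id"]]
        if not all(math.isfinite(v) for v in (x, y, w, h)):
            bad.append((ann["id"], "non-finite value"))
        elif w <= 0 or h <= 0:
            bad.append((ann["id"], "non-positive size"))
        elif x < 0 or y < 0 or x + w > img_w or y + h > img_h:
            bad.append((ann["id"], "outside image"))
    return bad


# Toy annotation file with one valid box and three suspicious ones.
coco = {
    "images": [{"id": 1, "width": 100, "height": 80}],
    "annotations": [
        {"id": 10, "image_id": 1, "bbox": [10, 10, 20, 20]},
        {"id": 11, "image_id": 1, "bbox": [5, 5, 0, 12]},
        {"id": 12, "image_id": 1, "bbox": [90, 70, 20, 20]},
        {"id": 13, "image_id": 1, "bbox": [0, 0, float("nan"), 5]},
    ],
}

bad = find_bad_boxes(coco)
for ann_id, reason in bad:
    print(ann_id, reason)
```

In practice the dict would come from `json.load` on your annotation file.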