Closed SangbumChoi closed 2 months ago
Hi @SangbumChoi, thanks for creating a separate issue. It seems this question has already been discussed in the following issues:
1) https://github.com/huggingface/transformers/issues/28740
2) https://github.com/huggingface/transformers/issues/13197
It's mentioned in the docs, but it's probably worth making object detection models trainable with a multi-GPU setup.
If you want to train the model in a distributed environment across multiple nodes, then one should update the num_boxes variable in the DetrLoss class of modeling_detr.py. When training on multiple nodes, this should be set to the average number of target boxes across all nodes, as can be seen in the original implementation here.
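The normalization the docs describe can be sketched in plain Python. This is only an illustration of the arithmetic, not the code in modeling_detr.py: the function name `normalized_num_boxes` is hypothetical, and the list of per-process counts stands in for what a real run would obtain by summing across processes (for example with `torch.distributed.all_reduce`) and dividing by the world size. The clamp to a minimum of 1 is assumed here as a guard against dividing the loss by zero when no process has any target boxes.

```python
def normalized_num_boxes(per_process_box_counts):
    """Average number of target boxes per process, clamped to at least 1.

    `per_process_box_counts` is a hypothetical stand-in for the values each
    process would contribute to a distributed sum (all-reduce) in real training.
    """
    world_size = len(per_process_box_counts)
    total_boxes = sum(per_process_box_counts)  # what an all-reduce (SUM) would yield
    # Clamp so the loss is never divided by zero when no targets are present.
    return max(total_boxes / world_size, 1.0)


# Example: four processes whose local batches hold 3, 5, 0 and 4 target boxes.
num_boxes = normalized_num_boxes([3, 5, 0, 4])
print(num_boxes)
```

Each process would then divide its box losses by this shared value instead of by its own local box count, so the loss scale stays consistent across processes.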
@qubvel Thanks for referencing the issues. IMO those two things were handled by the commit https://github.com/huggingface/transformers/pull/28312/files#diff-5229d293ce9b5a88ce60b77fe0b89a5ec6240faae55381b5097424f11ac0149d
So I think we can fix this problem by debugging each value of cost_matrix. Let me dig into this and let you know!
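One possible way to inspect each value of cost_matrix is sketched below. The thread does not say what is wrong with the matrix, so this assumes the goal is to find non-finite entries (NaN or infinity). It also assumes the matrix has first been converted to a nested list of floats, for example with `cost_matrix.tolist()` on a tensor. The helper name `find_invalid_entries` is hypothetical.

```python
import math


def find_invalid_entries(cost_matrix):
    """Return (row, col, value) for every entry that is NaN or infinite.

    `cost_matrix` is assumed to be a nested list of floats, one inner list
    per prediction and one column per target box.
    """
    return [
        (i, j, value)
        for i, row in enumerate(cost_matrix)
        for j, value in enumerate(row)
        if not math.isfinite(value)
    ]


# Toy matrix with one NaN and one infinite entry.
toy_cost_matrix = [[0.1, float("nan")], [float("inf"), 2.0]]
for row, col, value in find_invalid_entries(toy_cost_matrix):
    print(f"cost_matrix[{row}][{col}] = {value}")
```

Running such a check just before the matching step, on each process, would show whether the invalid values appear only in the multi-GPU setup.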
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @SangbumChoi, were you able to find a fix for this?
System Info
@qubvel
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
I think the cause might be the number of GPUs or a hyperparameter.