huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

linear_sum_assignment error in the object_detection.py guide #31461

Closed SangbumChoi closed 2 months ago

SangbumChoi commented 3 months ago

System Info

root@fb9fa1e6d8d8:/mnt/nas2/users/sbchoi/transformers/examples/pytorch/object-detection# transformers-cli env

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.42.0.dev0
- Platform: Linux-5.4.0-167-generic-x86_64-with-glibc2.35
- Python version: 3.10.14
- Huggingface_hub version: 0.23.4
- Safetensors version: 0.4.3
- Accelerate version: 0.30.1
- Accelerate config:    not found
- PyTorch version (GPU?): 2.2.2 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA TITAN RTX
Traceback (most recent call last):
  File "/mnt/nas2/users/sbchoi/transformers/examples/pytorch/object-detection/run_object_detection.py", line 521, in <module>
    main()
  File "/mnt/nas2/users/sbchoi/transformers/examples/pytorch/object-detection/run_object_detection.py", line 496, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/mnt/nas2/users/sbchoi/transformers/src/transformers/trainer.py", line 1903, in train
    return inner_training_loop(
  File "/mnt/nas2/users/sbchoi/transformers/src/transformers/trainer.py", line 2248, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/mnt/nas2/users/sbchoi/transformers/src/transformers/trainer.py", line 3275, in training_step
    loss = self.compute_loss(model, inputs)
  File "/mnt/nas2/users/sbchoi/transformers/src/transformers/trainer.py", line 3307, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 185, in forward
    outputs = self.parallel_apply(replicas, inputs, module_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 200, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 108, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.10/site-packages/torch/_utils.py", line 722, in reraise
    raise exception
IndexError: Caught IndexError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/nas2/users/sbchoi/transformers/src/transformers/models/detr/modeling_detr.py", line 1485, in forward
    loss_dict = criterion(outputs_loss, labels)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/nas2/users/sbchoi/transformers/src/transformers/models/detr/modeling_detr.py", line 2084, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/nas2/users/sbchoi/transformers/src/transformers/models/detr/modeling_detr.py", line 2213, in forward
    indices = [linear_sum_assignment(c[i]) for i, c in enumerate(cost_matrix.split(sizes, -1))]
  File "/mnt/nas2/users/sbchoi/transformers/src/transformers/models/detr/modeling_detr.py", line 2213, in <listcomp>
    indices = [linear_sum_assignment(c[i]) for i, c in enumerate(cost_matrix.split(sizes, -1))]
IndexError: index 8 is out of bounds for dimension 0 with size 8

@qubvel

Who can help?

No response

Information

Tasks

Reproduction

python run_object_detection.py \
    --model_name_or_path facebook/detr-resnet-50 \
    --dataset_name cppe-5 \
    --do_train true \
    --do_eval true \
    --output_dir detr-finetuned-cppe-5-10k-steps \
    --num_train_epochs 100 \
    --image_square_size 600 \
    --fp16 true \
    --learning_rate 5e-5 \
    --weight_decay 1e-4 \
    --dataloader_num_workers 4 \
    --dataloader_prefetch_factor 2 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --remove_unused_columns false \
    --eval_do_concat_batches false \
    --ignore_mismatched_sizes true \
    --metric_for_best_model eval_map \
    --greater_is_better true \
    --load_best_model_at_end true \
    --logging_strategy epoch \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --save_total_limit 2 \
    --push_to_hub true \
    --push_to_hub_model_id detr-finetuned-cppe-5-10k-steps \
    --hub_strategy end \
    --seed 1337

Expected behavior

I think this might be caused by the number of GPUs or by a hyperparameter setting.
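
My current guess (hedged, assuming two GPUs and the `nn.DataParallel` path the `Trainer` falls back to when it sees multiple devices): the `pixel_values` tensor gets chunked across replicas, but the `labels` list of dicts keeps its full length on every replica, so the matcher's `enumerate` runs past the per-replica batch dimension. A minimal sketch of that scatter behavior (hypothetical shapes, needs at least two CUDA devices):

```python
# Hypothetical sketch: how nn.DataParallel scatters a (tensor, list-of-dicts) input.
# Requires >= 2 CUDA devices; shapes are made up for illustration.
import torch
from torch.nn.parallel.scatter_gather import scatter

pixel_values = torch.randn(16, 3, 600, 600)  # full batch: 8 per device * 2 devices
labels = [
    {"class_labels": torch.randint(0, 5, (3,)), "boxes": torch.rand(3, 4)}
    for _ in range(16)
]

per_device = scatter((pixel_values, labels), target_gpus=[0, 1])
pixels_0, labels_0 = per_device[0]

print(pixels_0.shape)  # torch.Size([8, 3, 600, 600]) -> the tensor is chunked
print(len(labels_0))   # 16 -> the list keeps its full length on replica 0

# Inside DetrHungarianMatcher, sizes would then have 16 entries while cost_matrix's
# batch dim is 8, matching "index 8 is out of bounds for dimension 0 with size 8".
```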

qubvel commented 3 months ago

Hi @SangbumChoi, thanks for creating a separate issue. It seems like this question has been discussed already in the following issues:

1) https://github.com/huggingface/transformers/issues/28740
2) https://github.com/huggingface/transformers/issues/13197

It's mentioned in the docs, but it's probably worth making object detection models trainable in a multi-GPU setup.

> If you want to train the model in a distributed environment across multiple nodes, then one should update the num_boxes variable in the DetrLoss class of modeling_detr.py. When training on multiple nodes, this should be set to the average number of target boxes across all nodes, as can be seen in the original implementation here.
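
For reference, the normalization the docs point to looks roughly like this in the original DETR codebase (a paraphrased sketch, not the exact `transformers` code; the original repo keys the targets with `"labels"`, while `transformers` uses `"class_labels"`):

```python
# Paraphrased sketch of the num_boxes normalization in the original DETR
# SetCriterion (facebookresearch/detr); DetrLoss in modeling_detr.py would need
# an equivalent cross-process reduction for a multi-GPU/multi-node setup.
import torch
import torch.distributed as dist

def average_num_boxes(targets, device):
    # Total number of target boxes on this process
    num_boxes = sum(len(t["class_labels"]) for t in targets)
    num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=device)
    if dist.is_available() and dist.is_initialized():
        # Sum across processes, then average by the world size
        dist.all_reduce(num_boxes)
        num_boxes = num_boxes / dist.get_world_size()
    return torch.clamp(num_boxes, min=1).item()
```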

SangbumChoi commented 3 months ago

@qubvel Thanks for referencing the issues. IMO those two things were handled by the commit https://github.com/huggingface/transformers/pull/28312/files#diff-5229d293ce9b5a88ce60b77fe0b89a5ec6240faae55381b5097424f11ac0149d

So I think we can fix this problem by debugging each value of `cost_matrix`. Let me dig into this and let you know!
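
A hypothetical sketch of what I plan to check inside `DetrHungarianMatcher.forward` (the `sizes`/`split` line from the traceback), mainly whether the number of target dicts still matches the cost matrix's batch dimension on each replica:

```python
# Hypothetical debugging helper mirroring the split in DetrHungarianMatcher.forward.
# cost_matrix is expected to have shape (batch_size, num_queries, total_num_target_boxes)
# and targets to be the per-image label dicts the matcher receives.
from scipy.optimize import linear_sum_assignment

def split_and_match(cost_matrix, targets):
    sizes = [len(t["boxes"]) for t in targets]
    print("batch dim:", cost_matrix.shape[0], "| target dicts:", len(sizes), "| sizes:", sizes)
    # Under nn.DataParallel, len(sizes) can exceed cost_matrix.shape[0]; that is
    # exactly where the reported IndexError comes from.
    return [linear_sum_assignment(c[i]) for i, c in enumerate(cost_matrix.split(sizes, -1))]
```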

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

parakh08 commented 2 months ago

Hi @SangbumChoi, were you able to find a fix for this?