Using distributed or parallel set-up in script?: yes
Using GPU in script?: yes
GPU type: NVIDIA A100 80GB PCIe
Who can help?
Hi, @muellerzr @SunMarc
Hi,
I have been trying to fine-tune the OPT 1.3B model on a subset of the allenai/ai2_arc dataset ('./data/10_low_p3-opt-125M.arrow' in the code) using 4 GPUs. The code works fine if I use the complete dataset (train_dataset = load_dataset("allenai/ai2_arc", "ARC-Challenge")[split]), but when I try to train on the subset (train_dataset = load_from_disk('./data/10_low_p3-opt-125M.arrow')), the optimizer step raises the following error:
```
  File line 169, in <module>
    trainer.train()
  File "python39/lib/python3.9/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "python39/lib/python3.9/site-packages/transformers/trainer.py", line 2341, in _inner_training_loop
    self.optimizer.step()
  File "python39/lib/python3.9/site-packages/accelerate/optimizer.py", line 172, in step
    self.optimizer.step(closure)
  File "python39/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "python39/lib/python3.9/site-packages/torch/optim/optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "python39/lib/python3.9/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "python39/lib/python3.9/site-packages/torch/optim/adamw.py", line 184, in step
    adamw(
  File "/net/scratch/lcpandia/python39/lib/python3.9/site-packages/torch/optim/adamw.py", line 335, in adamw
    func(
  File "python39/lib/python3.9/site-packages/torch/optim/adamw.py", line 509, in _multi_tensor_adamw
    grouped_tensors = Optimizer._group_tensors_by_device_and_dtype([
  File "python39/lib/python3.9/site-packages/torch/optim/optimizer.py", line 397, in _group_tensors_by_device_and_dtype
    return _group_tensors_by_device_and_dtype(tensorlistlist, with_indices)
  File "python39/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "python39/lib/python3.9/site-packages/torch/utils/_foreach_utils.py", line 42, in _group_tensors_by_device_and_dtype
    torch._C._group_tensors_by_device_and_dtype(tensorlistlist, with_indices).items()
RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
```
My minimal code to reproduce the issue is attached as a zip file: testTrainerDeviceIssue.zip
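For reference, the only difference between the working and the failing run is the data-loading call. A minimal sketch of the two variants (assuming split = "train", and assuming the subset was written with Dataset.save_to_disk(), since that is the format load_from_disk reads):

```python
from datasets import load_dataset, load_from_disk

# Works: the full ARC-Challenge split fetched from the Hub.
train_dataset = load_dataset("allenai/ai2_arc", "ARC-Challenge")["train"]

# Fails at optimizer.step(): the same training pipeline, but with a
# subset of the data previously saved to disk.
train_dataset = load_from_disk("./data/10_low_p3-opt-125M.arrow")
```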
System Info
transformers version: 4.44.2
Information
My own modified scripts
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
testTrainerDeviceIssue.zip (attached as a zip file). Launch command:

```
torchrun --nproc_per_node=4 --master_port=<> testTrainerDeviceIssue.py \
    --model_name_or_path facebook/opt-1.3b \
    --data_path 10_low_p0-opt-125M.arrow \
    --bf16 True \
    --output_dir / \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'OPTDecoderLayer' \
    --tf32 True \
    --seed 42 \
    --gradient_checkpointing True
```
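To help pinpoint which tensor ends up on the wrong device or dtype, here is a small diagnostic helper that can be run from a debugger breakpoint just before the failing step (dump_optimizer_devices is an illustrative name, not part of the attached script):

```python
import torch

def dump_optimizer_devices(optimizer: torch.optim.Optimizer) -> None:
    # Print device/dtype for every parameter and optimizer-state tensor,
    # to spot the mismatch _group_tensors_by_device_and_dtype complains about.
    for gi, group in enumerate(optimizer.param_groups):
        for pi, param in enumerate(group["params"]):
            print(f"group {gi} param {pi}: {param.device} {param.dtype}")
            for name, value in optimizer.state.get(param, {}).items():
                if torch.is_tensor(value):
                    print(f"    state['{name}']: {value.device} {value.dtype}")
```

Calling it as dump_optimizer_devices(trainer.optimizer) right before the crashing optimizer.step() should show whether a parameter or a state tensor (exp_avg, exp_avg_sq, step) is the one stranded on the wrong device.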
Expected behavior
Training should complete successfully, just as it does when the allenai/ai2_arc ARC-Challenge dataset is loaded directly with load_dataset.
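As a possible probe (untested, and not a fix for the underlying device mismatch): the crash happens inside _multi_tensor_adamw, i.e. AdamW's foreach fast path, so forcing the single-tensor path would at least avoid the grouping call that raises. A sketch of one way to do that through Trainer; SingleTensorAdamWTrainer is a made-up name:

```python
from transformers import Trainer

class SingleTensorAdamWTrainer(Trainer):
    # Hypothetical subclass: disable the foreach fast path on every param
    # group so AdamW falls back to _single_tensor_adamw, which never calls
    # _group_tensors_by_device_and_dtype (the site of the RuntimeError).
    def create_optimizer(self):
        optimizer = super().create_optimizer()
        for group in optimizer.param_groups:
            if "foreach" in group:
                group["foreach"] = False
        return optimizer
```

I have not verified whether this makes training succeed or just moves the error, but it may help isolate whether the problem is specific to the tensor-grouping step.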