System Info
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction (minimal, reproducible, runnable)
I can't share the project code, which uses a dataset of type Dataset with a length of 56403, but I wrote a simpler case that reproduces the same issue.
Command:
torchrun --nproc_per_node=2 issue.py
Code (issue.py)
```python
import torch
from transformers import RobertaForCausalLM
from optimum.neuron import NeuronTrainer as Trainer
from optimum.neuron import NeuronTrainingArguments as TrainingArguments

class CustomDataset(torch.utils.data.Dataset):
    def __getitem__(self, index):
        return {
            "input_ids": torch.randint(0, 50265, (512,)),
            "labels": torch.randint(0, 50265, (512,)),
        }

    def __len__(self):
        return 56403

dataset = CustomDataset()
model = RobertaForCausalLM.from_pretrained("roberta-base")
training_args = TrainingArguments(output_dir="./model", max_steps=100)
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
trainer.train()  # note the "[INFO|trainers.py:" output line reporting "Num examples"
```
When calling trainer.train(), "Num examples" should be 56403, but a different number is reported when using DDP: 28208 in this test.
"Num examples" is calculated by Trainer's num_examples() in https://github.com/huggingface/transformers/blob/v4.41.1/src/transformers/trainer.py#L1408 which is called by https://github.com/huggingface/optimum-neuron/blob/v0.0.24/optimum/neuron/trainers.py#L700 .
The issue doesn't occur when training without DDP: the dataloader is a regular torch.utils.data.DataLoader that exposes its dataset, and num_examples() returns the expected number.
With DDP, the dataloader is an MpDeviceLoader, and dataloader.dataset raises AttributeError("'MpDeviceLoader' object has no attribute 'dataset'"). This makes num_examples() fall back to an unexpected number at https://github.com/huggingface/transformers/blob/v4.41.1/src/transformers/trainer.py#L1420 . However, dataloader._loader is the wrapped DataLoader, and len(dataloader._loader.dataset) is 56403. Perhaps we should call self.num_examples(train_dataloader._loader) instead?
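If it helps, here is a minimal sketch of that idea (assuming MpDeviceLoader keeps the wrapped DataLoader in its _loader attribute, as observed above; the helper name _unwrap_mp_device_loader is hypothetical, not an existing optimum-neuron function, and the import requires torch_xla):

```python
from torch.utils.data import DataLoader
from torch_xla.distributed.parallel_loader import MpDeviceLoader

def _unwrap_mp_device_loader(dataloader) -> DataLoader:
    # MpDeviceLoader stores the wrapped DataLoader in `_loader`,
    # which still has the `.dataset` attribute num_examples() needs.
    if isinstance(dataloader, MpDeviceLoader):
        return dataloader._loader
    return dataloader

# In optimum-neuron's training loop, instead of
#   num_examples = self.num_examples(train_dataloader)
# one could call:
#   num_examples = self.num_examples(_unwrap_mp_device_loader(train_dataloader))
```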
Expected behavior
"Num examples" is not reported correctly when training with DDP on AWS Trainium/Inferentia instances. In the reproducible code, it should be 56403 (len of dataset), but it returns 28208 based on an exception occurring in num_examples() in the transformers package.
For additional reference, on an EC2 p4d instance (NVIDIA A100 GPUs), when using DDP with the Trainer from the transformers package, the dataloader still exposes a dataset attribute and "Num examples" is reported as expected: 56403.