Training output reports incorrect num examples when using DDP

System Info

AWS EC2 instance: trn1.32xlarge
OS: Ubuntu 22.04.4 LTS

Platform:

- Platform: Linux-6.5.0-1023-aws-x86_64-with-glibc2.35
- Python version: 3.10.12

Python packages:

- `optimum-neuron` version: 0.0.24
- `neuron-sdk` version: 2.19.1
- `optimum` version: 1.20.0
- `transformers` version: 4.41.1
- `huggingface_hub` version: 0.24.5
- `torch` version: 2.1.2+cu121
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 2.0.2335
- `neuronx-cc` version: 2.14.227.0+2d4f85be
- `neuronx-distributed` version: 0.8.0
- `neuronx-hwm` version: NA
- `torch-neuronx` version: 2.1.2.2.1.0
- `torch-xla` version: 2.1.2
- `transformers-neuronx` version: 0.10.0.21

Neuron Driver:
aws-neuronx-collectives/unknown,now 2.21.46.0-69b77134b amd64 [installed]
aws-neuronx-dkms/unknown,now 2.17.17.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.21.41.0-fb1705f5f amd64 [installed]
aws-neuronx-tools/unknown,now 2.18.3.0 amd64 [installed]

Who can help?

No response

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

I can't share the project code, which uses a dataset of type Dataset with a length of 56403, but I wrote a simpler case that shows the same issue.

Command: torchrun --nproc_per_node=2 issue.py

Code (issue.py)

```python
import torch
from transformers import RobertaForCausalLM
from optimum.neuron import NeuronTrainer as Trainer
from optimum.neuron import NeuronTrainingArguments as TrainingArguments


class CustomDataset(torch.utils.data.Dataset):
    def __getitem__(self, index):
        return {
            "input_ids": torch.randint(0, 50265, (512,)),
            "labels": torch.randint(0, 50265, (512,))
        }

    def __len__(self):
        return 56403


dataset = CustomDataset()
model = RobertaForCausalLM.from_pretrained("roberta-base")
training_args = TrainingArguments(output_dir="./model", max_steps=100)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)

trainer.train()

# note the output line: "[INFO|trainers.py:] >> Num examples = ""
# the issue is at https://github.com/huggingface/optimum-neuron/blob/v0.0.24/optimum/neuron/trainers.py#L700
# currently "self.num_examples(train_dataloader)" = 28208
# should maybe be "self.num_examples(train_dataloader._loader)" = 56403 (expected)
```

When calling trainer.train(), we get the output:

[INFO|trainers.py:] <timestamp> >> ***** Running training *****
[INFO|trainers.py:] <timestamp> >>   Num examples = 28,208
...

"Num examples" should be 56403, but a different number is reported when using DDP: 28208 in this test.

"Num examples" is calculated by Trainer's num_examples() in https://github.com/huggingface/transformers/blob/v4.41.1/src/transformers/trainer.py#L1408 which is called by https://github.com/huggingface/optimum-neuron/blob/v0.0.24/optimum/neuron/trainers.py#L700 .

The issue doesn't occur when training without DDP. Without DDP, num_examples() returns the expected number.

With DDP, the dataloader is an MpDeviceLoader, and dataloader.dataset raises AttributeError("'MpDeviceLoader' object has no attribute 'dataset'"). This makes num_examples() fall back to an estimate and return an unexpected number at https://github.com/huggingface/transformers/blob/v4.41.1/src/transformers/trainer.py#L1420 . However, dataloader._loader holds the wrapped loader, and len(dataloader._loader.dataset) is 56403. Perhaps we should call self.num_examples(train_dataloader._loader)?
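The suggestion above could look roughly like the sketch below. It uses plain Python stand-ins rather than the real torch / torch_xla classes, and `unwrap_dataloader` is a hypothetical helper name, not an existing function in optimum-neuron or transformers. The only behavior assumed from the real objects is what is described in this issue: the DDP loader has no `dataset` attribute but keeps the wrapped loader in `_loader`.

```python
class FakeDataset:
    """Stand-in for the dataset used in the reproduction."""

    def __len__(self):
        return 56403


class FakeDataLoader:
    """Stand-in for a regular dataloader: exposes .dataset."""

    def __init__(self, dataset):
        self.dataset = dataset


class FakeMpDeviceLoader:
    """Stand-in for the DDP wrapper: no .dataset, wrapped loader in ._loader."""

    def __init__(self, loader):
        self._loader = loader


def unwrap_dataloader(dataloader):
    """Hypothetical helper: return the wrapped loader when .dataset is missing."""
    if not hasattr(dataloader, "dataset") and hasattr(dataloader, "_loader"):
        return dataloader._loader
    return dataloader


plain = FakeDataLoader(FakeDataset())
wrapped = FakeMpDeviceLoader(plain)

# Both paths now reach the dataset and report its real length.
n_plain = len(unwrap_dataloader(plain).dataset)
n_wrapped = len(unwrap_dataloader(wrapped).dataset)
print(n_plain, n_wrapped)
```

Checking for `dataset` first leaves the non-DDP path unchanged, so only loaders that would otherwise hit the AttributeError are unwrapped.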

Expected behavior

"Num examples" should report the dataset length when training with DDP on AWS Trainium/Inferentia instances. In the reproduction above it should be 56403 (the length of the dataset), but 28208 is reported because an exception is raised inside num_examples() in the transformers package.
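The value 28208 is consistent with the fallback in num_examples(), which estimates the count as len(dataloader) * per_device_train_batch_size when the dataset cannot be reached. The arithmetic below is a sketch under two assumptions that are not stated in this issue: the default per_device_train_batch_size of 8 is in effect, and each of the 2 processes receives an evenly padded shard of the dataset (as torch's DistributedSampler does when drop_last is False).

```python
import math

dataset_len = 56403      # len(dataset) from the reproduction
world_size = 2           # torchrun --nproc_per_node=2
per_device_batch = 8     # assumption: default per_device_train_batch_size

# Assumption: each rank gets an evenly padded shard of the dataset.
samples_per_rank = math.ceil(dataset_len / world_size)

# Number of batches one rank iterates over (last partial batch is kept).
batches_per_rank = math.ceil(samples_per_rank / per_device_batch)

# Fallback estimate: len(dataloader) * per_device_train_batch_size
fallback_estimate = batches_per_rank * per_device_batch

print(samples_per_rank, batches_per_rank, fallback_estimate)
# → 28202 3526 28208
```

Under these assumptions the reported number is a per-process estimate rounded up to a whole batch, not the size of the full dataset.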

For additional reference, on an EC2 p4d instance (Nvidia A100 GPUs), when using DDP with the Trainer from the transformers package, "Num examples" is reported as expected: 56403.
