huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Training output reports incorrect num examples when using DDP #683

Open syl-taylor-aws opened 2 months ago

syl-taylor-aws commented 2 months ago

System Info

AWS EC2 instance: trn1.32xlarge
OS: Ubuntu 22.04.4 LTS

Platform:

- Platform: Linux-6.5.0-1023-aws-x86_64-with-glibc2.35
- Python version: 3.10.12

Python packages:

- `optimum-neuron` version: 0.0.24
- `neuron-sdk` version: 2.19.1
- `optimum` version: 1.20.0
- `transformers` version: 4.41.1
- `huggingface_hub` version: 0.24.5
- `torch` version: 2.1.2+cu121
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 2.0.2335
- `neuronx-cc` version: 2.14.227.0+2d4f85be
- `neuronx-distributed` version: 0.8.0
- `neuronx-hwm` version: NA
- `torch-neuronx` version: 2.1.2.2.1.0
- `torch-xla` version: 2.1.2
- `transformers-neuronx` version: 0.10.0.21

Neuron Driver:
aws-neuronx-collectives/unknown,now 2.21.46.0-69b77134b amd64 [installed]
aws-neuronx-dkms/unknown,now 2.17.17.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.21.41.0-fb1705f5f amd64 [installed]
aws-neuronx-tools/unknown,now 2.18.3.0 amd64 [installed]

Who can help?

No response

Reproduction (minimal, reproducible, runnable)

I can't share the project code, which uses a dataset of type Dataset with a length of 56403, but I wrote a simpler case that shows the same issue.

Command: torchrun --nproc_per_node=2 issue.py

Code (issue.py)

```python
import torch
from transformers import RobertaForCausalLM
from optimum.neuron import NeuronTrainer as Trainer
from optimum.neuron import NeuronTrainingArguments as TrainingArguments


class CustomDataset(torch.utils.data.Dataset):
    def __getitem__(self, index):
        return {
            "input_ids": torch.randint(0, 50265, (512,)),
            "labels": torch.randint(0, 50265, (512,))
        }

    def __len__(self):
        return 56403


dataset = CustomDataset()
model = RobertaForCausalLM.from_pretrained("roberta-base")
training_args = TrainingArguments(output_dir="./model", max_steps=100)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)
trainer.train()

# note the output line: "[INFO|trainers.py:] >> Num examples = ""
# the issue is at https://github.com/huggingface/optimum-neuron/blob/v0.0.24/optimum/neuron/trainers.py#L700
# currently "self.num_examples(train_dataloader)" = 28208
# should maybe be "self.num_examples(train_dataloader._loader)" = 56403 (expected)
```

When calling trainer.train(), we get the output:

[INFO|trainers.py:] <timestamp> >> ***** Running training *****
[INFO|trainers.py:] <timestamp> >>   Num examples = 28,208
...

Num examples should be 56403, but a different number is reported when using DDP: 28208 in this test.

"Num examples" is calculated by Trainer's num_examples() in https://github.com/huggingface/transformers/blob/v4.41.1/src/transformers/trainer.py#L1408 which is called by https://github.com/huggingface/optimum-neuron/blob/v0.0.24/optimum/neuron/trainers.py#L700 .

The issue doesn't occur when training without DDP. In that case, num_examples() returns the expected number.

With DDP, dataloader is an MpDeviceLoader, and dataloader.dataset raises AttributeError("'MpDeviceLoader' object has no attribute 'dataset'"). This makes num_examples() return an unexpected number at https://github.com/huggingface/transformers/blob/v4.41.1/src/transformers/trainer.py#L1420 . However, dataloader._loader is available, and len(dataloader._loader.dataset) is 56403. Perhaps we should call self.num_examples(train_dataloader._loader)?
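The suggestion above could be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual optimum-neuron or transformers code: `count_examples`, `InnerLoader`, and `DeviceLoaderWrapper` are hypothetical stand-ins. The only detail taken from the report is that the wrapper has no `dataset` attribute while its private `_loader` attribute does. The fallback branch assumes the estimate is the number of batches times the per-device batch size.

```python
class InnerLoader:
    """Hypothetical stand-in for a regular DataLoader: exposes .dataset."""

    def __init__(self, dataset, num_batches):
        self.dataset = dataset
        self._num_batches = num_batches

    def __len__(self):
        return self._num_batches


class DeviceLoaderWrapper:
    """Hypothetical stand-in for a wrapper that only exposes ._loader."""

    def __init__(self, loader):
        self._loader = loader

    def __len__(self):
        return len(self._loader)


def count_examples(dataloader, per_device_batch_size):
    # Unwrap once if the loader keeps the real DataLoader in `_loader`.
    inner = getattr(dataloader, "_loader", dataloader)
    try:
        return len(inner.dataset)
    except (AttributeError, TypeError):
        # Assumed fallback: estimate from the number of batches.
        return len(dataloader) * per_device_batch_size


wrapped = DeviceLoaderWrapper(InnerLoader(dataset=range(56403), num_batches=3526))
print(count_examples(wrapped, per_device_batch_size=8))
```

Relying on a private attribute such as `_loader` is fragile, so a real fix might prefer keeping a reference to the unwrapped dataloader before it is wrapped.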

Expected behavior

"Num examples" is not reported correctly when training with DDP on AWS Trainium/Inferentia instances. In the reproducible code, it should be 56403 (len of dataset), but it returns 28208 based on an exception occurring in num_examples() in the transformers package.

For additional reference, on an EC2 p4d instance (NVIDIA A100 GPUs), when using DDP with the Trainer from the transformers package, "Num examples" is reported as expected: 56403.
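The value 28,208 is consistent with the fallback path estimating the count as the number of batches on one rank times the per-device batch size. The sketch below reproduces that arithmetic under assumptions not stated in the report: a per-device batch size of 8 (the `TrainingArguments` default), a distributed sampler that rounds each rank's share up, and a dataloader that keeps the last partial batch.

```python
import math

dataset_len = 56403
world_size = 2        # from: torchrun --nproc_per_node=2
per_device_batch = 8  # assumption: default per_device_train_batch_size

# Each rank receives its share of the dataset, rounded up.
samples_per_rank = math.ceil(dataset_len / world_size)

# Number of batches on one rank, keeping the last partial batch.
batches_per_rank = math.ceil(samples_per_rank / per_device_batch)

# Assumed fallback estimate: batches on this rank times the batch size.
fallback_estimate = batches_per_rank * per_device_batch

print(samples_per_rank, batches_per_rank, fallback_estimate)
```

Under these assumptions the estimate counts only one rank's padded share, which would explain why it comes out at roughly half the dataset length.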

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.