syl-taylor-aws opened this issue 3 months ago
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!
Same issue, DDP breaks when using an `IterableDataset`. `DistributedSampler` doesn't seem to work for `IterableDataset`. Perhaps a fix might be to use `split_dataset_by_node` instead of `DistributedSampler`, i.e.:

```python
from datasets.distributed import split_dataset_by_node

dataset_on_curr_node = split_dataset_by_node(data_loader.dataset, rank=rank, world_size=num_replicas)
```

and to pass this to the `DataLoader` without any sampler.
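To make the suggestion concrete, here is a minimal standalone sketch (not from the original comment; the toy dataset, `rank`, `world_size`, and `batch_size` values are assumptions for illustration):

```python
import torch
from datasets import Dataset
from datasets.distributed import split_dataset_by_node

# Assumed values for illustration; in a real run these would come from
# torch.distributed.get_rank() / torch.distributed.get_world_size()
rank, world_size = 0, 2

# A small IterableDataset (no __len__) standing in for a streaming dataset
iterable_ds = Dataset.from_dict({"input_ids": [[1, 2, 3]] * 16}).to_iterable_dataset()

# Shard the stream across ranks instead of using DistributedSampler
dataset_on_curr_node = split_dataset_by_node(iterable_ds, rank=rank, world_size=world_size)

# No sampler is passed: each rank iterates over its own shard
loader = torch.utils.data.DataLoader(dataset_on_curr_node, batch_size=4)
for batch in loader:
    print(batch["input_ids"])
```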
Feature request
Enable use of IterableDataset when training with NeuronTrainer and DDP. Or is there a design limitation that prevents this?
I can't share the project code, but see below a simpler case that produces the same issue. DistributedSampler expects a dataset with a known length, which an IterableDataset doesn't have by design.
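A small standalone repro of that length mismatch, outside the trainer (this snippet is mine, not part of the original report, and uses plain PyTorch only):

```python
import torch
from torch.utils.data import IterableDataset
from torch.utils.data.distributed import DistributedSampler

class Stream(IterableDataset):
    # An endless stream: no __len__, by design
    def __iter__(self):
        while True:
            yield torch.zeros(1)

# DistributedSampler computes per-rank sample counts from len(dataset) in its
# constructor, so it fails immediately for an IterableDataset
try:
    DistributedSampler(Stream(), num_replicas=2, rank=0)
except TypeError as err:
    print(err)  # object of type 'Stream' has no len()
```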
Setup
OS: Ubuntu 22.04.4 LTS (kernel 6.5.0-1023-aws)
apt packages:

> aws-neuronx-collectives/unknown,now 2.21.46.0-69b77134b amd64 [installed]
> aws-neuronx-dkms/unknown,now 2.17.17.0 amd64 [installed]
> aws-neuronx-runtime-lib/unknown,now 2.21.41.0-fb1705f5f amd64 [installed]
> aws-neuronx-tools/unknown,now 2.18.3.0 amd64 [installed]

pip packages:

> aws-neuronx-runtime-discovery==2.9
> neuronx-cc==2.14.227.0+2d4f85be
> libneuronxla==2.0.2335
> torch==2.1.2
> torch-neuronx==2.1.2.2.1.0
> torch-xla==2.1.2
> transformers==4.41.1
> accelerate==0.29.2
> optimum-neuron==0.0.24 (also tested 0.0.25.dev0)

Command:

```
torchrun --nproc_per_node=2 issue.py
```
Code (issue.py)
```python
import torch
from transformers import RobertaForCausalLM
from optimum.neuron import NeuronTrainer as Trainer
from optimum.neuron import NeuronTrainingArguments as TrainingArguments

class CustomIterator:
    def __next__(self):
        return {
            "input_ids": torch.randint(0, 50265, (512,)),
            "labels": torch.randint(0, 50265, (512,))
        }

class CustomDataset(torch.utils.data.IterableDataset):
    def __iter__(self):
        return CustomIterator()

dataset = CustomDataset()
model = RobertaForCausalLM.from_pretrained("roberta-base")
training_args = TrainingArguments(output_dir="./model", max_steps=100)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)
trainer.train()
```

Issue
```
Traceback (most recent call last):
  File "/home/ubuntu/issue.py", line 29, in
```

Motivation
Have a project for distributed training on Trainium with DDP that requires use of HuggingFace's IterableDataset (when `streaming=True` in `load_dataset()` from load.py of package datasets==2.19.0).

Your contribution
N/A. I noticed that on Nvidia A100 GPUs (with the transformers Trainer) it uses `accelerate.data_loader.DataLoaderDispatcher` and does not use `DistributedSampler`.
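For context, a rough sketch of how that observation can be reproduced (this snippet is mine, not part of the report; it reuses the dataset and model from the repro above and assumes a GPU machine with `transformers` and `accelerate` installed):

```python
import torch
from transformers import RobertaForCausalLM, Trainer, TrainingArguments

# Same IterableDataset as in the repro above
class CustomIterator:
    def __next__(self):
        return {
            "input_ids": torch.randint(0, 50265, (512,)),
            "labels": torch.randint(0, 50265, (512,))
        }

class CustomDataset(torch.utils.data.IterableDataset):
    def __iter__(self):
        return CustomIterator()

trainer = Trainer(
    model=RobertaForCausalLM.from_pretrained("roberta-base"),
    args=TrainingArguments(output_dir="./model", max_steps=100),
    train_dataset=CustomDataset(),
)

# Per the observation above, on A100 GPUs this reports
# accelerate.data_loader.DataLoaderDispatcher rather than a DataLoader built
# around DistributedSampler
print(type(trainer.get_train_dataloader()))
```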