pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License
1.13k stars 152 forks source link

Dataloader2 with FullSyncIterDataPipe throws error during initilization #1190

Open chenxingyu-cs opened 1 year ago

chenxingyu-cs commented 1 year ago

🐛 Describe the bug

Hi, we found some strange during using Dataloader2. Here's some details about the issue.

Can you help provide some context on what could be the root cause and how to fix this? Thanks!

Log:


  | 2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1633, in train
-- | -- | --
  | 2023-06-08T08:51:15.973-07:00 | return inner_training_loop(
  | 2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1979, in _inner_training_loop
  | 2023-06-08T08:51:15.973-07:00 | self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  | 2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2236, in _maybe_log_save_evaluate
  | 2023-06-08T08:51:15.973-07:00 | metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  | 2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2932, in evaluate
  | 2023-06-08T08:51:15.973-07:00 | output = eval_loop(
  | 2023-06-08T08:51:15.973-07:00 | File "/workspace/mfive/mfive/trainer.py", line 236, in evaluation_loop
  | 2023-06-08T08:51:15.973-07:00 | for step, inputs in enumerate(dataloader):
  | 2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/torchdata/dataloader2/dataloader2.py", line 46, in __next__
  | 2023-06-08T08:51:15.973-07:00 | next_val = next(self.dataloader._datapipe_iter) # type: ignore[arg-type]
  | 2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 173, in wrap_generator
  | 2023-06-08T08:51:15.973-07:00 | response = gen.send(None)
  | 2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/torchdata/datapipes/iter/util/distributed.py", line 178, in __iter__
  | 2023-06-08T08:51:15.973-07:00 | self._process_group = dist.new_group(backend="gloo")
  | 2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3520, in new_group
  | 2023-06-08T08:51:15.973-07:00 | pg = _new_process_group_helper(
  | 2023-06-08T08:51:15.973-07:00 | File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1009, in _new_process_group_helper
  | 2023-06-08T08:51:15.973-07:00 | backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
  | 2023-06-08T08:51:15.973-07:00 | RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:176] bind: Address already in use
  | 2023-06-08T08:51:15.973-07:00 | This exception is thrown by __iter__ of FullSyncIterDataPipe(datapipe=CollatorIterDataPipe, timeout=1800)

Versions

Versions of relevant libraries:
[pip3] flake8==6.0.0
[pip3] mypy==0.991
[pip3] mypy-boto3-batch==1.26.103
[pip3] mypy-boto3-ec2==1.26.136
[pip3] mypy-boto3-iam==1.26.97
[pip3] mypy-boto3-s3==1.26.127
[pip3] mypy-boto3-sagemaker==1.26.141
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.3
[pip3] torch==2.0.1
[pip3] torch-tb-profiler==0.4.1
[pip3] torchdata==0.6.1
[pip3] torchmetrics==0.11.4
[pip3] torchsnapshot-nightly==2023.3.15
[pip3] torchvision==0.15.2
[pip3] torchx-nightly==2023.5.25
[pip3] triton==2.0.0
[conda] numpy                     1.24.3                   pypi_0    pypi
[conda] torch                     2.0.1                    pypi_0    pypi
[conda] torch-tb-profiler         0.4.1                    pypi_0    pypi
[conda] torchdata                 0.6.1                    pypi_0    pypi
[conda] torchmetrics              0.11.4                   pypi_0    pypi
[conda] torchsnapshot-nightly     2023.3.15                pypi_0    pypi
[conda] torchvision               0.15.2                   pypi_0    pypi
[conda] torchx-nightly            2023.5.25                pypi_0    pypi
[conda] triton                    2.0.0                    pypi_0    pypi
chenxingyu-cs commented 1 year ago

@ejguan Hi can you help provide some insights you have? Great thanks!

ejguan commented 1 year ago

Are you running multiple DPP at the same time?

chenxingyu-cs commented 1 year ago

@ejguan I'm only running one DDP job. The DDP job is initialized by torchx. And I got these errors while running job on AWS Batch and SageMaker, where I believe all the instances are isolated and there should be no other job running.