pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License

FSSpecFileOpenerIterDataPipe raises a `NoCredentialsError` on large dataloader `num_worker` #906

Open kiukchung opened 2 years ago

kiukchung commented 2 years ago

🐛 Describe the bug

There are two issues (both are reproducible using the script below):

  1. FSSpecFileOpenerIterDataPipe gets stuck if one iteratively creates DataLoader(num_workers=0, ...) and then DataLoader(num_workers=greater_than_zero). In practice this isn't much of an issue, since a trainer typically creates the DataLoader once, but it means benchmark runs that vary the DataLoader's num_workers can't be iterated from the same parent process.
  2. NoCredentialsError when using FSSpecFileOpenerIterDataPipe with a large (>64) DataLoader num_workers.

Repro Script

from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

if __name__ == "__main__":
    print("=== BEGIN REPRO TEST ===")

    data_s3url = "s3://<REPLACE_WITH_YOUR_S3_URL>"
    # workers = [0, 1, 2] # <-- use this to repro stuckness. You'll observe that the loop below will get stuck when i=1
    workers = [1, 2, 4, 8, 16, 32, 48, 64, 128]
    for i in workers:
        dataset = (
            IterableWrapper([data_s3url])
            .list_files_by_fsspec()
            .open_files_by_fsspec()
            .readlines(return_path=False)
        )
        try:
            for batch in DataLoader(
                dataset,
                batch_size=max(workers) * 2,
                num_workers=i,
            ):
                break
            print(f"Succeeded running with num_workers={i}")
        except Exception as e:
            print(f"Error running with num_workers={i}. Exception: {e}")

    print("=== END REPRO TEST ===")

Exception

Traceback (most recent call last):
  File "/home/ubuntu/workspace/mfive/mfive/examples/data/repro.py", line 28, in <module>
    for batch in DataLoader(
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/_utils.py", line 460, in reraise
    raise RuntimeError(msg) from None
RuntimeError: Caught NoCredentialsError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
    response = gen.send(None)
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/datapipes/datapipe.py", line 344, in __iter__
    yield from self._datapipe
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
    response = gen.send(None)
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torchdata/datapipes/iter/util/plain_text_reader.py", line 121, in __iter__
    for path, file in self.source_datapipe:
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
    response = gen.send(None)
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torchdata/datapipes/iter/load/fsspec.py", line 137, in __iter__
    for file_uri in self.source_datapipe:
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
    response = gen.send(None)
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torchdata/datapipes/iter/load/fsspec.py", line 85, in __iter__
    for file_name in fs.ls(path):
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/fsspec/asyn.py", line 91, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/fsspec/asyn.py", line 71, in sync
    raise return_result
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/fsspec/asyn.py", line 25, in _runner
    result[0] = await coro
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/s3fs/core.py", line 810, in _ls
    files = await self._lsdir(path, refresh)
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/s3fs/core.py", line 593, in _lsdir
    async for i in it:
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/aiobotocore/paginate.py", line 32, in __anext__
    response = await self._make_request(current_kwargs)
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/aiobotocore/client.py", line 173, in _make_api_call
    http, parsed_response = await self._make_request(
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/aiobotocore/client.py", line 193, in _make_request
    return await self._endpoint.make_request(operation_model, request_dict)
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/aiobotocore/endpoint.py", line 77, in _send_request
    request = await self.create_request(request_dict, operation_model)
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/aiobotocore/endpoint.py", line 70, in create_request
    await self._event_emitter.emit(event_name, request=request,
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/aiobotocore/hooks.py", line 27, in _emit
    response = await handler(**kwargs)
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/aiobotocore/signers.py", line 16, in handler
    return await self.sign(operation_name, request)
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/aiobotocore/signers.py", line 63, in sign
    auth.add_auth(request)
  File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/botocore/auth.py", line 378, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials
This exception is thrown by __iter__ of FSSpecFileListerIterDataPipe(kwargs={}, masks='')

Versions

  1. torchdata-0.4.1
  2. torch-1.12.1
  3. fsspec-2022.1.0
  4. s3fs-2022.1.0
ejguan commented 2 years ago

Thanks for opening the issue. I am able to reproduce the first issue: the subsequent DataLoader gets stuck when the prior DataLoader has num_workers=0. I am investigating it now.

However, I am not able to reproduce the second issue. I am not sure why a larger number of processes would interfere with credential lookup. I suspect this is tied to boto3 not handling concurrent credential reads correctly.

ejguan commented 2 years ago

For Issue 1, I have been able to boil it down to the fork start method. If I set multiprocessing_context="spawn" or "forkserver" for the subsequent DataLoader, there is no hanging issue.
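A minimal sketch of that workaround, using a plain list dataset as a stand-in for the s3 datapipe in the repro script (the helper name and sizes here are illustrative):

```python
from torch.utils.data import DataLoader

def first_batch(data, num_workers):
    # With num_workers > 0, force the "spawn" start method so worker
    # processes do not inherit state left behind by an earlier fork-based
    # (or in-process, num_workers=0) DataLoader.
    loader = DataLoader(
        data,
        batch_size=4,
        num_workers=num_workers,
        multiprocessing_context="spawn" if num_workers > 0 else None,
    )
    return next(iter(loader))

if __name__ == "__main__":
    # num_workers=0 first, then a worker-backed loader: with "spawn" the
    # second loader no longer hangs.
    for n in (0, 2):
        print(f"num_workers={n}: batch={first_batch(list(range(16)), n)}")
```

multiprocessing_context also accepts a multiprocessing context object (e.g. multiprocessing.get_context("spawn")) if finer control is needed.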

tensorcopy commented 1 year ago

Hi, regarding issue 2, I am using FSSpecFileOpenerIterDataPipe with 22 workers (and 4 GPUs) to load data from s3, and I am also getting NoCredentialsError. Any progress on this issue? Thanks!

kunimatsu-tri commented 8 months ago

I ran into the same issue and found that setting AWS_METADATA_SERVICE_NUM_ATTEMPTS helps mitigate it, as mentioned here.
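That mitigation can be applied from Python before any DataLoader workers are created; the retry and timeout values below are illustrative, not recommendations:

```python
import os

# Raise botocore's instance-metadata (IMDS) credential-lookup retry count
# and timeout before any DataLoader worker processes are created, so a
# slow or throttled metadata service is retried instead of immediately
# failing with NoCredentialsError. Values here are illustrative.
os.environ.setdefault("AWS_METADATA_SERVICE_NUM_ATTEMPTS", "5")
os.environ.setdefault("AWS_METADATA_SERVICE_TIMEOUT", "10")
```

The same variables can instead be exported in the shell that launches the training script.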