kiukchung opened this issue 2 years ago
Thanks for opening the issue.
I am able to reproduce the first issue: the subsequent `DataLoader` gets stuck when the prior `DataLoader` has `num_workers=0`. I am investigating it now.

However, I am not able to reproduce the second issue. I am not sure why a larger number of processes would interfere with the credentials. I suspect this is tied to boto3 not handling concurrent credential reads correctly.

For issue 1, I have been able to boil it down to the `fork` start method. If I set `multiprocessing_context="spawn"` or `"forkserver"` for the subsequent `DataLoader`, there is no hang.
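For anyone hitting the hang, here is a minimal sketch of that workaround; the datapipe construction and the S3 URL are placeholders, not the reporter's actual pipeline:

```python
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

def no_collate(batch):
    # Pass items through untouched; (path, stream) tuples don't need collation.
    return batch

# Placeholder pipeline; any fsspec-backed datapipe follows the same pattern.
datapipe = IterableWrapper(["s3://my-bucket/shard-0.tar"]).open_files_by_fsspec(mode="rb")

# First DataLoader runs in-process (num_workers=0).
dl0 = DataLoader(datapipe, num_workers=0, collate_fn=no_collate)

# Using "spawn" (or "forkserver") instead of the default "fork" start method
# for the subsequent multi-worker DataLoader avoids the hang described above.
dl1 = DataLoader(
    datapipe,
    num_workers=2,
    multiprocessing_context="spawn",
    collate_fn=no_collate,
)
```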
Hi, regarding issue 2: I am using `FSSpecFileOpenerIterDataPipe` with 22 workers (and 4 GPUs) to load data from S3, and I am also getting `NoCredentialsError`. Any progress on this issue? Thanks!
I ran into the same issue and found that setting `AWS_METADATA_SERVICE_NUM_ATTEMPTS` helps mitigate it, as mentioned here.
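For reference, a sketch of that mitigation, assuming the variable is set before the DataLoader creates its workers so they inherit it (the retry count is arbitrary):

```python
import os

# Tell botocore to retry the instance-metadata credential lookup instead of
# failing immediately when many worker processes hit it at once. Must be set
# before the DataLoader forks/spawns its workers; "5" is an arbitrary value.
os.environ["AWS_METADATA_SERVICE_NUM_ATTEMPTS"] = "5"
```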
🐛 Describe the bug
There are two issues (both are reproducible using the script below):

1. `FSSpecFileOpenerIterDataPipe` gets stuck if one tries to iteratively create `DataLoader(num_workers=0, ...)` and then `DataLoader(num_workers=greater_than_zero)`. Practically speaking this isn't much of an issue, since typically a trainer will create the dataloader once, but for benchmarking it means that we can't iterate benchmark runs that change dataloader `num_workers` from the same parent process.
2. `NoCredentialsError` when using `FSSpecFileOpenerIterDataPipe` with a large (>64) dataloader `num_workers`.

Repro Script
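The original repro script isn't reproduced here; the following is a minimal sketch of the pattern described above (the S3 URL, worker counts, and helper names are hypothetical):

```python
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

def read_prefix(path_and_stream):
    # Read a few bytes so each item is a plain, picklable value.
    path, stream = path_and_stream
    return path, stream.read(16)

def no_collate(batch):
    return batch

datapipe = (
    IterableWrapper(["s3://my-bucket/data/shard-0.tar"])  # hypothetical S3 object
    .open_files_by_fsspec(mode="rb")
    .map(read_prefix)
)

# Issue 1: iterating first with num_workers=0 and then with num_workers>0 from
# the same parent process leaves the second DataLoader hanging.
for num_workers in (0, 2):
    for _ in DataLoader(datapipe, num_workers=num_workers, collate_fn=no_collate):
        pass

# Issue 2: a large worker count (>64) intermittently raises NoCredentialsError.
for _ in DataLoader(datapipe, num_workers=96, collate_fn=no_collate):
    pass
```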
Exception
Versions
torchdata-0.4.1
torch-1.12.1
fsspec-2022.1.0
s3fs-2022.1.0