Open austinmw opened 2 years ago
Did you start using Slurm?
Not Slurm, but I am using MPI.
Do you have shared storage on all ranks? If the storage is not shared, other ranks cannot access the dataset downloaded by the master rank. You can download and extract the dataset manually from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz and place it in the dataset path.
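To tell whether a given rank can actually see the dataset, a small check like the one below could be run on every rank before training starts. This is only a sketch: the function names are hypothetical, `data/cifar10` in the usage note is an assumed dataset path, and the rank variables are the ones set by `torchrun` (`RANK`) and Open MPI (`OMPI_COMM_WORLD_RANK`). The file names are those contained in the extracted `cifar-10-python.tar.gz` archive.

```python
import os
from pathlib import Path
from typing import List

# Files found inside the extracted archive (directory "cifar-10-batches-py").
EXPECTED_FILES = [f"data_batch_{i}" for i in range(1, 6)] + ["test_batch", "batches.meta"]


def missing_cifar10_files(data_root: str) -> List[str]:
    """Return the expected CIFAR-10 files that this process cannot see."""
    batch_dir = Path(data_root) / "cifar-10-batches-py"
    return [name for name in EXPECTED_FILES if not (batch_dir / name).is_file()]


def current_rank() -> int:
    """Global rank from common launcher variables; 0 if none is set."""
    for var in ("RANK", "OMPI_COMM_WORLD_RANK"):
        value = os.environ.get(var)
        if value is not None:
            return int(value)
    return 0


def report(data_root: str) -> str:
    """One line per rank, suitable for printing at the top of a training script."""
    missing = missing_cifar10_files(data_root)
    status = "all files present" if not missing else f"missing {missing}"
    return f"rank {current_rank()}: {status}"
```

Calling `print(report("data/cifar10"))` on each rank would show which ranks, if any, lack the files. A rank that reports missing files is on a node that never received the dataset.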
@mzr1996 I don't have shared storage. The data is streamed from S3 (using SageMaker's FastFile mode)
"FastFile mode – SageMaker exposes a dataset residing in Amazon S3 as a POSIX file system on the training instance. Dataset files are streamed from Amazon S3 on demand as your training script reads them."
This seems to work fine with distributed training in MMDetection 3.x, so I'm not sure what MMClassification does differently.
Can MMDetection automatically download datasets to Amazon S3? Please tell me the dataset type, and I will check it.
Oh sorry, so the two things I tried were:
1. Using CIFAR10 with the default paths and letting it auto-download. No FastFile mode here; the download goes to a non-shared storage location (the default config path for the CIFAR10 download).
2. Using CIFAR10, but downloading it manually, uploading it to S3, and then using FastFile mode:
mkdir cifar10; cd cifar10
wget -q https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xvf cifar-10-python.tar.gz
rm cifar-10-python.tar.gz
aws s3 sync . s3://path/to/data/cifar10
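For the first attempt (auto-download to non-shared storage), one possible workaround is to let one process per node fetch the data while the other processes on that node wait for it. The sketch below illustrates that pattern only; it is not how MMClassification handles downloads. `prepare_on_each_node`, the `download` callback, and the `.download_complete` marker file are hypothetical names, and the local-rank variables are those set by `torchrun` (`LOCAL_RANK`) and Open MPI (`OMPI_COMM_WORLD_LOCAL_RANK`).

```python
import os
import time
from pathlib import Path
from typing import Callable


def get_local_rank() -> int:
    """Rank of this process within its node; 0 if no launcher variable is set."""
    for var in ("LOCAL_RANK", "OMPI_COMM_WORLD_LOCAL_RANK"):
        value = os.environ.get(var)
        if value is not None:
            return int(value)
    return 0


def prepare_on_each_node(root: str,
                         download: Callable[[Path], None],
                         timeout_s: float = 600.0,
                         poll_s: float = 1.0) -> Path:
    """Download once per node (local rank 0); other local ranks wait for a marker."""
    root_path = Path(root)
    marker = root_path / ".download_complete"  # hypothetical marker file

    if get_local_rank() == 0:
        if not marker.exists():
            root_path.mkdir(parents=True, exist_ok=True)
            download(root_path)  # caller-supplied: fetch and extract into root_path
            marker.touch()       # written only after the download has finished
        return root_path

    # Non-zero local ranks poll the node-local disk until the marker appears.
    deadline = time.monotonic() + timeout_s
    while not marker.exists():
        if time.monotonic() > deadline:
            raise TimeoutError(f"{marker} did not appear within {timeout_s} s")
        time.sleep(poll_s)
    return root_path
```

The marker file is used instead of a collective barrier so that the sketch does not depend on the process group being initialized before the dataset is built. It assumes that all processes on a node share the same local disk path.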
Branch
1.x branch (1.0.0rc2 or other 1.x version)
Describe the bug
Training on a single instance worked fine, but when I try to train with 2 nodes I get the error:
Environment
Other information
I've tried CIFAR10 both with the automatic download setting and with a manual download and an explicitly provided path.