open-mmlab / mmpretrain

OpenMMLab Pre-training Toolbox and Benchmark
https://mmpretrain.readthedocs.io/en/latest/
Apache License 2.0

[Bug] AssertionError: Download failed or shared storage is unavailable (in distributed training setting) #1188

Open austinmw opened 1 year ago

austinmw commented 1 year ago

Branch

1.x branch (1.0.0rc2 or other 1.x version)

Describe the bug

Training on a single instance worked fine, but when I try to train with 2 nodes I get the error:

```
[1,mpirank:0,algo-1]:Extracting data/cifar10/cifar-10-python.tar.gz to data/cifar10
[1,mpirank:9,algo-2]:Traceback (most recent call last):
[1,mpirank:9,algo-2]:  File "/opt/conda/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[1,mpirank:9,algo-2]:    obj = obj_cls(**args)  # type: ignore
[1,mpirank:9,algo-2]:  File "/opt/mmclassification/mmcls/datasets/cifar.py", line 65, in __init__
[1,mpirank:9,algo-2]:    super().__init__(
[1,mpirank:9,algo-2]:  File "/opt/mmclassification/mmcls/datasets/base_dataset.py", line 92, in __init__
[1,mpirank:9,algo-2]:    super().__init__(
[1,mpirank:9,algo-2]:  File "/opt/conda/lib/python3.8/site-packages/mmengine/dataset/base_dataset.py", line 247, in __init__
[1,mpirank:9,algo-2]:    self.full_init()
[1,mpirank:9,algo-2]:  File "/opt/mmclassification/mmcls/datasets/base_dataset.py", line 169, in full_init
[1,mpirank:9,algo-2]:    super().full_init()
[1,mpirank:9,algo-2]:  File "/opt/conda/lib/python3.8/site-packages/mmengine/dataset/base_dataset.py", line 298, in full_init
[1,mpirank:9,algo-2]:    self.data_list = self.load_data_list()
[1,mpirank:9,algo-2]:  File "/opt/mmclassification/mmcls/datasets/cifar.py", line 98, in load_data_list
[1,mpirank:9,algo-2]:    assert self._check_integrity(), \
... (rank 10 on algo-2 prints the same traceback) ...
[1,mpirank:10,algo-2]:AssertionError: Download failed or shared storage is unavailable. Please download the dataset manually through https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz.
```

Environment

```
sys.platform: linux
Python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3: Tesla V100-SXM2-16GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.12.1+cu113
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.3.2 (built against CUDA 11.5)
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF
TorchVision: 0.13.1+cu113
OpenCV: 4.6.0
MMEngine: 0.2.0
```

Other information

I've tried CIFAR10 both with the automatic download setting and by downloading it manually and providing the path.

Ezra-Yu commented 1 year ago

Did you start the training with Slurm?

austinmw commented 1 year ago

Not Slurm; I'm using MPI.

mzr1996 commented 1 year ago

Do you have shared storage on all ranks? If the storage is not shared, other ranks cannot access the dataset downloaded by the master rank. You can download and extract the dataset manually from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz and place it in the dataset path.
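
For reference, the download logic follows the usual "rank 0 downloads, everyone else waits at a barrier, then every rank checks its own storage" pattern. Below is a simplified sketch, a paraphrase rather than the literal `mmcls/datasets/cifar.py` source; `download_and_extract` is a hypothetical placeholder for the download helper.

```python
# Simplified sketch of the pattern behind the assertion. This paraphrases
# mmcls/datasets/cifar.py; `download_and_extract` is a hypothetical helper.
import os

from mmengine.dist import barrier, get_rank


def ensure_cifar10(data_prefix: str) -> None:
    marker = os.path.join(data_prefix, 'cifar-10-batches-py')
    if not os.path.isdir(marker):
        if get_rank() == 0:
            download_and_extract(data_prefix)  # only the master rank downloads
        barrier()  # all other ranks wait for rank 0 to finish
    # Every rank then re-checks its *own* filesystem. Without shared
    # storage, ranks on the second node still see nothing, which is the
    # assertion in your traceback.
    assert os.path.isdir(marker), (
        'Download failed or shared storage is unavailable.')
```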

austinmw commented 1 year ago

@mzr1996 I don't have shared storage. The data is streamed from S3 using SageMaker's FastFile mode:

"FastFile mode – SageMaker exposes a dataset residing in Amazon S3 as a POSIX file system on the training instance. Dataset files are streamed from Amazon S3 on demand as your training script reads them."

ref: https://aws.amazon.com/blogs/machine-learning/choose-the-best-data-source-for-your-amazon-sagemaker-training-job/

This seems to work fine with distributed training in MMDetection 3.x, so I'm not sure what MMClassification does differently.
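
If it helps, here is a small probe that should confirm whether the ranks share a filesystem (a sketch using mmengine's dist helpers; the sentinel path is an arbitrary choice):

```python
# Probe sketch: rank 0 drops a sentinel file; after a barrier, every rank
# reports whether it can see it. If ranks on the second node print False,
# the storage is not shared.
import os

from mmengine.dist import barrier, get_rank

SENTINEL = 'data/.shared_storage_probe'  # arbitrary test path

if get_rank() == 0:
    os.makedirs(os.path.dirname(SENTINEL), exist_ok=True)
    open(SENTINEL, 'w').close()
barrier()
print(f'rank {get_rank()}: sentinel visible = {os.path.exists(SENTINEL)}')
```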

mzr1996 commented 1 year ago

Can MMDetection automatically download datasets to AWS S3? Please tell me which dataset type you used, and I will check it.

austinmw commented 1 year ago

Oh sorry, so the two things I tried were:

  1. Using CIFAR10 with the default paths and letting it auto-download (no FastFile mode; the download goes to a non-shared local path, the default config location for CIFAR-10).

  2. Using CIFAR10 but downloading it manually and uploading it to S3, then using FastFile mode (commands below; a config sketch follows them):

```bash
# Download and extract CIFAR-10 locally, then sync it to S3.
mkdir cifar10; cd cifar10
wget -q https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xvf cifar-10-python.tar.gz   # extracts cifar-10-batches-py/
rm cifar-10-python.tar.gz
cd ..                             # sync from the parent so the path resolves
aws s3 sync cifar10 s3://path/to/data/cifar10
```
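
With the data in S3, the training job mounts it through FastFile mode and the config points at the mounted channel. A sketch of the dataset settings (assuming the 1.x config fields; the channel name `cifar10` and its `/opt/ml/input/data/cifar10` mount path follow SageMaker's convention and need adjusting to your setup):

```python
# Dataset settings sketch (mmcls 1.x config style). The prefix assumes a
# SageMaker input channel named `cifar10`; pipelines omitted for brevity.
train_dataloader = dict(
    batch_size=16,
    num_workers=2,
    dataset=dict(
        type='CIFAR10',
        data_prefix='/opt/ml/input/data/cifar10',  # must contain cifar-10-batches-py/
        test_mode=False,
    ),
    sampler=dict(type='DefaultSampler', shuffle=True),
)
```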