open-mmlab / mmcv

OpenMMLab Computer Vision Foundation
https://mmcv.readthedocs.io/en/latest/
Apache License 2.0

FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0. #2894

Open zbl929 opened 1 year ago

zbl929 commented 1 year ago

Prerequisite

Environment

mmseg=0.23.0 mmcv-full=1.5.0

Reproduces the problem - code sample

.

Reproduces the problem - command or script

sh dist_train.sh ../configs/next_vit/nextvit_segfomer.py 4

Reproduces the problem - error message

I have been training mmseg on a custom dataset, and until now training ran fairly smoothly. Today I updated the images in the dataset; the dataset type is unchanged and only data_root was replaced. Training then failed on this assertion: assert osp.exists(self.img_dir) and self.split is not None

I then commented out that assertion:

2023-08-09 00:01:29,055 - mmseg - INFO - workflow: [('train', 1), ('val', 1)], max: 40000 iters
2023-08-09 00:01:29,055 - mmseg - INFO - Checkpoints will be saved to /home/zhangbulin/mmseg-0.23.0/tools/newlogs/nextvit_seghead by HardDiskBackend.
Traceback (most recent call last):
  File "./train.py", line 254, in <module>
    main()
  File "./train.py", line 243, in main
    train_segmentor(
  File "/home/zhangbulin/mmseg-0.23.0/mmseg/apis/train.py", line 175, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 59, in train
    data_batch = next(data_loader)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 32, in __next__
    data = next(self.iter_loader)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/zhangbulin/mmseg-0.23.0/mmseg/datasets/custom.py", line 215, in __getitem__
    return self.prepare_train_img(idx)
  File "/home/zhangbulin/mmseg-0.23.0/mmseg/datasets/custom.py", line 232, in prepare_train_img
    return self.pipeline(results)
  File "/home/zhangbulin/mmseg-0.23.0/mmseg/datasets/pipelines/compose.py", line 41, in __call__
    data = t(data)
  File "/home/zhangbulin/mmseg-0.23.0/mmseg/datasets/pipelines/loading.py", line 61, in __call__
    img_bytes = self.file_client.get(filename)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/fileio/file_client.py", line 993, in get
    return self.client.get(filepath)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/fileio/file_client.py", line 518, in get
    with open(filepath, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/zhangbulin/mmseg-0.23.0/newdata_1/JPEGImages/366.jpg'

Traceback (most recent call last): File "./train.py", line 254, in main() File "./train.py", line 243, in main train_segmentor( File "/home/zhangbulin/mmseg-0.23.0/mmseg/apis/train.py", line 175, in train_segmentor runner.run(data_loaders, cfg.workflow) File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run iter_runner(iter_loaders[i], **kwargs) File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 59, in train data_batch = next(data_loader) File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 32, in next data = next(self.iter_loader) File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in next Traceback (most recent call last): File "./train.py", line 254, in data = self._next_data() File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data return self._process_data(data) File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data main() File "./train.py", line 243, in main data.reraise()train_segmentor(

File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise File "/home/zhangbulin/mmseg-0.23.0/mmseg/apis/train.py", line 175, in train_segmentor runner.run(data_loaders, cfg.workflow)raise exception

File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0. Original Traceback (most recent call last): File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop data = fetcher.fetch(index) File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch data = [self.dataset[idx] for idx in possibly_batched_index] File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in data = [self.dataset[idx] for idx in possibly_batched_index] File "/home/zhangbulin/mmseg-0.23.0/mmseg/datasets/custom.py", line 215, in getitem return self.prepare_train_img(idx) File "/home/zhangbulin/mmseg-0.23.0/mmseg/datasets/custom.py", line 232, in prepare_train_img return self.pipeline(results) File "/home/zhangbulin/mmseg-0.23.0/mmseg/datasets/pipelines/compose.py", line 41, in call data = t(data) File "/home/zhangbulin/mmseg-0.23.0/mmseg/datasets/pipelines/loading.py", line 61, in call img_bytes = self.file_client.get(filename) File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/fileio/file_client.py", line 993, in get return self.client.get(filepath) File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/fileio/file_client.py", line 518, in get with open(filepath, 'rb') as f: FileNotFoundError: [Errno 2] No such file or directory: '/home/zhangbulin/mmseg-0.23.0/newdata_1/JPEGImages/372.jpg'

iter_runner(iter_loaders[i], **kwargs)

File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 59, in train data_batch = next(data_loader) File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 32, in next data = next(self.iter_loader) File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in next data = self._next_data() File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data return self._process_data(data) File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data data.reraise() File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise raise exception FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0. Original Traceback (most recent call last): File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop data = fetcher.fetch(index) File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch data = [self.dataset[idx] for idx in possibly_batched_index] File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in data = [self.dataset[idx] for idx in possibly_batched_index] File "/home/zhangbulin/mmseg-0.23.0/mmseg/datasets/custom.py", line 215, in getitem return self.prepare_train_img(idx) File "/home/zhangbulin/mmseg-0.23.0/mmseg/datasets/custom.py", line 232, in prepare_train_img return self.pipeline(results) File "/home/zhangbulin/mmseg-0.23.0/mmseg/datasets/pipelines/compose.py", line 41, in call data = t(data) File "/home/zhangbulin/mmseg-0.23.0/mmseg/datasets/pipelines/loading.py", line 61, in 
call img_bytes = self.file_client.get(filename) File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/fileio/file_client.py", line 993, in get return self.client.get(filepath) File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/fileio/file_client.py", line 518, in get with open(filepath, 'rb') as f: FileNotFoundError: [Errno 2] No such file or directory: '/home/zhangbulin/mmseg-0.23.0/newdata_1/JPEGImages/181.jpg'

Traceback (most recent call last):
  File "./train.py", line 254, in <module>
    main()
  File "./train.py", line 243, in main
    train_segmentor(
  File "/home/zhangbulin/mmseg-0.23.0/mmseg/apis/train.py", line 175, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 59, in train
    data_batch = next(data_loader)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 32, in __next__
    data = next(self.iter_loader)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/zhangbulin/mmseg-0.23.0/mmseg/datasets/custom.py", line 215, in __getitem__
    return self.prepare_train_img(idx)
  File "/home/zhangbulin/mmseg-0.23.0/mmseg/datasets/custom.py", line 232, in prepare_train_img
    return self.pipeline(results)
  File "/home/zhangbulin/mmseg-0.23.0/mmseg/datasets/pipelines/compose.py", line 41, in __call__
    data = t(data)
  File "/home/zhangbulin/mmseg-0.23.0/mmseg/datasets/pipelines/loading.py", line 61, in __call__
    img_bytes = self.file_client.get(filename)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/fileio/file_client.py", line 993, in get
    return self.client.get(filepath)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/mmcv/fileio/file_client.py", line 518, in get
    with open(filepath, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/zhangbulin/mmseg-0.23.0/newdata_1/JPEGImages/389.jpg'

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 23834) of binary: /home/zhangbulin/anaconda3/envs/mmseg23/bin/python
Traceback (most recent call last):
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhangbulin/anaconda3/envs/mmseg23/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Additional information

If I delete 389.jpg, the file named in FileNotFoundError: [Errno 2] No such file or directory: '/home/zhangbulin/mmseg-0.23.0/newdata_1/JPEGImages/389.jpg', the error then reports that some other jpg cannot be found.

I am sure that my jpg files exist and are configured correctly, and I am familiar with the mmseg framework.
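Since the missing file changes every time one is removed, it may help to check the whole dataset in one pass rather than one traceback at a time. The following is a minimal sketch, not part of mmseg or mmcv: the helper name find_missing_images is hypothetical, and it assumes the split file lists one image stem per line (no extension) and that the expected suffix is '.jpg'. It reports every expected path that does not exist, plus any file in the image directory that has the same stem but a different extension or letter case, which is one possible reason a file can look present and still not be found.

```python
import os
from pathlib import Path


def find_missing_images(img_dir, split_file, img_suffix=".jpg"):
    """Return (missing, near_matches) for the stems listed in split_file.

    missing: expected image paths that do not exist on disk.
    near_matches: maps a missing path to the files in img_dir whose stem
    matches when case and extension are ignored (e.g. 366.JPG, 366.png).

    Assumption: split_file holds one image stem per line, without suffix.
    """
    img_dir = Path(img_dir)

    # Index the directory once: lower-cased stem -> actual file names.
    on_disk = {}
    for entry in os.listdir(img_dir):
        on_disk.setdefault(Path(entry).stem.lower(), []).append(entry)

    missing, near_matches = [], {}
    with open(split_file, encoding="utf-8") as f:
        for line in f:
            stem = line.strip()
            if not stem:
                continue  # skip blank lines in the split file
            expected = img_dir / (stem + img_suffix)
            if expected.is_file():
                continue
            missing.append(str(expected))
            candidates = on_disk.get(stem.lower(), [])
            if candidates:
                near_matches[str(expected)] = sorted(candidates)
    return missing, near_matches
```

To use it, call find_missing_images with the JPEGImages directory from the traceback and the path of the split file used by the config (the split file location is not given in this report, so it has to be filled in). An empty missing list would suggest the problem lies elsewhere, for example in how data_root is resolved at run time.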

zhouzaida commented 1 year ago

Have you resolved this error? It looks like some files could not be found.

Markson-Young commented 1 year ago

I got the same error. The error message is as follows:

/HOME/scw6580/.conda/envs/MMEngine/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


/HOME/scw6580/.conda/envs/MMEngine/bin/python: can't open file '/var/spool/slurmd/job713328/train.py': [Errno 2] No such file or directory
/HOME/scw6580/.conda/envs/MMEngine/bin/python: can't open file '/var/spool/slurmd/job713328/train.py': [Errno 2] No such file or directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 3623) of binary: /HOME/scw6580/.conda/envs/MMEngine/bin/python
Traceback (most recent call last):
  File "/HOME/scw6580/.conda/envs/MMEngine/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/HOME/scw6580/.conda/envs/MMEngine/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/HOME/scw6580/.conda/envs/MMEngine/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/HOME/scw6580/.conda/envs/MMEngine/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/HOME/scw6580/.conda/envs/MMEngine/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/HOME/scw6580/.conda/envs/MMEngine/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/HOME/scw6580/.conda/envs/MMEngine/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/HOME/scw6580/.conda/envs/MMEngine/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/var/spool/slurmd/job713328/train.py FAILED

Failures:
[1]:
  time      : 2023-09-01_21:04:47
  host      : g0173.para.ai
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 3624)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2023-09-01_21:04:47
  host      : g0173.para.ai
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 3623)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

/var/spool/slurmd/job713328/slurm_script: line 21: --launcher: command not found

Markson-Young commented 1 year ago

Single-GPU training works without problems; the error only occurs when training with two or more GPUs. Here is my training command: sbatch -p gpu_4090 --gpus=2 tools/dist_train.sh configs/dino/dino_4scale_r50_8xb2_12e_coco.py 2
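One possible reading of the log above, offered as a guess rather than a confirmed diagnosis: the job ran as /var/spool/slurmd/job713328/slurm_script, which suggests sbatch executed a copy of dist_train.sh from its spool directory. If the script locates train.py relative to its own path (OpenMMLab launch scripts commonly use $(dirname "$0")/train.py), the lookup would land in the spool directory, matching the "can't open file '/var/spool/slurmd/job713328/train.py'" message. The Python sketch below only illustrates that path arithmetic. Both function names are hypothetical, SLURM_SUBMIT_DIR is the environment variable Slurm sets to the directory sbatch was invoked from, and the tools/ subdirectory is an assumption taken from the command above.

```python
import posixpath


def train_script_path(script_path):
    """Mimic `$(dirname "$0")/train.py` for a given launcher script path."""
    return posixpath.join(posixpath.dirname(script_path), "train.py")


def train_script_path_slurm_aware(script_path, env):
    """Prefer the sbatch submission directory when it is known.

    Assumption: the job was submitted from the repository root, so
    train.py lives in <submit dir>/tools/. `env` is a mapping such as
    os.environ.
    """
    submit_dir = env.get("SLURM_SUBMIT_DIR")
    if submit_dir:
        return posixpath.join(submit_dir, "tools", "train.py")
    # Outside Slurm, fall back to the script-relative lookup.
    return train_script_path(script_path)


# The spooled copy resolves next to itself, where no train.py exists.
spooled = train_script_path("/var/spool/slurmd/job713328/slurm_script")
```

Here spooled evaluates to /var/spool/slurmd/job713328/train.py, the same path the interpreter failed to open. If this reading is right, wrapping the launch in a small batch script that changes to the repository root and then calls tools/dist_train.sh, or resolving the path from the submission directory as in the second function, would be worth trying.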