megvii-research / MSPN

Multi-Stage Pose Network

subprocess.CalledProcessError #31

Closed Yinhance closed 3 years ago

Yinhance commented 3 years ago

When I run the command python -m torch.distributed.launch --nproc_per_node=4 train.py, I get the following error:

Traceback (most recent call last):
  File "/home/yh/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/yh/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in <module>
    main()
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/yh/anaconda3/bin/python', '-u', 'train.py', '--local_rank=0']' returned non-zero exit status 1.

How can I solve this problem? Thanks!

Yinhance commented 3 years ago

The full error log:

Traceback (most recent call last):
  File "train.py", line 119, in <module>
    main()
  File "train.py", line 70, in main
    data_loader, engine.state.iteration):
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yh/MSPN/dataset/JointsDataset.py", line 73, in __getitem__
    raise ValueError('fail to read {}'.format(img_path))
ValueError: fail to read /home/yh/MSPN/dataset/COCO/images/val2014/COCO_val2014_000000435257.jpg

2021-04-03 00:56:54 deepserver3 train[26932] INFO 

Start training with pytorch version 1.3.1
2021-04-03 00:56:54 deepserver3 train[26929] WARNING A exception occurred during Engine initialization, give up running process
Traceback (most recent call last):
  File "train.py", line 119, in <module>
    main()
  File "train.py", line 70, in main
    data_loader, engine.state.iteration):
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yh/MSPN/dataset/JointsDataset.py", line 73, in __getitem__
    raise ValueError('fail to read {}'.format(img_path))
ValueError: fail to read /home/yh/MSPN/dataset/COCO/images/val2014/COCO_val2014_000000246589.jpg

2021-04-03 00:56:54 deepserver3 train[26930] WARNING A exception occurred during Engine initialization, give up running process
Traceback (most recent call last):
  File "train.py", line 119, in <module>
    main()
  File "train.py", line 70, in main
    data_loader, engine.state.iteration):
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yh/MSPN/dataset/JointsDataset.py", line 73, in __getitem__
    raise ValueError('fail to read {}'.format(img_path))
ValueError: fail to read /home/yh/MSPN/dataset/COCO/images/val2014/COCO_val2014_000000259761.jpg

2021-04-03 00:56:54 deepserver3 train[26932] WARNING A exception occurred during Engine initialization, give up running process
Traceback (most recent call last):
  File "train.py", line 119, in <module>
    main()
  File "train.py", line 70, in main
    data_loader, engine.state.iteration):
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yh/MSPN/dataset/JointsDataset.py", line 73, in __getitem__
    raise ValueError('fail to read {}'.format(img_path))
ValueError: fail to read /home/yh/MSPN/dataset/COCO/images/train2014/COCO_train2014_000000310013.jpg

Traceback (most recent call last):
  File "/home/yh/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/yh/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in <module>
    main()
  File "/home/yh/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/yh/anaconda3/bin/python', '-u', 'train.py', '--local_rank=3']' returned non-zero exit status 1

fenglinglwb commented 3 years ago

It seems that JointsDataset.py failed to read images from COCO. Please check that the image paths you provided are correct.
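One way to check the paths before launching training is a small script like the one below. It is only a sketch: the function name find_unreadable_images is hypothetical and not part of MSPN, and the example paths are copied from the traceback above. It only checks that each file exists, is non-empty, and can be opened, so a corrupt image with content would still pass; decoding with the same reader that JointsDataset.py uses would be a stricter check.

```python
import os


def find_unreadable_images(image_paths):
    """Return the paths that are missing, empty, or cannot be opened.

    Hypothetical helper, not part of MSPN. It does not decode the image.
    """
    bad = []
    for path in image_paths:
        # A missing file or a zero-byte file can never be read as an image.
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            bad.append(path)
            continue
        try:
            # Opening and reading one byte catches permission problems.
            with open(path, "rb") as f:
                f.read(1)
        except OSError:
            bad.append(path)
    return bad


if __name__ == "__main__":
    # Example paths taken from the error log in this thread.
    candidates = [
        "/home/yh/MSPN/dataset/COCO/images/val2014/COCO_val2014_000000435257.jpg",
        "/home/yh/MSPN/dataset/COCO/images/train2014/COCO_train2014_000000310013.jpg",
    ]
    for path in find_unreadable_images(candidates):
        print("cannot read:", path)
```

If every image in the dataset shows up in the output, the image directory itself is probably in the wrong place; if only a few do, those files are likely missing or were only partially downloaded.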

Yinhance commented 3 years ago

Got it! The subprocess.CalledProcessError is not an independent error; the error logged before it is the decisive one.
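For anyone who finds this later, a minimal sketch of why that is the case. The child command here is a stand-in for train.py, and run_child is a hypothetical name used only for illustration. The CalledProcessError itself carries just the command and the exit status; the real cause is whatever the child process wrote to stderr before it exited.

```python
import subprocess
import sys

# Stand-in for a failing worker such as train.py: an uncaught exception
# makes the Python interpreter exit with status 1.
CHILD_CODE = "raise ValueError('fail to read some_image.jpg')"


def run_child():
    """Run the child process and return (returncode, stderr text)."""
    try:
        subprocess.run(
            [sys.executable, "-c", CHILD_CODE],
            check=True,            # raise CalledProcessError on non-zero exit
            capture_output=True,   # keep the child's stderr so it can be inspected
            text=True,
        )
    except subprocess.CalledProcessError as err:
        # err only knows the command and the exit status;
        # the actual traceback is in the child's stderr.
        return err.returncode, err.stderr
    return 0, ""


if __name__ == "__main__":
    code, stderr = run_child()
    print("exit status:", code)
    print("child stderr:")
    print(stderr)
```

In this thread the launcher does not capture the workers' output, so the decisive ValueError appears in the terminal above the final CalledProcessError, as in the full log.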