pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

ToTensor deadlock in subprocess #7068

Closed Cathy0908 closed 1 year ago

Cathy0908 commented 1 year ago

🐛 Describe the bug

I load a checkpoint in the main process, then start a subprocess that calls ToTensor. The subprocess hangs forever.

The following code reproduces the problem:

import numpy as np
from PIL import Image
from multiprocessing import Process
import torchvision
from torchvision.transforms import ToTensor

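# Build a small RGB test image.
a = np.ones((224, 224, 3), dtype=np.uint8)
a = Image.fromarray(a, mode='RGB')
# Load pretrained weights (the checkpoint) in the main process.
model = torchvision.models.resnet50(True)
# Run ToTensor on the image in a child process; this is where it hangs.
p = Process(target=ToTensor(), args=(a,))
p.start()
p.join()
print('end')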

Versions

torch version: 1.12.0
torchvision version: 0.13.0

NicolasHug commented 1 year ago

I'm unable to reproduce your issue, @Cathy0908. The script terminates properly in my environment. Could you please share your environment details, as requested in the issue template?

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Cathy0908 commented 1 year ago

Collecting environment information...
PyTorch version: 1.12.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Alibaba Group Enterprise Linux Server 7.2 (Paladin) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
Clang version: Could not collect
CMake version: version 3.22.3
Libc version: glibc-2.17

Python version: 3.7.13 (default, Mar 29 2022, 02:18:16) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-327.ali2008.odps.alios7.x86_64-x86_64-with-redhat-7.2-Paladin
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
GPU 4: Tesla V100-SXM2-16GB
GPU 5: Tesla V100-SXM2-16GB
GPU 6: Tesla V100-SXM2-16GB
GPU 7: Tesla V100-SXM2-16GB

Nvidia driver version: 440.33.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.20.0
[pip3] pytorch-metric-learning==0.9.89
[pip3] torch==1.12.0
[pip3] torchvision==0.13.0
[conda] numpy 1.20.0 pypi_0 pypi
[conda] pytorch-metric-learning 0.9.89 pypi_0 pypi
[conda] torch 1.12.0 pypi_0 pypi
[conda] torchvision 0.13.0 pypi_0 pypi

Cathy0908 commented 1 year ago

Even the following code hangs:

from multiprocessing import Process

import torch

a = torch.zeros(size=(1000, 1000))

def f():
    torch.zeros(size=(1000, 1000))

p = Process(target=f)
p.start()
p.join()

print('====end====')

However, if I call torch.set_num_threads(1), it terminates. The torch DataLoader also sets the number of threads to 1 in its workers when num_workers > 0; see torch.utils.data._utils.worker._worker_loop.
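The workaround described above could be sketched roughly as follows. This is a minimal sketch, not a confirmed fix: calling torch.set_num_threads(1) inside the child (rather than in the parent) is an assumption modelled on what the DataLoader worker loop does, and the names child_alloc and run_repro, the fork start method, and the 60-second timeout are illustrative choices that do not come from the report.

```python
import multiprocessing as mp

import torch


def child_alloc():
    # Assumption: limiting the child to one intra-op thread before any
    # tensor work is what lets it terminate, as described in the report.
    torch.set_num_threads(1)
    t = torch.zeros(size=(1000, 1000))
    assert t.shape == (1000, 1000)


def run_repro(timeout=60.0):
    """Return the child's exit code, or None if it had to be terminated."""
    # The parent uses torch first, as in the snippet from the report.
    torch.zeros(size=(1000, 1000))
    # Assumption: Linux with the fork start method, as in the reported env.
    ctx = mp.get_context("fork")
    p = ctx.Process(target=child_alloc)
    p.start()
    p.join(timeout)
    if p.is_alive():
        # Still running after the timeout: treat it as a hang and clean up.
        p.terminate()
        p.join()
        return None
    return p.exitcode


if __name__ == "__main__":
    print("child exit code:", run_repro())
```

An exit code of 0 means the child completed; None means it was still running after the timeout and was terminated.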

NicolasHug commented 1 year ago

Thanks for the details, @Cathy0908. From your last snippet, this issue seems unrelated to torchvision, as it happens even when torchvision is not imported. I'd recommend 1) checking whether you observe the same issue without torch, e.g. def f(): print("Hello"), and 2) reporting the issue at https://github.com/pytorch/pytorch/issues if it seems to be PyTorch-specific.
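The first check suggested above (a child process that does not touch torch at all) might look like the sketch below. It uses only the standard library; the helper name finishes_in_time, the 10-second timeout, and the explicit fork start method are assumptions made for illustration, chosen to match the Linux environment in the report.

```python
import multiprocessing as mp


def hello():
    # No torch involved: if this hangs too, the problem is not PyTorch-specific.
    print("Hello")


def finishes_in_time(target, timeout=10.0):
    """Run `target` in a forked child; True if it exits cleanly within `timeout` seconds."""
    # Assumption: Linux with the fork start method, as in the reported env.
    ctx = mp.get_context("fork")
    p = ctx.Process(target=target)
    p.start()
    p.join(timeout)
    if p.is_alive():
        # The child is stuck: stop it so the parent can exit.
        p.terminate()
        p.join()
        return False
    return p.exitcode == 0


if __name__ == "__main__":
    print("child finished:", finishes_in_time(hello))
```

Joining with a timeout instead of a bare p.join() keeps the parent from blocking forever, so a hang shows up as a False result rather than a stuck terminal.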