Status: Closed (Cathy0908 closed this issue 1 year ago)
I'm unable to reproduce your issue @Cathy0908. The script terminates properly in my environment. Could you please share your environment details, as requested in the issue template?
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
Collecting environment information...
PyTorch version: 1.12.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Alibaba Group Enterprise Linux Server 7.2 (Paladin) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
Clang version: Could not collect
CMake version: version 3.22.3
Libc version: glibc-2.17

Python version: 3.7.13 (default, Mar 29 2022, 02:18:16) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-327.ali2008.odps.alios7.x86_64-x86_64-with-redhat-7.2-Paladin
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
GPU 4: Tesla V100-SXM2-16GB
GPU 5: Tesla V100-SXM2-16GB
GPU 6: Tesla V100-SXM2-16GB
GPU 7: Tesla V100-SXM2-16GB

Nvidia driver version: 440.33.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.20.0
[pip3] pytorch-metric-learning==0.9.89
[pip3] torch==1.12.0
[pip3] torchvision==0.13.0
[conda] numpy 1.20.0 pypi_0 pypi
[conda] pytorch-metric-learning 0.9.89 pypi_0 pypi
[conda] torch 1.12.0 pypi_0 pypi
[conda] torchvision 0.13.0 pypi_0 pypi
Even the following code hangs:
import torch
from multiprocessing import Process

a = torch.zeros(size=(1000, 1000))  # allocate a tensor in the parent process first

def f():
    # this tensor op in the forked child never returns (see discussion above)
    torch.zeros(size=(1000, 1000))

p = Process(target=f)
p.start()
p.join()
print('====end====')
But if torch.set_num_threads(1) is called first, it terminates normally.
The torch DataLoader also sets num_threads to 1 when num_workers > 0; see torch.utils.data._utils.worker._worker_loop. A sketch of the same workaround follows.
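A minimal sketch of that workaround, based only on the observation above; the single-thread setting mirrors what _worker_loop does for DataLoader workers (whether it must run in the parent or the child is not confirmed in this thread, here it runs in the parent before any tensor op):

import torch
from multiprocessing import Process

torch.set_num_threads(1)  # limit PyTorch's intra-op thread pool before any tensor op

a = torch.zeros(size=(1000, 1000))

def f():
    torch.zeros(size=(1000, 1000))  # reportedly completes with a single-thread pool

p = Process(target=f)
p.start()
p.join()
print('====end====')  # with set_num_threads(1), this line is reached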
Thanks for the details @Cathy0908. From your last snippet, this issue seems unrelated to torchvision, since it happens even when torchvision isn't imported. I'd recommend 1) checking whether you observe the same issue without torch, e.g.
def f(): print("Hello")
and 2) reporting the issue at https://github.com/pytorch/pytorch/issues if it seems to be pytorch-specific. A self-contained version of the torch-free check is sketched below.
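A minimal sketch of that torch-free check (the print strings are illustrative, not from the original thread):

from multiprocessing import Process

def f():
    print("Hello")  # no torch involved; this should print and return promptly

p = Process(target=f)
p.start()
p.join()
print('====end====')  # reaching this line means plain multiprocessing works

If this version also hangs, the problem lies in the environment's multiprocessing setup rather than in torch.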
🐛 Describe the bug
I load a checkpoint in the main process, then start a subprocess that calls ToTensor, and the subprocess hangs forever. The following code reproduces my problem:
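The original snippet did not survive in this report; the following is only a hedged reconstruction from the description above, with a hypothetical checkpoint path and a synthetic PIL image standing in for the real inputs:

import torch
from multiprocessing import Process
from PIL import Image
from torchvision.transforms import ToTensor

ckpt = torch.load('checkpoint.pth', map_location='cpu')  # hypothetical checkpoint path

def f():
    img = Image.new('RGB', (224, 224))  # synthetic stand-in image
    ToTensor()(img)  # the subprocess reportedly hangs here

p = Process(target=f)
p.start()
p.join()
print('====end====')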
Versions
torch version: 1.12.0
torchvision version: 0.13.0