pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Rare segfault probably related to JIT and multiprocessing #63865

Closed twoertwein closed 3 years ago

twoertwein commented 3 years ago

🐛 Bug

I'm sorry, I currently do not have a reproducible example of the following issue. I use a custom dask computation graph (created with dask.delayed) to run a grid search in parallel, in the current case on a single compute node using dask.distributed.LocalCluster. Each task instantiates a torch.nn.Module and JITs it with torch.jit.script. Very infrequently, I get the following stack trace, which points to https://github.com/pytorch/pytorch/blob/release/1.9/torch/_jit_internal.py#L95

Fatal Python error: Segmentation fault

Thread 0x00007fd5e6a0c700 (most recent call first):
  File "/pool01/home/twoertwe/.cache/pypoetry/virtualenvs/maps-yYEsxs9t-py3.9/lib/python3.9/site-packages/torch/_jit_internal.py", line 95 in <lambda>
  File "/pool01/home/twoertwe/.cache/pypoetry/virtualenvs/maps-yYEsxs9t-py3.9/lib/python3.9/site-packages/torch/jit/_recursive.py", line 355 in create_methods_and_properties_from_stubs
  File "/pool01/home/twoertwe/.cache/pypoetry/virtualenvs/maps-yYEsxs9t-py3.9/lib/python3.9/site-packages/torch/jit/_recursive.py", line 478 in create_script_module_impl
  File "/pool01/home/twoertwe/.cache/pypoetry/virtualenvs/maps-yYEsxs9t-py3.9/lib/python3.9/site-packages/torch/jit/_recursive.py", line 412 in create_script_module
  File "/pool01/home/twoertwe/.cache/pypoetry/virtualenvs/maps-yYEsxs9t-py3.9/lib/python3.9/site-packages/torch/jit/_script.py", line 1096 in script
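
The "Fatal Python error ... (most recent call first)" layout above is the format written by Python's standard faulthandler module. As a minimal sketch, assuming each worker process can run a setup function at startup, the following writes such traces to one file per process so that a rare crash is not lost in interleaved worker output. The helper name enable_crash_traces and the log directory are hypothetical and not part of the original report.

```python
import faulthandler
import os


def enable_crash_traces(log_dir):
    """Dump tracebacks of all threads to a per-process file on a fatal signal.

    Hypothetical helper: the name and the log layout are assumptions.
    """
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"crash-{os.getpid()}.log")
    # faulthandler writes to the file descriptor, so the file object must
    # stay open (and referenced) for as long as the handler is enabled.
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return path, log_file
```

The returned file object should be kept alive by the caller for the lifetime of the process. Passing all_threads=True matches the trace above, which lists the crashing thread by its identifier.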

Environment

Collecting environment information...
PyTorch version: 1.9.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.7 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~16.04) 9.4.0
Clang version: 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
CMake version: version 3.15.1
Libc version: glibc-2.23

Python version: 3.9.5+ (heads/3.9:0796e21, Jun 25 2021, 16:35:43) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-4.13.0-36-generic-x86_64-with-glibc2.23
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: GeForce RTX 2080 Ti
Nvidia driver version: 440.64
cuDNN version: /usr0/local/cuda-9.0/lib64/libcudnn.so.7.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy==0.910
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.20.3
[pip3] torch==1.9.0+cu102
[pip3] torchvision==0.10.0+cu102
[conda] Could not collect

cc @gmagogsfm

nikithamalgifb commented 3 years ago

@twoertwein: Without a reproducer it's very difficult to understand what's going wrong. If you could provide us with a small reproducer, it would help us help you faster. For now, I will go ahead and close this issue. Please feel free to reopen this issue when you have a reproducer.
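
Because the crash is reported as very infrequent, one way to hunt for a reproducer is to rerun a candidate script many times in a fresh interpreter and stop at the first run killed by SIGSEGV. The following is a minimal sketch for POSIX systems, where subprocess reports death by a signal as a negative return code. The function name run_until_crash is hypothetical, and the script to run is an assumption left to the reader.

```python
import signal
import subprocess


def run_until_crash(command, max_runs=100):
    """Run `command` repeatedly until a run is killed by SIGSEGV.

    Returns (attempt, stderr) for the first crashing run, or (None, "")
    if no run crashed. Hypothetical helper, POSIX only.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        # On POSIX, a child killed by signal N has returncode == -N.
        if result.returncode == -signal.SIGSEGV:
            return attempt, result.stderr
    return None, ""
```

For example, a command such as [sys.executable, "-X", "faulthandler", "repro.py"], with repro.py a hypothetical candidate script, would also place the Python-level traceback of the crashing run in the returned stderr.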