Closed: twoertwein closed this issue 3 years ago
@twoertwein: Without a reproducer it's very difficult to understand what's going wrong. If you could provide us with a small reproducer, it would help us help you faster. For now, I will go ahead and close this issue. Please feel free to reopen it once you have a reproducer.
🐛 Bug
I'm sorry, I currently do not have a reproducible example of the following issue. I use a custom dask computation graph (created with `dask.delayed`) to run a grid search in parallel, in the current case on a single compute node using `dask.distributed.LocalCluster`. Each task instantiates a `torch.nn.Module` and JITs it with `torch.jit.script`. Very infrequently, I get a stack trace that points to https://github.com/pytorch/pytorch/blob/release/1.9/torch/_jit_internal.py#L95
Environment
Collecting environment information...
PyTorch version: 1.9.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.7 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~16.04) 9.4.0
Clang version: 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
CMake version: version 3.15.1
Libc version: glibc-2.23

Python version: 3.9.5+ (heads/3.9:0796e21, Jun 25 2021, 16:35:43) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-4.13.0-36-generic-x86_64-with-glibc2.23
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: GeForce RTX 2080 Ti
Nvidia driver version: 440.64
cuDNN version: /usr0/local/cuda-9.0/lib64/libcudnn.so.7.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy==0.910
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.20.3
[pip3] torch==1.9.0+cu102
[pip3] torchvision==0.10.0+cu102
[conda] Could not collect
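As a possible starting point for a reproducer, the setup described in the report (a `dask.delayed` graph executed on a `LocalCluster`, where each task instantiates a `torch.nn.Module` and passes it to `torch.jit.script`) might be sketched roughly as below. This is only a sketch under assumptions: `TinyModel`, `script_and_run`, `build_graph`, `run_gridsearch`, the layer sizes, and the worker count are all hypothetical and not taken from the original code, and nothing here is known to trigger the failure.

```python
import dask
import torch
from dask.distributed import Client, LocalCluster


class TinyModel(torch.nn.Module):
    """Hypothetical stand-in for the real model used in the grid search."""

    def __init__(self, hidden_size: int) -> None:
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(4, hidden_size),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_size, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)


def script_and_run(hidden_size: int):
    """One grid-search task: build the module, JIT it, run a forward pass."""
    model = TinyModel(hidden_size)
    # The report says the failure surfaces, very infrequently, during scripting.
    scripted = torch.jit.script(model)
    output = scripted(torch.zeros(3, 4))
    return hidden_size, tuple(output.shape)


def build_graph(hidden_sizes):
    """Create one delayed task per grid point; nothing is executed yet."""
    return [dask.delayed(script_and_run)(h) for h in hidden_sizes]


def run_gridsearch(hidden_sizes, n_workers=2):
    """Execute the graph on a single-node LocalCluster (assumed settings)."""
    with LocalCluster(n_workers=n_workers) as cluster, Client(cluster) as client:
        futures = client.compute(build_graph(hidden_sizes))
        return client.gather(futures)
```

To try it, call `run_gridsearch(range(1, 200))` from under an `if __name__ == "__main__":` guard, ideally with the model defined in an importable module so that worker processes can locate its source. Since the failure is reported as very infrequent, many tasks and repeated runs would likely be needed before drawing any conclusion.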
cc @gmagogsfm