Open shamafiras opened 2 years ago
I wonder why that would happen in the middle of training (after 20% of the training).
I don't know why the error would be raised after a few successful iterations, and I can't speculate on the root cause. If you are seeing the same issue with the latest stable or nightly release, please ping me once you can provide a minimal, executable code snippet so that we can debug it.
@ptrblck Thanks for your support. I tried running the same process on the latest stable release (versions below) and still got the same error. Unfortunately, it would take a lot of time to put together a minimal code snippet that reproduces the error, but I'll try.
Collecting environment information...
PyTorch version: 1.12.0
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-15)
Clang version: Could not collect
CMake version: version 3.22.3
Libc version: glibc-2.26
Python version: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-4.14.285-215.501.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: True
CUDA runtime version: 11.6.124
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 510.73.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] torch==1.12.0
[pip3] torch-model-archiver==0.5.3b20220226
[pip3] torch-workflow-archiver==0.2.4b20220513
[pip3] torchaudio==0.12.0
[pip3] torchserve==0.6.0b20220513
[pip3] torchtext==0.13.0
[pip3] torchvision==0.13.0
[conda] blas 2.115 mkl conda-forge
[conda] blas-devel 3.9.0 15_linux64_mkl conda-forge
[conda] captum 0.5.0 0 pytorch
[conda] cudatoolkit 11.6.0 hecad31d_10 conda-forge
[conda] libblas 3.9.0 15_linux64_mkl conda-forge
[conda] libcblas 3.9.0 15_linux64_mkl conda-forge
[conda] liblapack 3.9.0 15_linux64_mkl conda-forge
[conda] liblapacke 3.9.0 15_linux64_mkl conda-forge
[conda] magma-cuda116 2.6.1 0 pytorch
[conda] mkl 2022.1.0 h84fe81f_915 conda-forge
[conda] mkl-devel 2022.1.0 ha770c72_916 conda-forge
[conda] mkl-include 2022.1.0 h84fe81f_915 conda-forge
[conda] numpy 1.22.4 pypi_0 pypi
[conda] pytorch 1.12.0 py3.9_cuda11.6_cudnn8.3.2_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch-model-archiver 0.5.3 py39_0 pytorch
[conda] torch-workflow-archiver 0.2.4 py39_0 pytorch
[conda] torchaudio 0.12.0 py39_cu116 pytorch
[conda] torchserve 0.6.0 py39_0 pytorch
[conda] torchtext 0.13.0 py39 pytorch
[conda] torchvision 0.13.0 py39_cu116 pytorch
Hey, it took me a while, but I was able to reproduce the issue on an available Colab notebook after I applied mixed precision. To reproduce the bug:
scaler = GradScaler()
somewhere above
I hope it helps with analyzing and fixing the bug. Let me know if you need further info.
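For readers following along, a `GradScaler` usually sits in a mixed precision training loop roughly as in the sketch below. This is not the reporter's code or the Colab notebook, neither of which is preserved in this thread. The two small fully connected networks, the random data, the optimizer, and names such as `encoder`, `head`, and `train_step` are hypothetical stand-ins. The sketch assumes PyTorch 1.10 or newer for `torch.autocast`, and it turns autocast and loss scaling off when no GPU is present.

```python
import torch
from torch import nn

# Hypothetical stand-ins: two small fully connected networks and random data.
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # in this sketch, autocast and loss scaling are GPU-only

torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU()).to(device)
head = nn.Linear(32, 1).to(device)
params = list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=1e-2)

# Newer releases prefer torch.amp.GradScaler("cuda"); this spelling matches
# the versions mentioned in the thread and may emit a deprecation warning.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(x, y):
    optimizer.zero_grad(set_to_none=True)
    # The forward pass and the loss run under autocast.
    with torch.autocast(device_type=device, enabled=use_amp):
        loss = nn.functional.mse_loss(head(encoder(x)), y)
    # Backward runs on the scaled loss to reduce float16 gradient underflow.
    scaler.scale(loss).backward()
    # Unscale before clipping so the threshold applies to the true gradients.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    # step() skips the optimizer update if the gradients contain inf or NaN.
    scaler.step(optimizer)
    # update() adjusts the scale factor for the next iteration.
    scaler.update()
    return loss.item()


losses = []
for _ in range(5):
    x = torch.randn(8, 16, device=device)
    y = torch.randn(8, 1, device=device)
    losses.append(train_step(x, y))
```

A loop with this shape makes a convenient base for a minimal reproduction: swap in the real model and data one piece at a time until the error reappears.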
@soulitzer @ptrblck Any update on this?
I encountered the same problems here. Is there any update?
I started running into the same issue with PyTorch 2.0.0+cu117. It crashes after >300 successful forward passes.
Same issue with PyTorch 2.1+cu118
Same issue with PyTorch 2.3.1+cu121
Bump on this. Encountered the same issue with PyTorch 1.13 + cuda116
🐛 Describe the bug
I am training a model that includes 2 fully connected networks. So far I have not seen any issues with them, and I wanted to use mixed precision training to accelerate the process. I made the required changes to the code, but strangely, after several epochs of successful training and loss curve behaviour, I get the following error:
I wonder why that would happen in the middle of training (after 20% of the training).
The code is huge, but here's its basic structure:
Versions
Collecting environment information...
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 10 Home
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A
Python version: 3.8.10 (default, May 19 2021, 13:12:57) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19044-SP0
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3070
Nvidia driver version: 472.39
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] torch==1.9.0
[pip3] torch-tb-profiler==0.2.1
[pip3] torchvision==0.10.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.1 heb2d755_7 conda-forge
[conda] mkl 2021.2.0 haa95532_296
[conda] mkl-service 2.3.0 py38h2bbff1b_1
[conda] mkl_fft 1.3.0 py38h277e83a_2
[conda] mkl_random 1.2.1 py38hf11a4ad_2
[conda] numpy 1.20.2 py38ha4e8547_0
[conda] numpy-base 1.20.2 py38hc2deb75_0
[conda] pytorch 1.9.0 py3.8_cuda11.1_cudnn8_0 pytorch
[conda] torch-tb-profiler 0.2.1 pypi_0 pypi
[conda] torchvision 0.10.0 py38_cu111 pytorch
cc @mcarilli @ptrblck