Open yitongh opened 10 months ago
@vanbasten23 do you have any idea?
Hi @yitongh, what's your CUDA version (nvcc --version)?
@vanbasten23 My cuda version is 11.8. Driver version is 470.154.
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
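For anyone comparing environments in a thread like this, the release number can be read out of the nvcc output with a few lines of Python. This is only a minimal sketch: it assumes the "release X.Y" line format shown in the output above, and the function names parse_nvcc_release and installed_cuda_release are hypothetical helpers, not part of any library.

```python
import re
import subprocess


def parse_nvcc_release(text):
    """Return (major, minor) parsed from `nvcc --version` output, or None."""
    # Assumes a line such as: "Cuda compilation tools, release 11.8, V11.8.89"
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_release():
    """Run nvcc and parse its output; returns None if nvcc is unavailable."""
    try:
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(result.stdout)


# Parsing the line quoted in this thread:
sample = "Cuda compilation tools, release 11.8, V11.8.89"
print(parse_nvcc_release(sample))
```

Note that this reports the toolkit version only; the driver version (470.154 here) comes from nvidia-smi and is a separate value.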
I tried your script in my cuda 12.1 container. I have:
root@xiowei-gpu-1:/ansible# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
root@xiowei-gpu-1:/ansible# nvidia-smi
Thu Jan 4 01:08:05 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
But I couldn't reproduce the same error. I got a different error: https://gist.github.com/vanbasten23/fb50968127a5d17f0441753b80bcac5b
Mind sharing your stacktrace?
My stacktrace is https://gist.github.com/yitongh/b82508236a8e1336f049abebfe7c6e0e
BTW, this error seems to occur just after starting the warmup phase in the torch profiler. It could be related to the torch version. I'm using the latest version of PyTorch with commit id f6dfbffb3bb46ada6fe66b5da4f989f9d4d69b3c.
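For context on the "warmup phase" mentioned here, the sketch below shows roughly how a torch.profiler schedule moves through its wait, warmup, and active phases as prof.step() is called. It is not the diff from this issue: the toy Linear model, the train_step and run_profiled names, and the CPU-only activity list are all assumptions chosen so the example runs without a GPU or torch_xla.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule


def train_step(model, optimizer, batch):
    # Hypothetical stand-in for one training step (not the ImageNet test script).
    optimizer.zero_grad()
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()
    return loss.item()


def run_profiled(num_steps=6):
    model = torch.nn.Linear(8, 4)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    ready = []  # one entry is appended each time a trace becomes ready

    # wait=1: profiler idle for one step
    # warmup=1: profiler is tracing but the results are discarded
    # active=2: the next two steps are recorded
    sched = schedule(wait=1, warmup=1, active=2, repeat=1)
    with profile(
        activities=[ProfilerActivity.CPU],  # CPU only, an assumption for portability
        schedule=sched,
        on_trace_ready=lambda prof: ready.append(True),
    ) as prof:
        for _ in range(num_steps):
            train_step(model, optimizer, torch.randn(16, 8))
            prof.step()  # advances the schedule to the next phase
    return ready


if __name__ == "__main__":
    print(len(run_profiled()))
```

With this schedule the profiler begins tracing at the second step, which is the point in the cycle where the error above is reported to appear in the CUDA setup.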
I wonder if it's a cuda error. With cuda 12.1, I don't see the error and your code runs further.
Using torch.profiler.profile in test/test_train_mp_imagenet.py can result in CUDA_ERROR_ILLEGAL_ADDRESS.
git diff test/test_train_mp_imagenet.py
Command:
PJRT_DEVICE=CUDA torchrun --nnodes 1 --nproc_per_node 2 test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1