pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

CrossEntropy fails silently #117532

Closed joihn closed 9 months ago

joihn commented 9 months ago

🐛 Describe the bug

When given invalid input (a label that is neither a valid class nor ignore_index), cross_entropy fails silently.

    import torch
    import torch.nn.functional as F
    device = torch.device("cuda")

    ignore_index = 255
    b = 10
    classes = 3
    w = 768
    h = 1024
    predi = torch.zeros(b, classes, w, h, dtype=torch.float16).to(device)
    labels = torch.zeros(b, w, h, dtype=torch.int64).to(device)
    labels[5, 200, 200] = 255  # ignore_index (overwritten by the next line)
    labels[5, 200, 200] = 254  # invalid class: only 0..2 and ignore_index are valid

    per_pixel_losses = F.cross_entropy(
        predi, labels, reduction="none", ignore_index=ignore_index
    )
    print(per_pixel_losses)

The cross_entropy call itself returns without an error; the failure only surfaces later, during the print:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1699449201336/work/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1ac934d617 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f1ac930898d in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f1ac940a128 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f1aca37eee0 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f1aca382d08 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7f1aca3995a0 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7f1aca3998a8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x7f1b166edbf4 in /opt/conda/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f1b61404609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f1b611cf133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Versions

[pip3] adabelief-pytorch==0.2.1
[pip3] erxtorch==0.0.1
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.0
[pip3] onnx==1.15.0
[pip3] onnxruntime==1.16.1
[pip3] torch==2.1.1
[pip3] torchaudio==2.1.1
[pip3] torchelastic==0.2.2
[pip3] torchvision==0.16.1
[pip3] triton==2.1.0
[conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.8 py310h5eee18b_0
[conda] mkl_random 1.2.4 py310hdb19cb5_0
[conda] numpy 1.26.0 py310h5f9d8c6_0
[conda] numpy-base 1.26.0 py310hb5e798b_0
[conda] pytorch 2.1.1 py3.10_cuda12.1_cudnn8.9.2_0 pytorch
[conda] pytorch-cuda 12.1 ha16c6d3_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchaudio 2.1.1 py310_cu121 pytorch
[conda] torchelastic 0.2.2 pypi_0 pypi
[conda] torchtriton 2.1.0 py310 pytorch
[conda] torchvision 0.16.1 py310_cu121 pytorch

Python 3.10.13
NVIDIA-SMI 530.30.02, Driver Version: 530.30.02, CUDA Version: 12.1

cc @ptrblck

ezyang commented 9 months ago

We would accept a PR that adds a CUDA runtime assert for the out-of-bounds access.
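
Until a device-side assert is in place, a host-side check before the loss call can turn the illegal memory access into a readable Python error. The following is only a minimal sketch: `check_targets` is a hypothetical helper name, not part of PyTorch, and the small CPU tensors are stand-ins for the shapes in the report so that the example runs without a GPU.

```python
import torch
import torch.nn.functional as F


def check_targets(labels: torch.Tensor, num_classes: int, ignore_index: int = -100) -> None:
    """Raise ValueError if any label is neither a valid class nor ignore_index.

    Hypothetical helper for illustration; not a PyTorch API.
    """
    valid = ((labels >= 0) & (labels < num_classes)) | (labels == ignore_index)
    if not bool(valid.all()):
        bad = torch.unique(labels[~valid]).tolist()
        raise ValueError(
            f"invalid class labels {bad}; expected 0..{num_classes - 1} "
            f"or ignore_index={ignore_index}"
        )


# Small CPU tensors (assumed shapes) so the sketch runs anywhere.
predi = torch.zeros(2, 3, 4, 4)
labels = torch.zeros(2, 4, 4, dtype=torch.int64)
labels[1, 2, 2] = 255  # equals ignore_index, so it is accepted
check_targets(labels, num_classes=3, ignore_index=255)
losses = F.cross_entropy(predi, labels, reduction="none", ignore_index=255)

labels[1, 3, 3] = 254  # invalid class, as in the report
try:
    check_targets(labels, num_classes=3, ignore_index=255)
except ValueError as err:
    message = str(err)
    print(message)
```

On CUDA tensors, converting `valid.all()` to a Python bool forces a device synchronization, so a check like this is probably best limited to debugging runs or data-loading code rather than the hot training loop.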