pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Error: command buffer exited with error status. #125954

Open dbl001 opened 5 months ago

dbl001 commented 5 months ago

🐛 Describe the bug

I am training llama2.c on a 27" iMac with an AMD Radeon Pro 5700 XT GPU. There are no recent nightly builds for macOS on x86_64, so I built PyTorch from source. I got this exception at step 11,580. I was able to resume training and have not seen the error again. Each step typically takes ~2500 ms; however, in the steps leading up to the exception, some were taking much longer (e.g. 64903.14 ms).

step 11500: train loss 3.2412, val loss 5.7422
saving checkpoint to out
wrote out/model.bin
11500 | loss 7.6908 | lr 2.899000e-05 | 3647545.51ms | mfu 0.45%
11510 | loss 7.5400 | lr 2.895835e-05 | 65127.72ms | mfu 0.40%
11520 | loss 7.5121 | lr 2.892669e-05 | 2504.32ms | mfu 0.42%
11530 | loss 7.1798 | lr 2.889503e-05 | 2536.12ms | mfu 0.43%
11540 | loss 7.5530 | lr 2.886336e-05 | 64845.53ms | mfu 0.39%
11550 | loss 7.3821 | lr 2.883169e-05 | 64852.63ms | mfu 0.35%
11560 | loss 7.3344 | lr 2.880000e-05 | 2569.23ms | mfu 0.37%
11570 | loss 7.3546 | lr 2.876832e-05 | 64916.63ms | mfu 0.34%
11580 | loss 7.1987 | lr 2.873662e-05 | 64903.14ms | mfu 0.31%
Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: 
    (null)
    Caused GPU Timeout Error (00000002:kIOAccelCommandBufferCallbackErrorTimeout)
    <GFX10_MtlCmdBuffer: 0x7f7bed7a9800>
    label = <none> 
    device = <GFX10_MtlDevice: 0x7f7d30118000>
        name = AMD Radeon Pro 5700 XT 
    commandQueue = <GFXAAMD_MtlCmdQueue: 0x7f7d398a8cb0>
        label = <none> 
        device = <GFX10_MtlDevice: 0x7f7d30118000>
            name = AMD Radeon Pro 5700 XT 
    retainedReferences = 1
Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: 
    (null)
    Ignored (for causing prior/excessive GPU errors) (00000004:kIOAccelCommandBufferCallbackErrorSubmissionsIgnored)
    <GFX10_MtlCmdBuffer: 0x7f7bd219b800>
    label = <none> 
    device = <GFX10_MtlDevice: 0x7f7d30118000>
        name = AMD Radeon Pro 5700 XT 
    commandQueue = <GFXAAMD_MtlCmdQueue: 0x7f7d398a8cb0>
        label = <none> 
        device = <GFX10_MtlDevice: 0x7f7d30118000>
            name = AMD Radeon Pro 5700 XT 
    retainedReferences = 1
Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: 
    (null)
    Ignored (for causing prior/excessive GPU errors) (00000004:kIOAccelCommandBufferCallbackErrorSubmissionsIgnored)
    <GFX10_MtlCmdBuffer: 0x7f7bd219b800>
    label = <none> 
    device = <GFX10_MtlDevice: 0x7f7d30118000>
        name = AMD Radeon Pro 5700 XT 
    commandQueue = <GFXAAMD_MtlCmdQueue: 0x7f7d398a8cb0>
        label = <none> 
        device = <GFX10_MtlDevice: 0x7f7d30118000>
            name = AMD Radeon Pro 5700 XT 
    retainedReferences = 1

...

Could the GPU timeout errors be triggered by garbage collection, or by something else?
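
One way to test the garbage-collection idea would be to record how long each Python GC pass takes and compare the timestamps against the slow steps. The following is only a minimal sketch using the standard `gc.callbacks` hook; the class name `GCPauseRecorder` is hypothetical, and it measures Python's cyclic collector only, not the allocator or any MPS-side caching.

```python
import gc
import time


class GCPauseRecorder:
    """Record the wall-clock duration of each Python garbage-collection pass.

    Hypothetical helper for illustration; it only observes Python's cyclic GC.
    """

    def __init__(self):
        self.pauses = []      # list of (generation, seconds)
        self._start = None

    def _callback(self, phase, info):
        # The gc module calls each callback with phase "start" before a
        # collection and "stop" after it; info["generation"] is the
        # generation being collected.
        if phase == "start":
            self._start = time.perf_counter()
        elif phase == "stop" and self._start is not None:
            elapsed = time.perf_counter() - self._start
            self.pauses.append((info["generation"], elapsed))
            self._start = None

    def __enter__(self):
        gc.callbacks.append(self._callback)
        return self

    def __exit__(self, *exc):
        gc.callbacks.remove(self._callback)
        return False  # do not swallow exceptions


if __name__ == "__main__":
    with GCPauseRecorder() as rec:
        junk = [[i] for i in range(100_000)]  # create some garbage
        del junk
        gc.collect()                          # force at least one pass
    longest = max(seconds for _, seconds in rec.pauses)
    print(f"{len(rec.pauses)} GC passes, longest {longest * 1000:.2f} ms")
```

If the recorded pauses stay in the millisecond range while individual steps take ~65 s, Python's garbage collector is unlikely to be the cause on its own.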

Versions

% python collect_env.py
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: macOS 14.4.1 (x86_64)
GCC version: Could not collect
Clang version: 14.0.6
CMake version: version 3.22.1
Libc version: N/A

Python version: 3.10.13 (main, Sep 11 2023, 08:21:04) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-10.16-x86_64-i386-64bit
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz

Versions of relevant libraries:
[pip3] audiolm-pytorch==0.0.1
[pip3] configmypy==0.1.0
[pip3] mypy==1.4.1
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.26.4
[pip3] onnxruntime==1.17.1
[pip3] optree==0.11.0
[pip3] pytorch-transformers==1.1.0
[pip3] tensorly-torch==0.4.0
[pip3] torch==2.2.2
[pip3] torch-cluster==1.6.1
[pip3] torch-harmonics==0.6.5
[pip3] torch-scatter==2.1.1
[pip3] torch-sparse==0.6.17
[pip3] torch-spline-conv==1.2.2
[pip3] torch-struct==0.5
[pip3] torch-summary==1.4.5
[pip3] torch-utils==0.1.2
[pip3] torchaudio==2.2.2
[pip3] torchdata==0.7.1
[pip3] torchtext==0.17.2
[pip3] torchtraining-nightly==1604016577
[pip3] torchvision==0.17.2
[pip3] triton==2.1.0
[pip3] vector-quantize-pytorch==0.9.2
[conda] mkl                       2023.2.1                 pypi_0    pypi
[conda] nomkl                     3.0                           0  
[conda] numpy                     1.26.4          py310hf6dca73_0  
[conda] numpy-base                1.26.4          py310hd8f4981_0  
[conda] optree                    0.11.0                   pypi_0    pypi
[conda] pytorch-transformers      1.1.0                    pypi_0    pypi
[conda] tensorly-torch            0.4.0                    pypi_0    pypi
[conda] torch                     2.4.0a0+git409b1a6          pypi_0    pypi
[conda] torch-cluster             1.6.1                    pypi_0    pypi
[conda] torch-harmonics           0.6.5                    pypi_0    pypi
[conda] torch-scatter             2.1.1                    pypi_0    pypi
[conda] torch-sparse              0.6.17                   pypi_0    pypi
[conda] torch-spline-conv         1.2.2                    pypi_0    pypi
[conda] torch-struct              0.5                      pypi_0    pypi
[conda] torch-summary             1.4.5                    pypi_0    pypi
[conda] torch-utils               0.1.2                    pypi_0    pypi
[conda] torchaudio                2.2.2                    pypi_0    pypi
[conda] torchdata                 0.7.1                    pypi_0    pypi
[conda] torchtext                 0.17.2                   pypi_0    pypi
[conda] torchtraining-nightly     1604016577               pypi_0    pypi
[conda] torchvision               0.17.2                   pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi
[conda] vector-quantize-pytorch   0.9.2                    pypi_0    pypi

cc @malfet @albanD @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @kulinseth @DenisVieriu97 @jhavukainen

malfet commented 5 months ago

Can you provide a minimal reproducer? To the best of my knowledge, llama2.c does not use PyTorch in any way, nor does it use GPU acceleration.

dbl001 commented 5 months ago

llama2.c uses PyTorch when training models. The inference part (e.g. 'run.c') does NOT use PyTorch. https://github.com/karpathy/llama2.c

Here's an example of the training process using the tinystories dataset.

$ python tinystories.py download
$ python tinystories.py train_vocab --vocab_size=4096
$ python tinystories.py pretokenize --vocab_size=4096
$ python train.py --vocab_source=custom --vocab_size=4096

I used a dataset generated from COVID-19 research papers. https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge/data

The exception was raised while training a Llama 2 model with 12 layers and 12 heads, with device='mps', on a dataset built from 801,915 research papers. It happened only once in 25,000 training steps.
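
Since the failure is rare, a smaller reproducer might start from timing every step with an explicit device sync, so that asynchronously queued GPU work is charged to the step that issued it. This is a rough sketch rather than anything taken from llama2.c: `time_steps`, `step_fn`, `sync_fn` and `slow_factor` are hypothetical names, and the code itself is device-agnostic.

```python
import time
from statistics import median


def time_steps(step_fn, n_steps, sync_fn=None, slow_factor=5.0):
    """Run step_fn(i) for i in range(n_steps) and time each call.

    sync_fn, if given, is called after every step so that asynchronously
    queued device work is included in that step's measurement.
    Returns (durations_ms, slow_indices), where slow_indices are the steps
    that took more than slow_factor times the median duration.
    """
    durations_ms = []
    for i in range(n_steps):
        t0 = time.perf_counter()
        step_fn(i)
        if sync_fn is not None:
            sync_fn()
        durations_ms.append((time.perf_counter() - t0) * 1000.0)

    if not durations_ms:
        return [], []

    baseline = median(durations_ms)
    slow = [i for i, d in enumerate(durations_ms) if d > slow_factor * baseline]
    return durations_ms, slow
```

With PyTorch on MPS, `torch.mps.synchronize` could be passed as `sync_fn`, and `step_fn` would wrap one forward/backward/optimizer step. Note that the median is only a useful baseline when fewer than half of the timed steps are slow.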

[Screenshots: 2024-05-10 at 7:01:28 PM and 2024-05-11 at 7:56:21 AM]

[Attachment: output.txt]

Do you know what could cause this exception (e.g. garbage collection taking too long)? And why are some steps so slow (highlighted in bold below)?

11520 | loss 7.5121 | lr 2.892669e-05 | 2504.32ms | mfu 0.42%
11530 | loss 7.1798 | lr 2.889503e-05 | 2536.12ms | mfu 0.43%
**11540 | loss 7.5530 | lr 2.886336e-05 | 64845.53ms | mfu 0.39%
11550 | loss 7.3821 | lr 2.883169e-05 | 64852.63ms | mfu 0.35%**
11560 | loss 7.3344 | lr 2.880000e-05 | 2569.23ms | mfu 0.37%
**11570 | loss 7.3546 | lr 2.876832e-05 | 64916.63ms | mfu 0.34%
11580 | loss 7.1987 | lr 2.873662e-05 | 64903.14ms | mfu 0.31%**

I built PyTorch with USE_MIMALLOC set to TRUE. Could this explain the delays?
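
To see how often these slow steps occur over a whole run, the training log can be scanned automatically. The sketch below assumes the log line format shown above and takes the typical ~2500 ms step time as a baseline; `find_slow_iterations`, `baseline_ms` and `factor` are hypothetical names. A fixed baseline is used instead of the median because in this excerpt more than half of the lines are slow.

```python
import re

# Matches lines such as:
#   11540 | loss 7.5530 | lr 2.886336e-05 | 64845.53ms | mfu 0.39%
# Leading "**" bold markers are tolerated.
LINE_RE = re.compile(r"^\**(\d+) \| loss \S+ \| lr \S+ \| ([\d.]+)ms \|")


def find_slow_iterations(log_text, baseline_ms, factor=5.0):
    """Return [(iteration, ms)] for log lines slower than factor * baseline_ms."""
    slow = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not a per-step timing line
        iteration, ms = int(m.group(1)), float(m.group(2))
        if ms > factor * baseline_ms:
            slow.append((iteration, ms))
    return slow


SAMPLE_LOG = """\
11520 | loss 7.5121 | lr 2.892669e-05 | 2504.32ms | mfu 0.42%
11530 | loss 7.1798 | lr 2.889503e-05 | 2536.12ms | mfu 0.43%
**11540 | loss 7.5530 | lr 2.886336e-05 | 64845.53ms | mfu 0.39%
11550 | loss 7.3821 | lr 2.883169e-05 | 64852.63ms | mfu 0.35%**
11560 | loss 7.3344 | lr 2.880000e-05 | 2569.23ms | mfu 0.37%
**11570 | loss 7.3546 | lr 2.876832e-05 | 64916.63ms | mfu 0.34%
11580 | loss 7.1987 | lr 2.873662e-05 | 64903.14ms | mfu 0.31%**
"""

if __name__ == "__main__":
    slow = find_slow_iterations(SAMPLE_LOG, baseline_ms=2500.0)
    print([iteration for iteration, _ in slow])  # [11540, 11550, 11570, 11580]
```

The slow entries in this excerpt all sit near 65 s, roughly 62 s above the normal step time, which looks more like a fixed stall than a gradual slowdown.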

ashwani-rathee commented 3 months ago

I hit a similar bug on an M2 Pro while running https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html. I would love to help if someone can guide me:

Mask R-CNN model error summary

Downloading: "https://download.pytorch.org/models/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth" to /Users/ash/.cache/torch/hub/checkpoints/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth
100.0%
Error: command buffer exited with error status.
        The Metal Performance Shaders operations encoded on it may not have completed.
        Error: 
        (null)
        Internal Error (0000000e:Internal Error)
        <AGXG14XFamilyCommandBuffer: 0x1563c4fa0>
    label = <none> 
    device = <AGXG14SDevice: 0x126aefe00>
        name = Apple M2 Pro 
    commandQueue = <AGXG14XFamilyCommandQueue: 0x126b2c600>
        label = <none> 
        device = <AGXG14SDevice: 0x126aefe00>
            name = Apple M2 Pro 
    retainedReferences = 1
Error: command buffer exited with error status.
        The Metal Performance Shaders operations encoded on it may not have completed.
        Error: 
        (null)
        Internal Error (0000000e:Internal Error)
        <AGXG14XFamilyCommandBuffer: 0x130b2b880>
    label = <none> 
    device = <AGXG14SDevice: 0x126aefe00>
        name = Apple M2 Pro 
    commandQueue = <AGXG14XFamilyCommandQueue: 0x126b2c600>
        label = <none> 
        device = <AGXG14SDevice: 0x126aefe00>
            name = Apple M2 Pro 
    retainedReferences = 1
Error: command buffer exited with error status.
        The Metal Performance Shaders operations encoded on it may not have completed.
        Error: 
        (null)
        Internal Error (0000000e:Internal Error)
        <AGXG14XFamilyCommandBuffer: 0x39806d500>
    label = <none> 
    device = <AGXG14SDevice: 0x126aefe00>
        name = Apple M2 Pro 
    commandQueue = <AGXG14XFamilyCommandQueue: 0x126b2c600>
        label = <none> 
        device = <AGXG14SDevice: 0x126aefe00>
            name = Apple M2 Pro 
    retainedReferences = 1
Error: command buffer exited with error status.
        The Metal Performance Shaders operations encoded on it may not have completed.
        Error: 
        (null)
        Internal Error (0000000e:Internal Error)
        <AGXG14XFamilyCommandBuffer: 0x397b22db0>
    label = <none> 
    device = <AGXG14SDevice: 0x126aefe00>
        name = Apple M2 Pro 
    commandQueue = <AGXG14XFamilyCommandQueue: 0x126b2c600>
        label = <none> 
        device = <AGXG14SDevice: 0x126aefe00>
            name = Apple M2 Pro 
    retainedReferences = 1
Error: command buffer exited with error status.
        The Metal Performance Shaders operations encoded on it may not have completed.
        Error: 
        (null)
        Internal Error (0000000e:Internal Error)
        <AGXG14XFamilyCommandBuffer: 0x397b22db0>
    label = <none> 
    device = <AGXG14SDevice: 0x126aefe00>
        name = Apple M2 Pro 
    commandQueue = <AGXG14XFamilyCommandQueue: 0x126b2c600>
        label = <none> 
        device = <AGXG14SDevice: 0x126aefe00>
            name = Apple M2 Pro 
    retainedReferences = 1
Epoch: [0]  [ 0/60]  eta: 1:05:45  lr: 0.000090  loss: 2.1025 (2.1025)  loss_classifier: 0.6907 (0.6907)  loss_box_reg: 0.4100 (0.4100)  loss_mask: 0.9692 (0.9692)  loss_objectness: 0.0301 (0.0301)  loss_rpn_box_reg: 0.0024 (0.0024)  time: 65.7636  data: 0.0313
Error: command buffer exited with error status.
        The Metal Performance Shaders operations encoded on it may not have completed.
        Error: 
        (null)
        Internal Error (0000000e:Internal Error)
        <AGXG14XFamilyCommandBuffer: 0x3980e3080>
    label = <none> 
    device = <AGXG14SDevice: 0x126aefe00>
        name = Apple M2 Pro 
    commandQueue = <AGXG14XFamilyCommandQueue: 0x126b2c600>
        label = <none> 
        device = <AGXG14SDevice: 0x126aefe00>
            name = Apple M2 Pro 
    retainedReferences = 1
^C^C^C^CTraceback (most recent call last):
  File "/Users/ash/projects/maskrcnn/./final.py", line 62, in <module>
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
  File "/Users/ash/projects/maskrcnn/engine.py", line 52, in train_one_epoch
    optimizer.step()
  File "/Users/ash/miniforge3/envs/dl/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
    return wrapped(*args, **kwargs)
  File "/Users/ash/miniforge3/envs/dl/lib/python3.9/site-packages/torch/optim/optimizer.py", line 391, in wrapper
    out = func(*args, **kwargs)
  File "/Users/ash/miniforge3/envs/dl/lib/python3.9/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/Users/ash/miniforge3/envs/dl/lib/python3.9/site-packages/torch/optim/sgd.py", line 80, in step
    sgd(params_with_grad,
  File "/Users/ash/miniforge3/envs/dl/lib/python3.9/site-packages/torch/optim/sgd.py", line 245, in sgd
    func(params,
  File "/Users/ash/miniforge3/envs/dl/lib/python3.9/site-packages/torch/optim/sgd.py", line 286, in _single_tensor_sgd
    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
KeyboardInterrupt
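
The status codes differ between the two reports (timeout and ignored submissions on the AMD GPU, internal error on the M2 Pro), so it may help to tally them when comparing logs. This is a small sketch that assumes each status appears at the end of a line as `(8 hex digits:name)`, as in the output above; `summarize_command_buffer_errors` is a hypothetical name.

```python
import re
from collections import Counter

# Matches a trailing "(00000002:kIOAccelCommandBufferCallbackErrorTimeout)"
# style status; the 8-hex-digit prefix keeps ordinary parentheticals out.
STATUS_RE = re.compile(r"\(([0-9a-fA-F]{8}):([^()]+)\)\s*$")


def summarize_command_buffer_errors(log_text):
    """Count Metal command-buffer error statuses as {(code, name): count}."""
    counts = Counter()
    for line in log_text.splitlines():
        m = STATUS_RE.search(line)
        if m:
            code = int(m.group(1), 16)  # status code is printed in hex
            counts[(code, m.group(2).strip())] += 1
    return counts


if __name__ == "__main__":
    sample = """\
    Caused GPU Timeout Error (00000002:kIOAccelCommandBufferCallbackErrorTimeout)
    Ignored (for causing prior/excessive GPU errors) (00000004:kIOAccelCommandBufferCallbackErrorSubmissionsIgnored)
        Internal Error (0000000e:Internal Error)
    """
    for (code, name), n in summarize_command_buffer_errors(sample).items():
        print(f"{code:#04x} {name}: {n}")
```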

Collect Env

PyTorch version: 2.3.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.5 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: version 3.29.6
Libc version: N/A

Python version: 3.9.19 | packaged by conda-forge | (main, Mar 20 2024, 12:55:20)  [Clang 16.0.6 ] (64-bit runtime)
Python platform: macOS-14.5-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M2 Pro

Versions of relevant libraries:
[pip3] numpy==2.0.0
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[conda] No relevant packages