dbl001 opened this issue 5 months ago
Can you provide some sort of minimal reproducer? To the best of my knowledge, llama2.c does not use PyTorch in any way (nor does it use GPU acceleration).
llama2.c uses PyTorch when training models. The inference part (e.g. 'run.c') does NOT use PyTorch. https://github.com/karpathy/llama2.c
Here's an example of the training process using the TinyStories dataset:
$ python tinystories.py download
$ python tinystories.py train_vocab --vocab_size=4096
$ python tinystories.py pretokenize --vocab_size=4096
$ python train.py --vocab_source=custom --vocab_size=4096
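For context on where the GPU comes in: train.py takes a plain device string, so pointing the run at Metal looks roughly like the sketch below. This is an illustration, not the exact file contents (the fallback logic and the toy batch are mine; llama2.c's train.py follows the nanoGPT-style config, where `device` is just an overridable string).

```python
import torch

# Use the MPS backend when it is available, otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"

# Toy token batch matching the --vocab_size=4096 run above, just to show
# tensors being created directly on the selected device.
x = torch.randint(0, 4096, (8, 256), device=device)
print(x.device)  # "mps:0" on Metal-capable Macs, "cpu" otherwise
```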
I used a dataset generated from COVID-19 research papers. https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge/data
The exception was generated when training a Llama 2 model with 12 layers and 12 heads (device='mps') on 801,915 research papers. The exception only happened once during 25,000 training epochs.
Do you know what could cause this exception (e.g., garbage collection taking too long)? And why the long step times (highlighted in bold)?
11520 | loss 7.5121 | lr 2.892669e-05 | 2504.32ms | mfu 0.42%
11530 | loss 7.1798 | lr 2.889503e-05 | 2536.12ms | mfu 0.43%
**11540 | loss 7.5530 | lr 2.886336e-05 | 64845.53ms | mfu 0.39%**
**11550 | loss 7.3821 | lr 2.883169e-05 | 64852.63ms | mfu 0.35%**
11560 | loss 7.3344 | lr 2.880000e-05 | 2569.23ms | mfu 0.37%
**11570 | loss 7.3546 | lr 2.876832e-05 | 64916.63ms | mfu 0.34%**
**11580 | loss 7.1987 | lr 2.873662e-05 | 64903.14ms | mfu 0.31%**
I built PyTorch with USE_MINALLOC set to TRUE. Could this explain the delays?
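One thing worth ruling out before blaming the allocator or garbage collection: MPS kernels are queued asynchronously, so the time logged for one step can include Metal work still draining from earlier steps. Here is a rough sketch of a synchronized per-step timer; `step_fn` is a placeholder for whatever runs one training iteration, not a function from train.py.

```python
import time
import torch

def timed_step(step_fn):
    # Drain any Metal work already queued so it is not billed to this step.
    torch.mps.synchronize()
    t0 = time.perf_counter()
    out = step_fn()
    # Wait for this step's kernels to actually finish before stopping the clock.
    torch.mps.synchronize()
    return out, (time.perf_counter() - t0) * 1000.0  # elapsed time in ms
```

If the spikes survive this kind of timing, they are genuinely in the GPU work rather than an artifact of asynchronous execution.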
Similar bug on an M2 Pro while running https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html. I would love to help if someone can guide me:
Mask R-CNN model error summary
Downloading: "https://download.pytorch.org/models/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth" to /Users/ash/.cache/torch/hub/checkpoints/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth
100.0%
Error: command buffer exited with error status.
The Metal Performance Shaders operations encoded on it may not have completed.
Error:
(null)
Internal Error (0000000e:Internal Error)
<AGXG14XFamilyCommandBuffer: 0x1563c4fa0>
label = <none>
device = <AGXG14SDevice: 0x126aefe00>
name = Apple M2 Pro
commandQueue = <AGXG14XFamilyCommandQueue: 0x126b2c600>
label = <none>
device = <AGXG14SDevice: 0x126aefe00>
name = Apple M2 Pro
retainedReferences = 1
(the same "command buffer exited with error status" block is printed four more times; only the AGXG14XFamilyCommandBuffer addresses differ)
Epoch: [0] [ 0/60] eta: 1:05:45 lr: 0.000090 loss: 2.1025 (2.1025) loss_classifier: 0.6907 (0.6907) loss_box_reg: 0.4100 (0.4100) loss_mask: 0.9692 (0.9692) loss_objectness: 0.0301 (0.0301) loss_rpn_box_reg: 0.0024 (0.0024) time: 65.7636 data: 0.0313
Error: command buffer exited with error status.
The Metal Performance Shaders operations encoded on it may not have completed.
Error:
(null)
Internal Error (0000000e:Internal Error)
<AGXG14XFamilyCommandBuffer: 0x3980e3080>
label = <none>
device = <AGXG14SDevice: 0x126aefe00>
name = Apple M2 Pro
commandQueue = <AGXG14XFamilyCommandQueue: 0x126b2c600>
label = <none>
device = <AGXG14SDevice: 0x126aefe00>
name = Apple M2 Pro
retainedReferences = 1
^C^C^C^CTraceback (most recent call last):
File "/Users/ash/projects/maskrcnn/./final.py", line 62, in <module>
train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
File "/Users/ash/projects/maskrcnn/engine.py", line 52, in train_one_epoch
optimizer.step()
File "/Users/ash/miniforge3/envs/dl/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
return wrapped(*args, **kwargs)
File "/Users/ash/miniforge3/envs/dl/lib/python3.9/site-packages/torch/optim/optimizer.py", line 391, in wrapper
out = func(*args, **kwargs)
File "/Users/ash/miniforge3/envs/dl/lib/python3.9/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/Users/ash/miniforge3/envs/dl/lib/python3.9/site-packages/torch/optim/sgd.py", line 80, in step
sgd(params_with_grad,
File "/Users/ash/miniforge3/envs/dl/lib/python3.9/site-packages/torch/optim/sgd.py", line 245, in sgd
func(params,
File "/Users/ash/miniforge3/envs/dl/lib/python3.9/site-packages/torch/optim/sgd.py", line 286, in _single_tensor_sgd
buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
KeyboardInterrupt
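For anyone who wants to poke at this without running the whole tutorial, here is a minimal single-training-step sketch against the same checkpoint the log downloads above. The image size, the synthetic target, and the SGD hyperparameters are placeholders I picked for illustration, not values from the tutorial.

```python
import torch
import torchvision

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Same pretrained Mask R-CNN the tutorial pulls from download.pytorch.org.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").to(device)
model.train()

# One synthetic image with a single box/mask target, just enough to drive
# a forward/backward pass and an optimizer step.
images = [torch.rand(3, 256, 256, device=device)]
targets = [{
    "boxes": torch.tensor([[32.0, 32.0, 128.0, 128.0]], device=device),
    "labels": torch.tensor([1], device=device),
    "masks": torch.zeros(1, 256, 256, dtype=torch.uint8, device=device),
}]

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

loss_dict = model(images, targets)   # detection models return a loss dict in train mode
loss = sum(loss_dict.values())
optimizer.zero_grad()
loss.backward()
optimizer.step()                     # the call the traceback above was interrupted in
if device.type == "mps":
    torch.mps.synchronize()          # force queued Metal command buffers to complete
print({k: round(v.item(), 4) for k, v in loss_dict.items()})
```

If the AGX command buffer error is going to appear, I would expect it around the backward/step/synchronize calls.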
Collect Env
PyTorch version: 2.3.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 14.5 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: version 3.29.6
Libc version: N/A
Python version: 3.9.19 | packaged by conda-forge | (main, Mar 20 2024, 12:55:20) [Clang 16.0.6 ] (64-bit runtime)
Python platform: macOS-14.5-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Apple M2 Pro
Versions of relevant libraries:
[pip3] numpy==2.0.0
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[conda] No relevant packages
🐛 Describe the bug
I am training llama2.c on an iMac 27" with an AMD Radeon Pro 5700 XT GPU. There are no recent nightly builds for macOS + x86_64, so I built PyTorch from source. I got this exception at epoch 11,580. I was able to resume training and haven't gotten the error again. Each epoch typically takes ~2500 ms; however, around the time of the exception, epochs were taking much longer (e.g., 64903.14 ms).
Could GPU time-out errors be caused by garbage collection? Something else?
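In case it helps anyone hitting the same thing: the reason the crash was not fatal to the run is that train.py saves checkpoints periodically, so training could be restarted from the last saved state. A stripped-down sketch of that pattern follows; the function, file name, and save interval are placeholders, not llama2.c's actual code.

```python
import torch

def maybe_checkpoint(model, optimizer, iter_num, every=1000, path="ckpt.pt"):
    # Persist enough state to resume after a GPU/command-buffer failure.
    if iter_num % every == 0:
        torch.save({
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "iter_num": iter_num,
        }, path)
```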
Versions
cc @malfet @albanD @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @kulinseth @DenisVieriu97 @jhavukainen