karpathy / llm.c

LLM training in simple, raw C/CUDA
MIT License

Bus ERROR while running `train_gpt2.py` #35

Open Abdurrahheem opened 7 months ago

Abdurrahheem commented 7 months ago

Explanation:

Running train_gpt2.py fails after the pretrained weights have been downloaded successfully, with the following error:

Error log

python3 train_gpt2.py
using device: mps
loading weights from pretrained gpt: gpt2
loading cached tokens in data/tiny_shakespeare_val.bin
[1]    15017 bus error  python3 train_gpt2.py

Library versions

tokenizers               0.15.2
torch                    2.0.1
torch-geometric          2.0.3
torchvision              0.15.2

System Specifications:

Platform: MacBook Pro
Chip: M2 
Memory: 16GB
MacOS: 14.4.1

Abdurrahheem commented 7 months ago

The problem disappears when moving from the mps backend to cpu.
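For anyone who wants to try the same workaround, here is a minimal sketch of a device picker with a CPU override. The names `choose_device`, `detect_device` and the `force_cpu` flag are hypothetical and not taken from train_gpt2.py; the only PyTorch calls assumed are `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def choose_device(cuda_ok: bool, mps_ok: bool, force_cpu: bool = False) -> str:
    """Pick a device name; force_cpu skips both accelerators."""
    if force_cpu:
        return "cpu"
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def detect_device(force_cpu: bool = False) -> str:
    """Ask PyTorch what is available, then delegate to choose_device."""
    import torch  # imported lazily so choose_device itself needs no torch

    return choose_device(
        cuda_ok=torch.cuda.is_available(),
        mps_ok=torch.backends.mps.is_available(),
        force_cpu=force_cpu,
    )
```

Calling `detect_device(force_cpu=True)` would return "cpu" even on an Apple Silicon machine, which is the configuration that avoided the bus error here.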

this-is-batman commented 7 months ago

Facing the same issue. Any fix as of yet? @Abdurrahheem Is there any way to make it work with mps backend?

davmacario commented 7 months ago

I don't seem to face the same issue, which Python version are you on? I am on Python 3.11, Torch 2.0.1, MacOS 14.4.1 (M1 Pro).

All I get is this warning when generating:

warning: loc("mps_not_equal"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":253:0)): 'anec.not_equal_zero' op Invalid configuration for the following reasons: Tensor dimensions N1D1C1H1W50257 are not within supported range, N[1-65536]D[1-16384]C[1-65536]H[1-16384]W[1-16384].
warning: loc("mps_select"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":294:0)): 'anec.not_equal_zero' op Invalid configuration for the following reasons: Tensor dimensions N1D1C1H1W50257 are not within supported range, N[1-65536]D[1-16384]C[1-65536]H[1-16384]W[1-16384].
warning: loc("mps_select"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":294:0)): 'anec.not_equal_zero' op Invalid configuration for the following reasons: Tensor dimensions N1D1C1H1W50257 are not within supported range, N[1-65536]D[1-16384]C[1-65536]H[1-16384]W[1-16384].

but the program works, and I get a meaningful output.

Abdurrahheem commented 7 months ago

> I don't seem to face the same issue, which Python version are you on? I am on Python 3.11, Torch 2.0.1, MacOS 14.4.1 (M1 Pro).

Python version is: 3.10.10

Abdurrahheem commented 7 months ago

> Facing the same issue. Any fix as of yet? @Abdurrahheem Is there any way to make it work with mps backend?

Maybe try a different torch version?

davmacario commented 7 months ago

> > Facing the same issue. Any fix as of yet? @Abdurrahheem Is there any way to make it work with mps backend?
>
> Maybe try a different torch version?

By the way, I'm having issues with Torch 2.2.2, see #36

davmacario commented 7 months ago

> Python version is: 3.10.10

I just ran the code with Python 3.10 and Torch 2.0.1 with no issue... Just the warning I mentioned above.

this-is-batman commented 7 months ago

Running the code with Python 3.11.4 and Torch 2.0.1 still gives me the bus error.

this-is-batman commented 7 months ago

After a bit of investigation, the issue seems to be on this line of code: https://github.com/karpathy/llm.c/blob/a08c11b60ebb1b3300113b808c9770b0ff3a21b4/train_gpt2.py#L380

The code breaks exactly at this line and gives the bus error. I found a useful Stack Overflow page explaining why this occurs.

I am running the code on an M2 MacBook Air with 8 GB of RAM, but I do not think RAM is the issue: if it were, the code should have run in @Abdurrahheem's case, since that machine is a Pro with 16 GB of RAM.

this-is-batman commented 7 months ago

For me, the problem seems to be that I am running out of MPS memory, and the process then tries to access memory already allocated to some other process, which leads to the bus error.

[Image: Screenshot 2024-04-12 at 12 39 31 AM]

The two tallest spikes in GPU memory usage occur when I run train_gpt2.py through mps, and each is followed by the bus error. To me, this suggests that my MacBook is running out of GPU memory.

This re-establishes the fact that 8 GB of RAM is not sufficient for DL-based applications :(

davmacario commented 6 months ago

> The two tallest spikes in GPU memory usage, are when I am running the train_gpt2.py code through mps, which is followed by the BUS error. To me, this seems to be happening, since my MacBook is running out of GPU memory.
>
> This re-establishes the fact that 8 GB RAM is not sufficient for DL based applications :(

Have you tried running the code with a smaller batch size? As of the latest commits, train_gpt2.py includes the command-line argument --batch_size, so it's even easier to tune.
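To get a feel for how batch size scales memory, here is a rough back-of-the-envelope sketch. It only counts the float32 logits tensor of shape (batch, sequence, vocab), not weights, gradients or other activations, so treat it as a lower bound. The vocabulary size 50257 matches the W50257 in the MPS warning above; the sequence length of 64 and the function name `logits_bytes` are assumptions for illustration.

```python
VOCAB_SIZE = 50257       # GPT-2 vocabulary size
BYTES_PER_FLOAT32 = 4


def logits_bytes(batch_size: int, seq_len: int, vocab_size: int = VOCAB_SIZE) -> int:
    """Bytes needed for one float32 logits tensor of shape (B, T, V)."""
    return batch_size * seq_len * vocab_size * BYTES_PER_FLOAT32


# Assumed sequence length of 64; memory grows linearly with batch size.
for batch_size in (1, 4, 8):
    mib = logits_bytes(batch_size, 64) / 2**20
    print(f"batch_size={batch_size}: {mib:.1f} MiB for logits alone")
```

Since the estimate is linear in batch size, halving --batch_size halves this part of the footprint, but the model weights themselves stay the same size.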

davmacario commented 6 months ago

Also, AFAIK there are some ways to work around the MPS memory limit (MacOS will never assign the total system memory to the GPU alone), such as changing the value of the environment variable PYTORCH_MPS_HIGH_WATERMARK_RATIO.

Honestly, though, I wouldn't recommend it if it's not of vital importance to run this program, as the docs say it may cause system failures. I never tried that either, I just read about it online.
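If someone does want to experiment with it, a minimal sketch of how the variable could be set follows. The helper names `mps_env` and `run_training` are hypothetical; the script is launched in a child process so the variable is already in the environment when torch gets imported. As far as I understand the PyTorch docs, a ratio of 0.0 removes the upper limit entirely, which is the risky setting mentioned above.

```python
import os
import subprocess
import sys


def mps_env(ratio: float) -> dict:
    """Return a copy of the current environment with the MPS watermark ratio set."""
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    env = dict(os.environ)
    env["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = str(ratio)
    return env


def run_training(ratio: float, script: str = "train_gpt2.py") -> int:
    """Run the script in a child process with the modified environment."""
    result = subprocess.run([sys.executable, script], env=mps_env(ratio))
    return result.returncode
```

Nothing is changed in the parent process, so the setting only applies to that one run.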

this-is-batman commented 6 months ago

@davmacario I have tried running with a --batch_size of 1, and I am still getting the bus error. I am a bit hesitant to use the second option, since it seems risky.

raghavcd commented 6 months ago

This seems to be related to PyTorch. It works fine with torch==2.2.0. https://github.com/pytorch/pytorch/issues/112014
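A small sketch for checking an installed version against that release is below. The helpers `parse_version` and `at_least` are hypothetical, and the 2.2.0 threshold only reflects this report; it is not a guarantee, given that problems with 2.2.2 were also mentioned earlier in this thread (#36).

```python
import re


def parse_version(version: str) -> tuple:
    """Turn '2.0.1' or '2.2.0+cu121' into a tuple of ints such as (2, 0, 1)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def at_least(version: str, minimum: tuple = (2, 2, 0)) -> bool:
    """True if the version is at or above the given minimum."""
    return parse_version(version) >= minimum
```

In a real script the argument would be `torch.__version__`, for example `at_least(torch.__version__)`.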