Open Abdurrahheem opened 7 months ago
The problem disappears when moving from the mps backend to cpu.
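For anyone wanting to try the same workaround, here is a minimal sketch of falling back to cpu. The helper name `pick_device` and its `force_cpu` flag are made up for illustration and are not part of train_gpt2.py; `torch.backends.mps.is_available()` is the standard PyTorch check.

```python
def pick_device(force_cpu: bool = False) -> str:
    """Return "mps" only when it is usable and not explicitly disabled."""
    if force_cpu:
        return "cpu"
    try:
        import torch  # imported lazily so the helper also works without torch
    except ImportError:
        return "cpu"
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        return "mps"
    return "cpu"


# Hypothetical usage: force cpu while the mps crash is being investigated.
device = pick_device(force_cpu=True)
print(f"using device: {device}")
```

The resulting string can be passed to `model.to(device)` and to tensor constructors in the usual way.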
Facing the same issue. Any fix as of yet? @Abdurrahheem Is there any way to make it work with mps backend?
I don't seem to face the same issue. Which Python version are you on? I am on Python 3.11, Torch 2.0.1, macOS 14.4.1 (M1 Pro).
All I get is this warning when generating:
warning: loc("mps_not_equal"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":253:0)): 'anec.not_equal_zero' op Invalid configuration for the following reasons: Tensor dimensions N1D1C1H1W50257 are not within supported range, N[1-65536]D[1-16384]C[1-65536]H[1-16384]W[1-16384].
warning: loc("mps_select"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":294:0)): 'anec.not_equal_zero' op Invalid configuration for the following reasons: Tensor dimensions N1D1C1H1W50257 are not within supported range, N[1-65536]D[1-16384]C[1-65536]H[1-16384]W[1-16384].
warning: loc("mps_select"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":294:0)): 'anec.not_equal_zero' op Invalid configuration for the following reasons: Tensor dimensions N1D1C1H1W50257 are not within supported range, N[1-65536]D[1-16384]C[1-65536]H[1-16384]W[1-16384].
but the program works, and I get a meaningful output.
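To make the warning easier to read, here is a small sketch that parses the two dimension strings from the message and reports which axis is out of range. The function name `out_of_range_dims` is made up for illustration. The offending value, 50257, matches GPT-2's vocabulary size, so the tensor is presumably the logits over the vocabulary, although the thread does not confirm this.

```python
import re


def out_of_range_dims(dims: str, limits: str) -> dict:
    """Compare e.g. "N1D1W50257" against "N[1-65536]D[1-16384]W[1-16384]"."""
    actual = {k: int(v) for k, v in re.findall(r"([A-Z])(\d+)", dims)}
    allowed = {
        k: (int(lo), int(hi))
        for k, lo, hi in re.findall(r"([A-Z])\[(\d+)-(\d+)\]", limits)
    }
    return {
        k: (v, allowed[k])
        for k, v in actual.items()
        if k in allowed and not allowed[k][0] <= v <= allowed[k][1]
    }


# Strings copied from the warning above.
dims = "N1D1C1H1W50257"
limits = "N[1-65536]D[1-16384]C[1-65536]H[1-16384]W[1-16384]"
bad = out_of_range_dims(dims, limits)
print(bad)  # only the W axis exceeds its limit
```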
Python version is: 3.10.10
Maybe changing the torch version would help?
By the way, I'm having issues with Torch 2.2.2, see #36
I just ran the code with Python 3.10 and Torch 2.0.1 with no issue... Just the warning I mentioned above.
Running the code with Python 3.11.4 and Torch 2.0.1 still gives me a bus error.
After a bit of investigation, the issue seems to be on this line of code: https://github.com/karpathy/llm.c/blob/a08c11b60ebb1b3300113b808c9770b0ff3a21b4/train_gpt2.py#L380
The code breaks exactly at this line and gives the bus error. I found a useful Stack Overflow page explaining why this occurs.
I am running the code on an M2 MacBook Air with 8 GB of RAM, but I do not think RAM is the issue, since if it were, the code should have run on @Abdurrahheem's machine, which is a Pro with 16 GB of RAM.
For me, the problem seems to be that I am running out of MPS memory and the process is trying to access memory already allocated to some other process, which leads to the bus error.
The two tallest spikes in GPU memory usage are when I am running the train_gpt2.py code through mps, which is followed by the bus error. To me, this seems to be happening because my MacBook is running out of GPU memory.
This re-establishes the fact that 8 GB RAM is not sufficient for DL-based applications :(
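One way to check the out-of-memory theory from inside Python, rather than from a system monitor, is to log what the MPS allocator reports just before the failing line. This is only a sketch: the helper name `mps_memory_report` is made up, and it assumes a PyTorch build that ships the `torch.mps` module (2.0 or later, as far as I know). It returns an empty dict when torch or MPS is not available.

```python
def mps_memory_report() -> dict:
    """Return MPS memory usage in MiB, or {} when MPS cannot be queried."""
    try:
        import torch
    except ImportError:
        return {}
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is None or not mps_backend.is_available():
        return {}
    if not hasattr(torch, "mps"):
        return {}  # older torch without the torch.mps module
    mib = 2 ** 20
    return {
        # memory currently occupied by tensors
        "current_allocated_mib": torch.mps.current_allocated_memory() / mib,
        # total memory the Metal driver has allocated for this process
        "driver_allocated_mib": torch.mps.driver_allocated_memory() / mib,
    }


report = mps_memory_report()
print(report)
```

Calling it before and after the model is moved to the device should show whether usage approaches the machine's limit before the crash.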
Have you tried to run the code with a smaller batch size?
With the last commits, train_gpt2.py now includes the command-line arg --batch_size, so it's even easier to tune.
Also, AFAIK there are some methods to overcome the MPS memory limitations (macOS will never assign the total system memory to the GPU only), like modifying the value of the env. variable PYTORCH_MPS_HIGH_WATERMARK_RATIO.
Honestly, though, I wouldn't recommend it if it's not of vital importance to run this program, as the docs say it may cause system failures. I never tried that either, I just read about it online.
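The two suggestions above can be combined in a small launcher. This is a sketch, not something taken from the repository: the helper name `build_run` is made up, and it assumes that train_gpt2.py sits in the current directory and takes the batch size as `--batch_size N`. As far as I know, setting PYTORCH_MPS_HIGH_WATERMARK_RATIO to 0.0 removes the allocator's upper limit, which is the setting the docs warn may cause system failures, so it is left unset by default here.

```python
import os
import sys


def build_run(batch_size: int, watermark_ratio=None):
    """Build the command and environment for one training run (not executed)."""
    cmd = [sys.executable, "train_gpt2.py", "--batch_size", str(batch_size)]
    env = dict(os.environ)
    if watermark_ratio is not None:
        # Risky: changes how much memory the MPS allocator may claim.
        env["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = str(watermark_ratio)
    return cmd, env


cmd, env = build_run(batch_size=1)
print(" ".join(cmd))
# To actually launch it:
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```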
@davmacario I have tried running with --batch_size of 1, and I am still getting the bus error. I am a bit hesitant to use the second option, since it seems a bit risky.
This seems to be related to PyTorch. It works fine with torch==2.2.0.
https://github.com/pytorch/pytorch/issues/112014
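Since the behaviour in this thread appears to depend on the Python and torch versions, it may help to attach the exact versions to each report. Here is a minimal sketch for collecting them; the helper name `environment_report` is made up for illustration.

```python
import platform


def environment_report() -> dict:
    """Collect the version details that seem to matter for this issue."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "machine": platform.machine(),
    }
    try:
        import torch
        info["torch"] = torch.__version__
    except ImportError:
        info["torch"] = "not installed"
    return info


info = environment_report()
for key, value in info.items():
    print(f"{key}: {value}")
```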
Explanation:
Fails while running train_gpt2.py after successfully downloading the pretrained weights, with error:
Error log
Library versions
System Specifications: