karpathy / llm.c

LLM training in simple, raw C/CUDA
MIT License

[CUDA ERROR] at file \llm.c\train_gpt2.cu:405: too many resources requested for launch (old version does not have this issue - fyi) #106

Open ross-wheeler opened 7 months ago

ross-wheeler commented 7 months ago

[System]
Device 0: NVIDIA RTX A5500
enable_tf32: 1
[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 74
val dataset num_batches: 8
batch size: 4
sequence length: 1024
val_num_batches: 10
num_activations: 2456637440
[CUDA ERROR] at file \llm.c\train_gpt2.cu:405:
too many resources requested for launch
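For context, the message above comes from the CUDA error-checking wrapper that llm.c puts around runtime calls; a minimal sketch of that kind of check (exact names assumed, not copied from the repo) looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Print the failing file/line and abort on any CUDA error.
void cuda_check(cudaError_t error, const char *file, int line) {
    if (error != cudaSuccess) {
        printf("[CUDA ERROR] at file %s:%d:\n%s\n", file, line, cudaGetErrorString(error));
        exit(EXIT_FAILURE);
    }
}
#define cudaCheck(err) (cuda_check(err, __FILE__, __LINE__))

Launch failures like "too many resources requested for launch" are reported asynchronously, so they surface when the code checks cudaGetLastError() after the kernel launch; the line number in the error is presumably that check rather than the kernel body itself.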

Switch back to the previous version?

msharmavikram commented 7 months ago

Change the following lines:

__global__  void layernorm_forward_kernel3(float* __restrict__ out, float* __restrict__ mean, float* __restrict__ rstd,
                                    const float*  __restrict__ inp, const float*  __restrict__ weight,
                                    const float* __restrict__ bias, int N, int C) {

to the following:

__global__ __launch_bounds__(1024,2) 
void layernorm_forward_kernel3(float* __restrict__ out, float* __restrict__ mean, float* __restrict__ rstd,
                                    const float*  __restrict__ inp, const float*  __restrict__ weight,
                                    const float* __restrict__ bias, int N, int C) {

I believe the kernel is running out of registers, so the compiler needs to be forced to spill some of them. __launch_bounds__(1024, 2) promises the compiler that the kernel will never be launched with more than 1024 threads per block and asks it to fit at least 2 resident blocks per SM, which caps per-thread register usage. This solved the error for me.
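One way to check this diagnosis is to ask the runtime for the kernel's compiled resource usage; a sketch using the standard cudaFuncGetAttributes call (assuming the kernel above is in scope):

#include <stdio.h>
#include <cuda_runtime.h>

// Query how many registers the compiler assigned per thread and the
// largest block size this kernel can actually be launched with.
void print_kernel_limits(void) {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, (const void*)layernorm_forward_kernel3);
    printf("registers per thread:  %d\n", attr.numRegs);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
}

If maxThreadsPerBlock comes back smaller than the 1024-thread blocks the host code launches with, that confirms the register-pressure explanation.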

ross-wheeler commented 7 months ago

Thank you, Vikram @msharmavikram! That fixed it! Should this be a generic fix, or is it card/GPU specific?

msharmavikram commented 7 months ago

This may not work on all cards. I need to determine the correctness of this fix for different thread block sizes before submitting a PR. I haven't had a chance to work on that yet.
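If the fixed bounds turn out not to suit every card, a card-agnostic option is to let the occupancy API choose a launchable block size at runtime; a sketch of that approach (illustrative only, not necessarily what any eventual PR does):

#include <cuda_runtime.h>

// Ask CUDA for a block size this kernel can launch with on the current
// device, instead of hard-coding 1024.
int pick_block_size(void) {
    int min_grid_size = 0, block_size = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                       layernorm_forward_kernel3, 0, 0);
    // Round down to a full warp in case the kernel assumes warp-aligned blocks.
    return (block_size / 32) * 32;
}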

msharmavikram commented 7 months ago

Submitted a PR: #126

msharmavikram commented 7 months ago

@ross-wheeler I no longer see this issue at the top of the tree. Let me know if you still see the error without the proposed fix.

rosslwheeler commented 7 months ago

@msharmavikram - I still need your fix above. The top of the tree did not fix my issue.