Open ross-wheeler opened 7 months ago
Change the following lines:
__global__ void layernorm_forward_kernel3(float* __restrict__ out, float* __restrict__ mean, float* __restrict__ rstd,
const float* __restrict__ inp, const float* __restrict__ weight,
const float* __restrict__ bias, int N, int C) {
to the following:
__global__ __launch_bounds__(1024,2)
void layernorm_forward_kernel3(float* __restrict__ out, float* __restrict__ mean, float* __restrict__ rstd,
const float* __restrict__ inp, const float* __restrict__ weight,
const float* __restrict__ bias, int N, int C) {
I believe the kernel is running out of registers, so the compiler has to be told to cap its register usage, spilling registers to local memory if necessary. This solved the error for me.
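To make the register argument concrete, here is a rough host-side sketch of the arithmetic behind a "too many resources requested for launch" error. It assumes a per-block register budget of 65536, which many recent NVIDIA GPUs report; the real value comes from `cudaDeviceProp::regsPerBlock`, and a kernel's actual per-thread count from `cudaFuncAttributes::numRegs` via `cudaFuncGetAttributes`. The function names and the register counts below are hypothetical and were not measured on llm.c.

```cpp
#include <cassert>

// Assumed per-block register budget; many recent NVIDIA GPUs report 65536.
// On a real device, read cudaDeviceProp::regsPerBlock instead.
constexpr int kAssumedRegsPerBlock = 65536;

// True if a block of `threads_per_block` threads, each using
// `regs_per_thread` registers, fits in the per-block register budget.
// Simplification: ignores the hardware's register allocation granularity.
constexpr bool fits_register_budget(int regs_per_thread, int threads_per_block,
                                    int regs_per_block = kAssumedRegsPerBlock) {
    return static_cast<long long>(regs_per_thread) * threads_per_block
           <= regs_per_block;
}

// Largest per-thread register count that still lets a full block launch.
constexpr int max_regs_per_thread(int threads_per_block,
                                  int regs_per_block = kAssumedRegsPerBlock) {
    return regs_per_block / threads_per_block;
}

// Hypothetical numbers for illustration only.
inline void check_examples() {
    assert(!fits_register_budget(72, 1024));  // 72 * 1024 = 73728 > 65536
    assert(fits_register_budget(64, 1024));   // 64 * 1024 = 65536, just fits
    assert(max_regs_per_thread(1024) == 64);
    assert(max_regs_per_thread(512) == 128);
}
```

Under these assumptions, a kernel compiled to more than 64 registers per thread cannot launch with 1024 threads per block. Declaring 1024 as the maximum block size in `__launch_bounds__` lets the compiler keep register use within that limit, at the cost of possible spills.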
Thank you, Vikram @msharmavikram! That fixed it! Should this be a generic fix, or is it card/GPU specific?
This may not work on all cards. I need to verify the correctness of this fix for different thread block sizes before submitting a PR. I haven't had a chance to work on that yet.
Submitted PR #126.
@ross-wheeler I do not see this issue anymore on the top of the tree. Let me know if you still see the error without the proposed fix.
@msharmavikram - I still need your above fix. The top of tree did not fix my issue.
[System]
Device 0: NVIDIA RTX A5500
enable_tf32: 1
[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 74
val dataset num_batches: 8
batch size: 4
sequence length: 1024
val_num_batches: 10
num_activations: 2456637440
[CUDA ERROR] at file \llm.c\train_gpt2.cu:405:
too many resources requested for launch
Switch back to the previous version?