karpathy / llm.c

LLM training in simple, raw C/CUDA
MIT License

Token out of vocabulary at train_gpt2.cu:675 #786

Open aidando73 opened 1 week ago

aidando73 commented 1 week ago

I'm trying to follow https://github.com/karpathy/llm.c/discussions/481, but I'm getting this error:

evaluating HellaSwag: 30/79
evaluating HellaSwag: 40/79
evaluating HellaSwag: 50/79
evaluating HellaSwag: 60/79
evaluating HellaSwag: 70/79
Writing state to log124M/state_00019560_00002.bin
Error: Token out of vocabulary at train_gpt2.cu:675
Error details:
  File: train_gpt2.cu
  Line: 675
  Token: -1149026846
  Position: 0
  Vocab: 50257
generating:
---
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[20376,1],0]
  Exit code:    1
--------------------------------------------------------------------------

This happens at the end of training, so I never end up getting the final model weights.
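For context, judging by the log format above, the message comes from a token-range guard in train_gpt2.cu that validates token ids against the vocabulary size. A minimal sketch of what such a check could look like (the names token_check/tokenCheck are my guesses, not the exact llm.c source):

#include <stdio.h>
#include <stdlib.h>

// Abort if any token id falls outside [0, vocab_size); prints the same
// fields that appear in the log above.
void token_check(const int *tokens, int token_count, int vocab_size,
                 const char *file, int line) {
    for (int i = 0; i < token_count; i++) {
        if (!(0 <= tokens[i] && tokens[i] < vocab_size)) {
            printf("Error: Token out of vocabulary at %s:%d\n", file, line);
            printf("Error details:\n");
            printf("  File: %s\n", file);
            printf("  Line: %d\n", line);
            printf("  Token: %d\n", tokens[i]);
            printf("  Position: %d\n", i);
            printf("  Vocab: %d\n", vocab_size);
            exit(EXIT_FAILURE);
        }
    }
}
#define tokenCheck(tokens, count, vocab) \
    token_check(tokens, count, vocab, __FILE__, __LINE__)

int main(void) {
    int toks[3] = {50256, 42, -1149026846};
    tokenCheck(toks, 3, 50257); // aborts on the third token, like my run
    return 0;
}

So any garbage value reaching the token buffer would trip this check immediately.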

Running:

nice nohup bash -c 'echo "start $(date)" && mpirun -np 8 ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -y 1 \
    -v 250 -s 20000 \
    -h 1 && echo "end $(date)"' &

You can find the step-1500 model checkpoint + state here: https://huggingface.co/aidando73/repro-gpt-2-124M/tree/086c8895ae49f2472bcde14c7866e792b0a330f1/8x_A100_40GB/log124M

Commit hash I checked out: 7ecd8906afe6ed7a2b2cdb731c042f26d525b820

Note that I didn't run python train_gpt2.py beforehand.

Anyone else getting this error?

aidando73 commented 1 week ago

> Note that I didn't run python train_gpt2.py beforehand.

When I was using train_gpt2.cu for inference, I ran into the same issue, but if I ran python train_gpt2.py beforehand, I no longer hit it.

My hypothesis is that -1149026846 is the end-of-text (EOT) token that primes generation, and that it isn't being set correctly in the case where python train_gpt2.py hasn't been run. Note the error reports Position: 0, i.e. the very first token in the buffer, which fits this. See the sketch below.
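If that's right, the failure path would look roughly like this. This is my reconstruction, assuming tokenizer_init returns early when gpt2_tokenizer.bin is missing (that file is written by python train_gpt2.py) and leaves eot_token uninitialized; the struct and function names are guesses, not the actual source:

#include <stdio.h>

typedef struct {
    int init_ok;
    int eot_token;  // should become 50256, GPT-2's <|endoftext|>
} Tokenizer;

void tokenizer_init(Tokenizer *t, const char *filename) {
    FILE *file = fopen(filename, "rb");
    if (file == NULL) {
        // tokenizer file missing (train_gpt2.py never ran): bail out,
        // leaving eot_token with whatever garbage was already in memory
        t->init_ok = 0;
        return;
    }
    // (real code: read the header and token table from the file)
    t->eot_token = 50256;
    t->init_ok = 1;
    fclose(file);
}

int main(void) {
    Tokenizer tokenizer;  // stack allocated, not zero-initialized
    tokenizer_init(&tokenizer, "gpt2_tokenizer.bin");
    // generation is primed by filling the buffer with tokenizer.eot_token;
    // if init failed, that's an indeterminate value (e.g. -1149026846),
    // which the vocab-range check at train_gpt2.cu:675 then rejects
    printf("eot_token = %d (init_ok = %d)\n",
           tokenizer.eot_token, tokenizer.init_ok);
    return 0;
}

If this is the cause, skipping generation when tokenizer init fails (or defaulting eot_token to 50256) would avoid the crash.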