karpathy / llama2.c

Inference Llama 2 in one file of pure C
MIT License

Segmentation fault with new models #237

Closed sergeykorablin closed 1 year ago

sergeykorablin commented 1 year ago

After the latest git pull && make, ./run crashes with newly trained models but works fine with old ones

$ ./run out/model.bin
<s>
[1]    97991 segmentation fault (core dumped)  ./run out/model.bin
karpathy commented 1 year ago

oh no what did I do. I don't feel like I made any crazy changes and things work on my Linux box and Macbook just fine. Looking...

Update: valgrind seems happy too. How strange.

karpathy commented 1 year ago

@sergeykorablin Can you please try

make rundebug
valgrind --leak-check=full ./run out/model.bin -n 5
kroggen commented 1 year ago

What do you mean by "with new trained models"?

I just pulled the changes, built, and it is running OK

twobob commented 1 year ago

I don't see any changes that should segfault. Buuuut. I'm assuming Mac from your repos and I'm checking on Windows, so it's probably meaningless.

sergeykorablin commented 1 year ago

@sergeykorablin Can you please try

make rundebug
valgrind --leak-check=full ./run out/model.bin -n 5
➜  llama2.c git:(master) ✗ make rundebug
gcc -g -o run run.c -lm
➜  llama2.c git:(master) ✗ valgrind --leak-check=full ./run out/model.bin -n 5
==127945== Memcheck, a memory error detector
==127945== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==127945== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==127945== Command: ./run out/model.bin -n 5
==127945== 
<s>
==127945== Invalid read of size 4
==127945==    at 0x401BCB: matmul (run.c:202)
==127945==    by 0x40251B: transformer (run.c:308)
==127945==    by 0x403310: main (run.c:549)
==127945==  Address 0xaea9000 is in a rwx anonymous segment
==127945== 
voy Ulrich(` enters Pont
achieved tok/s: 0.925069
==127945== 
==127945== HEAP SUMMARY:
==127945==     in use at exit: 0 bytes in 0 blocks
==127945==   total heap usage: 32,019 allocs, 32,019 frees, 32,251,897 bytes allocated
==127945== 
==127945== All heap blocks were freed -- no leaks are possible
==127945== 
==127945== For lists of detected and suppressed errors, rerun with: -s
==127945== ERROR SUMMARY: 80 errors from 1 contexts (suppressed: 0 from 0)
sergeykorablin commented 1 year ago

What do you mean by "with new trained models"? I just pulled the changes, built, and it is running OK

I have a few models I trained a day ago, plus the downloaded stories110M.bin — they work fine. Models trained from scratch right now all cause a segfault.

RahulSChand commented 1 year ago

Most likely seems to be an issue with your custom weight .bin file. Was the .bin file saved correctly or is it corrupted?

sergeykorablin commented 1 year ago

I reinstalled llama2.c and the Python venv and now it works without problems... strange

karpathy commented 1 year ago

you scared me! :)