tairov / llama2.mojo

Inference Llama 2 in one file of pure 🔥
https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
MIT License
2.09k stars 139 forks

str_concat memory leak #92

Open dorjeduck opened 4 months ago

dorjeduck commented 4 months ago

I am also working on a port of llama2 to mojo. Your port is excellent; I'm just doing my own for the sake of learning, probably sticking closer to the C source, and will see how it goes. While struggling with the tokenizer in Andrej's C code I took a look at your code, and I wonder if there is a (probably not problematic) memory leak in the str_concat method: I can't see right now where the memory allocated there is freed at any point. I might be completely wrong, just thought I'd drop you a line.

mikowals commented 4 months ago

Hi Martin. I think you are correct that the token memory is not explicitly freed before the program exits, and the tokens that are being merged are also not freed, which is probably more problematic.
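For readers following along, here is a minimal sketch of the pattern being described. It is written in C++ for illustration (the actual code is Mojo, and the function and variable names here are hypothetical): each candidate merge allocates a fresh buffer that is never released, and the buffers it replaces are dropped without being freed either.

```cpp
#include <cstdlib>
#include <cstring>

// Analogue of str_concat: returns a freshly allocated buffer the caller owns.
char* str_concat(const char* a, const char* b) {
    size_t la = strlen(a), lb = strlen(b);
    char* out = static_cast<char*>(malloc(la + lb + 1));
    memcpy(out, a, la);
    memcpy(out + la, b, lb);
    out[la + lb] = '\0';
    return out;
}

// Greatly simplified BPE-style merge loop (names hypothetical).
void encode_loop(char** vocab, int* tokens, int n_tokens) {
    for (int i = 0; i + 1 < n_tokens; i++) {
        char* candidate = str_concat(vocab[tokens[i]], vocab[tokens[i + 1]]);
        // ... look `candidate` up in the vocab and maybe merge ...
        // Leak: `candidate` is never released; with manual management this needs
        // free(candidate);  // once the lookup result is no longer needed
    }
}
```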

There is already a fix for this in a draft PR, which ultimately puts all the pointers into Strings and Lists of Strings, so all memory should be freed without calling free directly.
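As a rough analogue of that approach (again sketched in C++ rather than Mojo, and not taken from the PR itself): when the token text lives in owned string and list-of-string types, the buffers are released as the owners go out of scope, with no explicit free call anywhere.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: owned string types instead of raw buffers.
std::string concat_tokens(const std::string& a, const std::string& b) {
    return a + b;  // the allocation is owned by the returned std::string
}

void encode_loop(const std::vector<std::string>& vocab,
                 const std::vector<int>& tokens) {
    for (size_t i = 0; i + 1 < tokens.size(); i++) {
        std::string candidate = concat_tokens(vocab[tokens[i]], vocab[tokens[i + 1]]);
        // ... vocab lookup on `candidate` ...
    }   // `candidate` releases its buffer automatically each iteration
}
```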

dorjeduck commented 4 months ago

Sounds very good. I am now wondering whether I should go for Lists and Strings in my port or deal with memory management myself; I need to find out whether it is performance critical, but I guess it's not. Thanks for the reply, great project.

dorjeduck commented 4 months ago

Use a Dict to lookup token indices. This allows removing Quicksort and binary search code.

Did you check the performance of this change? I am playing around with the Tokenizer and noticed that Dict is really pretty slow compared to Python's {} in my tests. I have not yet checked whether a Mojo sort/search implementation following the C code is faster.
Here is a discussion on this: https://github.com/modularml/mojo/discussions/1747 . I just asked on Discord if there are any updates on this issue.
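For concreteness, the two lookup strategies under discussion look roughly like this (a C++ sketch with hypothetical names; llama2.c sorts the vocab once and binary-searches it, while the change discussed here replaces that with a hash-map lookup):

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Strategy A (llama2.c style): sort the vocab once, then binary-search each lookup.
int lookup_sorted(const std::vector<std::string>& sorted_vocab,
                  const std::vector<int>& sorted_ids,  // token ids parallel to sorted_vocab
                  const std::string& piece) {
    auto it = std::lower_bound(sorted_vocab.begin(), sorted_vocab.end(), piece);
    if (it != sorted_vocab.end() && *it == piece)
        return sorted_ids[it - sorted_vocab.begin()];
    return -1;  // not found
}

// Strategy B (the Dict direction): build a hash map once, then O(1) average lookups.
int lookup_dict(const std::unordered_map<std::string, int>& vocab_index,
                const std::string& piece) {
    auto it = vocab_index.find(piece);
    return it == vocab_index.end() ? -1 : it->second;
}
```

Either way the table is built once at load time; the question raised in this comment is only whether Mojo's Dict lookup is currently fast enough relative to the binary search it replaces.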

mikowals commented 4 months ago

As I said in the same comment you quoted:

I didn't mess with Llamatune since this is going across Mojo versions, but locally there was no change in tokens/sec. It is probably loading faster and more memory-efficiently, since this avoids the vocab sort and no longer reads the entire tokenizer.bin.

I didn't play with prompts much, which is when the dict lookup would be used. It is outside the transformer and not measured in the tokens/sec benchmark. In previous benchmarks the vocab lookup was only taking 6 ms, so I doubt there will be a difference with Dict big enough to warrant deciding on performance rather than simplifying the code base. But Llamatune reports the total runtime as well as tokens/sec, so any problems I have overlooked would be noticed there.

The Llama 2 vocab is about 32,000 tokens, though, and the binary search being replaced is O(log N) average time while a Dict lookup is O(1).
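As a rough back-of-envelope check (not from the thread itself): with a 32,000-entry vocab, a binary search needs about log2(32,000) ≈ 15 string comparisons per lookup, while a Dict lookup is one hash plus, on average, about one comparison. Given that the whole vocab-lookup phase was reported above as only ~6 ms, neither approach is likely to be the bottleneck, so the choice mostly comes down to code simplicity.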