tairov / llama2.mojo

Inference Llama 2 in one file of pure 🔥
https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
MIT License
2.09k stars 139 forks

Use String, Dict, and read_bytes to shorten and simplify #91

Closed mikowals closed 3 months ago

mikowals commented 4 months ago

This is based on the current nightly branch (Mojo 2024.4.161). It is a demo of some cleanups that can happen now that Mojo and its stdlib have added a lot of functionality that was missing when this was originally released. There could probably also be another round to remove TensorSlice and just use List[TensorF32] for each layer of weights.
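To illustrate the kind of simplification the PR title describes (using Dict to avoid the vocab sort), here is a rough sketch. It is written in Python rather than Mojo, since the Mojo nightly API changes frequently, and the function names and vocab contents are hypothetical, not taken from the actual diff:

```python
import bisect

def lookup_sorted(sorted_vocab, token):
    # old style: keep (token, id) pairs sorted by token and binary-search
    i = bisect.bisect_left(sorted_vocab, (token,))
    if i < len(sorted_vocab) and sorted_vocab[i][0] == token:
        return sorted_vocab[i][1]
    return -1

def build_index(vocab):
    # new style: one pass to build a hash map; no up-front sort needed,
    # which is roughly what Mojo's Dict now makes straightforward
    return {tok: i for i, tok in enumerate(vocab)}

vocab = ["<unk>", "he", "hello", "ll", "o"]  # toy vocab for illustration
index = build_index(vocab)
print(index["hello"])  # both approaches agree on the token id
print(lookup_sorted(sorted((t, i) for i, t in enumerate(vocab)), "hello"))
```

Both lookups return the same id; the Dict version just skips sorting the vocab at load time.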

The main changes are:

I am not sure this is ready to merge, mostly because of the handling of special bytes in wrap and the old print function. I tried to preserve the functionality but haven't tested extensively. Ideally we could get proper handling from String, and if not, fix it in the stdlib.

Also, I think the stdlib is going to shift to List[UInt8] for all bytes representations, including in String. So this change could also wait until after that has happened and been incorporated.

I didn't mess with Llamatune since this spans Mojo versions, but locally there was no change in tokens/sec. It is probably loading faster and more memory-efficiently, since this avoids the vocab sort and no longer reads the entire tokenizer.bin.
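A sketch of what "no longer reads the entire tokenizer.bin" could look like: parse length-prefixed records one at a time instead of slurping the whole file. This is a Python illustration, and the record layout assumed here (float32 score, int32 length, then the token bytes, matching the llama2.c-style format) is an assumption for demonstration, not a claim about the actual diff:

```python
import io
import struct

def load_vocab(f, vocab_size):
    # Read exactly one record at a time: 8-byte header, then `length` bytes.
    # Memory use is bounded by one token, not the whole file.
    vocab = {}
    for i in range(vocab_size):
        score, length = struct.unpack("<fi", f.read(8))
        token = f.read(length).decode("utf-8")
        vocab[token] = (i, score)
    return vocab

# Build a tiny in-memory "file" to exercise the parser.
buf = io.BytesIO()
for i, tok in enumerate([b"a", b"bc"]):
    buf.write(struct.pack("<fi", float(i), len(tok)))
    buf.write(tok)
buf.seek(0)

vocab = load_vocab(buf, 2)
print(vocab["bc"])
```

Combined with a Dict built during this single pass, there is no separate sort step and no full-file buffer.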