tairov / llama2.mojo

Inference Llama 2 in one file of pure 🔥
https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
MIT License

avoid data copy when reading files #79

Closed mikowals closed 9 months ago

mikowals commented 9 months ago

This both speeds up loading the models and reduces memory use. Prior to this, I think we may have had three copies of the weights in memory at once. Memory usage on my M1 Pro now peaks at ~8 GB instead of ~12 GB when running TinyLlama-1B.
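The idea behind avoiding the copy can be illustrated with memory mapping: instead of `read()`-ing the whole weights file into a fresh buffer (one extra copy) and then possibly copying again into tensor storage, the file's pages are mapped directly and sliced without duplication. The sketch below is a hypothetical Python analogue of this technique, not the actual Mojo implementation; the file name and values are made up.

```python
import mmap
import os
import struct
import tempfile

# Create a small stand-in "weights" file of little-endian float32 values.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
weights = [0.5, -1.25, 3.0]
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(weights)}f", *weights))

# Copying approach: read() materializes a second full copy of the file in RAM.
with open(path, "rb") as f:
    copied = f.read()

# Zero-copy approach: mmap maps the file's pages; a memoryview slice of the
# mapping reads through to those pages instead of duplicating the bytes.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)
    first = struct.unpack_from("<f", view, 0)[0]   # decode one float in place
    view.release()
    mm.close()
```

With several gigabytes of weights, the difference between holding one mapped copy versus two or three materialized buffers is exactly the kind of peak-memory drop described above.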

mikowals commented 9 months ago

Llamatune picks up the speedup in file loading; tokens per second is unchanged. This is the output from the stories110M benchmark.

[Screenshot: stories110M benchmark output, 2023-11-28]