tairov / llama2.mojo

Inference Llama 2 in one file of pure 🔥
https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
MIT License

binary search to get prompt tokens #26

Closed: mikowals closed this 12 months ago

mikowals commented 12 months ago

Locally I confirmed this outputs the same prompt_tokens as str_lookup.

Benchmarking in a GitHub Codespace with 8 cores and 32 GB of RAM, sorting took 6 ms and getting the prompt tokens for a few-sentence prompt took 1 ms. With the old str_lookup, getting the prompt tokens for the same prompt took 230 ms.

The code can be cleaned up a lot when it becomes easier to have a pointer hold a locally defined struct with both the string and index.
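For readers following along, here is a rough sketch of the idea in Python (not the actual Mojo code from this PR, and the helper names `build_sorted_vocab`, `str_lookup_linear`, and `str_lookup_sorted` are hypothetical): sort the vocabulary once as (token string, token id) pairs, then binary-search each lookup instead of scanning the whole vocabulary linearly.

```python
# Illustrative Python sketch of the approach discussed in this PR:
# sort the vocab once, then binary-search each candidate string
# instead of doing a linear scan per lookup.
import bisect


def build_sorted_vocab(vocab):
    """vocab: list of token strings, indexed by token id."""
    pairs = sorted((tok, i) for i, tok in enumerate(vocab))
    keys = [tok for tok, _ in pairs]  # token strings in sorted order
    ids = [i for _, i in pairs]       # token id for each sorted string
    return keys, ids


def str_lookup_linear(s, vocab):
    """Old approach: O(vocab_size) scan per lookup."""
    for i, tok in enumerate(vocab):
        if tok == s:
            return i
    return -1


def str_lookup_sorted(s, keys, ids):
    """New approach: O(log vocab_size) binary search per lookup."""
    pos = bisect.bisect_left(keys, s)
    if pos < len(keys) and keys[pos] == s:
        return ids[pos]
    return -1


# Example usage: both lookups agree on hits and misses.
vocab = ["<unk>", "he", "hello", " wor", "ld", "lo"]
keys, ids = build_sorted_vocab(vocab)
assert str_lookup_sorted("hello", keys, ids) == str_lookup_linear("hello", vocab) == 2
assert str_lookup_sorted("xyz", keys, ids) == str_lookup_linear("xyz", vocab) == -1
```

The one-time sort is cheap relative to tokenizing a prompt, since prompt encoding performs many lookups (one per candidate merge), which is why the per-lookup cost dominates in the benchmark above.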

tairov commented 12 months ago

Looks cool. I'll check this out a bit later.

Did you notice any tok/s performance improvement?

mikowals commented 12 months ago

The tok/s improvement was negligible for the size of prompts I was testing on. For longer prompts it is probably noticeable, but it would show up as the prompt being read a couple of seconds faster, with that gain then averaged over however many tokens the model went on to produce. So I focused on directly comparing the code being replaced.

tairov commented 12 months ago

merged! @mikowals thanks for the clean implementation!