tairov / llama2.mojo

Inference Llama 2 in one file of pure 🔥
https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
MIT License

In some cases, we haven't made full use of threads #31

Closed magician-blue closed 11 months ago

magician-blue commented 11 months ago

Although I have a 6-core CPU, I actually have 12 threads. [screenshot attached]

In our code, we take it for granted that num_cores() equals the number of threads: print("num hardware threads: ", num_cores()) and self.rt = Runtime(num_cores() // 2).

So, I think we haven't made full use of the threads.
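
For reference, here is a minimal sketch of the setup being discussed. The two expressions are the ones quoted above; the import paths are my assumption and may differ between Mojo versions.

    from sys.info import num_cores      # assumed import path for num_cores()
    from runtime.llcl import Runtime    # assumed import path for Runtime

    fn main():
        # The code treats num_cores() as the number of hardware threads...
        print("num hardware threads: ", num_cores())
        # ...but then hands only half of that to the runtime.
        let rt = Runtime(num_cores() // 2)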

tairov commented 11 months ago

Runtime(num_cores() // 2) was showing the best performance on the target machine. @magician-blue do you have a different experience with it?

magician-blue commented 11 months ago

@tairov In my case, when I set Runtime(num_cores()), the speed increased from 3.6 to 4.4 on tinyllama1.1B. I'm not sure about the meaning of the parameter inside Runtime (cores or threads).

VMois commented 11 months ago

I think the issue is that Mojo 0.3.1 is now reporting a different count of CPU cores. On GitHub Codespaces, I have 4 cores. In Mojo 0.2.1, num_cores returned 4, so when divided by two (like @tairov mentioned) we got the best performance. Now, in Mojo 0.3.1, num_cores returns 2 cores, so if we divide by 2 we have one core only.

Here is an issue with the details: https://github.com/modularml/mojo/issues/950. It also has a potential explanation. My GitHub Codespaces instance actually has two physical cores, and each core has two threads, so num_cores now returns the number of physical cores (2). But it is an unexpected change. Although, considering the language is young and changes fast, we might expect many breaking changes.

VMois commented 11 months ago

So, I think we haven't made full use of the threads.

I think we are not making full use of the cores, actually. If it is true that num_cores now returns the number of physical CPU cores, we cut it in half. If you try to put Runtime(12) (to use 12 threads), you will (I assume) most likely see a drop in performance due to threads waiting for other threads.

As far as I understand, 12 hardware threads do not actually mean you can execute 12 programs at the same time. Threads allow for faster context switching, but only a core can run code (of which you have 6 and I have 2).

All in all, it would be good to know what happened with num_cores from 0.2.1 to 0.3.1. As soon as we clarify it, it will probably make sense to use Runtime(num_cores()), as it will provide the maximum number of parallel executions.

VMois commented 11 months ago

And here is an answer: https://github.com/modularml/mojo/issues/950#issuecomment-1741526370. There was a change. So now it is better to just use num_cores and not divide by 2.
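
If that explanation holds, the fix is a one-line change. A sketch of the before/after (the original line is the one quoted in the first post; the replacement is only a suggestion):

    # Before: tuned for Mojo 0.2.1, where num_cores() reported hardware threads
    self.rt = Runtime(num_cores() // 2)

    # After: Mojo 0.3.1 reports physical cores, so use all of them
    self.rt = Runtime(num_cores())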

tairov commented 11 months ago

Thanks @VMois for pointing this out. I think it's worthwhile to introduce a new param for flexibly configuring the default threads/cores count. Since -t is already used, we can come up with another arg name. My options:

-T  -- Threads
-th -- threads
-c  -- cores
-k
-rt -- runtime threads

llama.cpp and whisper.cpp use -t for this purpose. llama2.c uses the OMP_NUM_THREADS=NNN env variable when it's compiled in runomp mode.

VMois commented 11 months ago

Out of the options provided, I like -T or -rt (familiar to Mojo developers because of Runtime). We can also consider -j, as it is used in some other tools like cmake.

tairov commented 11 months ago

Added the -j param in #44.
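
For readers landing here later, a rough illustration of how a -j style flag could feed into the runtime. This is not the actual code from #44; the imports, the atol-based parsing, and the function names are all assumptions.

    from sys import argv
    from sys.info import num_cores      # assumed import path
    from runtime.llcl import Runtime    # assumed import path

    fn parse_workers() raises -> Int:
        # Default to all cores reported by Mojo, per the discussion above.
        var workers = num_cores()
        let args = argv()
        for i in range(1, len(args) - 1):
            if args[i] == "-j":
                workers = atol(String(args[i + 1]))
        return workers

    fn main() raises:
        let rt = Runtime(parse_workers())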