Closed magician-blue closed 11 months ago
Runtime(num_cores() // 2) was showing the best performance on the target machine.
@magician-blue do you have different experience with it?
@tairov In my case, I set Runtime(num_cores()) and the speed increased from 3.6 to 4.4 on tinyllama1.1B.
I'm not sure what the parameter inside Runtime means (cores or threads).
I think the issue is that Mojo 0.3.1 now reports a different CPU core count. On GitHub Codespaces, I have 4 cores. In Mojo 0.2.1, num_cores returned 4, so dividing by two (as @tairov mentioned) gave the best performance. Now, in Mojo 0.3.1, num_cores returns 2, so dividing by 2 leaves only one core.
Here is an issue with details - https://github.com/modularml/mojo/issues/950. It also has a potential explanation. My GitHub Codespaces instance actually has two physical cores with two threads each, so num_cores now returns the number of physical cores (2). This is an unexpected change, although the language is young and changes fast, so we can expect many breaking changes.
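The arithmetic behind this regression can be sketched in a few lines. This is Python rather than Mojo, purely for illustration; workers_for is a hypothetical helper and not part of llama2.mojo, and the reported values are the ones quoted above for the Codespaces instance.

```python
def workers_for(reported_cores: int, halve: bool) -> int:
    """Return the worker count handed to the runtime, never less than 1.

    `halve` mirrors the `num_cores() // 2` expression discussed in this thread.
    """
    count = reported_cores // 2 if halve else reported_cores
    return max(1, count)


# Values quoted above for a 2-core / 4-thread Codespaces instance.
OLD_REPORT = 4  # what num_cores returned on Mojo 0.2.1
NEW_REPORT = 2  # what num_cores returned on Mojo 0.3.1

print(workers_for(OLD_REPORT, halve=True))   # 2 workers: the old sweet spot
print(workers_for(NEW_REPORT, halve=True))   # 1 worker: the regression
print(workers_for(NEW_REPORT, halve=False))  # 2 workers: no halving needed
```

Under these assumptions, dropping the division on 0.3.1 restores the same worker count that the halved value gave on 0.2.1.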
So, I think we haven't made full use of the threads.
I think we are actually not making full use of the cores. If it is true that num_cores now returns the number of physical CPU cores, we cut it in half. If you try Runtime(12) (to use 12 threads), you will (I assume) most likely see a drop in performance due to threads waiting for other threads.
As far as I understand, 12 threads do not actually mean you can execute 12 programs at the same time. Threads allow for faster context switching, but only a core can run code (you have 6 of them and I have 2).
All in all, it would be good to know what happened with num_cores between 0.2.1 and 0.3.1. As soon as we clarify that, it will probably make sense to use Runtime(num_cores()), as it will give the maximum number of parallel executions.
And here is an answer - https://github.com/modularml/mojo/issues/950#issuecomment-1741526370. There was a change. So now it is better to just use num_cores and not divide by 2.
Thanks @VMois for pointing this out.
I think it's worthwhile to introduce a new param for flexibly configuring the default number of threads/cores.
Since -t is already used, we can come up with another arg name.
My options:
-T -- Threads
-th -- threads
-c -- cores
-k
-rt -- runtime threads
llama.cpp and whisper.cpp use -t for this purpose.
llama2.c uses the OMP_NUM_THREADS=NNN env variable when it's compiled in runomp mode.
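For comparison, the environment-variable approach can be sketched as follows. This is an illustrative Python snippet, not code from llama2.c or llama2.mojo; threads_from_env is a hypothetical helper, and it only handles a single integer value (OpenMP itself also accepts a comma-separated list, which this sketch treats as invalid and ignores).

```python
import os


def threads_from_env(env=None, fallback=1):
    """Read a thread count from OMP_NUM_THREADS, or return `fallback`.

    `env` defaults to the process environment; passing a dict makes the
    function easy to exercise without touching real environment variables.
    """
    env = os.environ if env is None else env
    raw = env.get("OMP_NUM_THREADS", "")
    try:
        value = int(raw)
    except ValueError:
        # Unset, empty, or not a plain integer: use the fallback.
        return fallback
    return value if value > 0 else fallback


print(threads_from_env({"OMP_NUM_THREADS": "6"}))  # 6
print(threads_from_env({}, fallback=4))            # 4
```

A command-line flag is more discoverable than an environment variable, which may be one reason the thread settles on a flag.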
Out of the options provided, I like -T or -rt (familiar to Mojo developers because of Runtime). We can also consider -j, as it is used in some other tools like cmake.
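A minimal sketch of how a -j option with a sensible default might be parsed, written in Python for illustration only. parse_workers is a hypothetical name, the real llama2.mojo argument handling may look quite different, and Python's os.cpu_count() reports logical CPUs rather than physical cores, so the default here is not equivalent to Mojo's num_cores.

```python
import argparse
import os


def parse_workers(argv):
    """Return the worker count requested via -j, or a default if absent."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-j", type=int, default=None, dest="workers",
                        help="number of workers to run in parallel")
    # Ignore flags this sketch does not know about (for example -t).
    args, _unknown = parser.parse_known_args(argv)
    if args.workers is None:
        # os.cpu_count() may return None, so fall back to a single worker.
        return os.cpu_count() or 1
    # Clamp nonsensical values such as -j 0 to one worker.
    return max(1, args.workers)


print(parse_workers(["-j", "6"]))  # 6
```

Keeping the default tied to the detected core count means users who never pass -j still get the behaviour discussed earlier in the thread.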
added -j param #44
Although I have a 6-core CPU, I actually have 12 threads.
In our code, we take it for granted that num_cores = threads:
print("num hardware threads: ", num_cores())
and
self.rt = Runtime(num_cores() // 2)
So, I think we haven't made full use of the threads.