Closed itramble closed 1 year ago
Wow! Looks cool. Thanks @itramble for your efforts.
Do you know what rt.parallelism_level() means?
On my VM I set threads = 3 and got the best performance. I made a silly assumption by using nthreads = num_cores() // 2. This expression leads to error /__w/modular/modular/Kernels/mojo/builtin/_startup.mojo:70:1: error: no viable expansions found
I'm happy to merge this PR as is
Do you know what rt.parallelism_level() means?
It returns the number of threads that the Runtime was constructed with (by default, Runtime() creates a runtime with num_threads=num_cores()).
This expression leads to error /__w/modular/modular/Kernels/mojo/builtin/_startup.mojo:70:1: error: no viable expansions found
Hm this diff works for me:
diff --git a/llama2.mojo b/llama2.mojo
index fa4d4a4..6b24b57 100644
--- a/llama2.mojo
+++ b/llama2.mojo
@@ -282,7 +282,7 @@ struct RunState:
         self.key_cache.alloc_zero()
         self.value_cache = Matrix3(config.n_layers, config.seq_len, config.dim)
         self.value_cache.alloc_zero()
-        self.rt = Runtime()
+        self.rt = Runtime(num_cores() // 2)
 struct TransformerWeights:
I ended up with self.rt = Runtime(num_cores() // 2).
Before that I tried aliasing, alias nthreads = num_cores() // 2 -- that's why it was failing.
Best performance so far.
Before that I tried aliasing, alias nthreads = num_cores() // 2 -- that's why it was failing.
Ah, you can only use alias for compile-time values, and num_cores() is a runtime value.
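To illustrate the alias point, here is a minimal sketch of computing the thread count at runtime instead of through an alias. The import path `runtime.llcl` is an assumption about the Mojo release in use at the time of this thread and may differ in other versions. The clamp to at least one thread is an added precaution and is not part of the diff above.

```mojo
# Assumed import path for the Mojo release used in this thread; may differ elsewhere.
from runtime.llcl import Runtime, num_cores


fn main():
    # alias nthreads = num_cores() // 2
    # ^ reported above to fail with "no viable expansions found":
    #   alias needs a compile-time value, and num_cores() is only known at runtime.

    # Compute the thread count in an ordinary runtime variable instead.
    var nthreads = num_cores() // 2
    if nthreads < 1:
        # Added precaution (assumption): never request zero threads on a 1-core machine.
        nthreads = 1

    var rt = Runtime(nthreads)
    # Per the reply above, this reports the thread count the Runtime was constructed with.
    print(rt.parallelism_level())
```

A fixed literal such as alias nthreads = 12 should still be fine as an alias, since it is known at compile time. Only the call to num_cores() forces the value to be computed at runtime.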
Increases performance from ~190 tok/s to ~400 tok/s on my machine. If you tune the number of threads with self.rt = Runtime(nthreads), it gets up to ~500 tok/s (the best I found was 12 threads).
There are a few issues with parallelize that need to be worked around: