Open mratsim opened 5 years ago
Changing to nested parallelism.
Unfortunately, parallelizing a single loop doesn't scale well (unless we multiply bigger matrices). The BLIS multithreading README mentions multithreading at multiple levels.
Regarding nested parallelism in OpenMP, at first glance it seems quite tricky, with a real risk of oversubscription, or of OpenMP not spawning new threads on the second loop if we use a dynamic schedule. Intel suggests using the more recent OpenMP task construct.
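For reference, here is a minimal sketch of what the task-based approach could look like in C with OpenMP. It is not the implementation used in this repo: the function name `gemm_tasks`, the tile sizes `TILE_M`/`TILE_N`, and the naive inner kernel are all assumptions made for illustration. One thread walks the two outer loops and creates one task per output tile, so both loop levels feed a single thread pool without nested parallel regions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile sizes, chosen only for illustration. */
enum { TILE_M = 64, TILE_N = 64 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C[M x N] += A[M x K] * B[K x N], all row-major and contiguous. */
void gemm_tasks(size_t M, size_t N, size_t K,
                const float *A, const float *B, float *C)
{
    assert(A && B && C);

    #pragma omp parallel
    #pragma omp single
    {
        /* A single thread creates the tasks; the whole team executes them. */
        for (size_t jc = 0; jc < N; jc += TILE_N) {
            for (size_t ic = 0; ic < M; ic += TILE_M) {
                /* Each task owns one disjoint tile of C, so no locking is needed. */
                #pragma omp task firstprivate(ic, jc)
                {
                    size_t i_end = min_sz(ic + TILE_M, M);
                    size_t j_end = min_sz(jc + TILE_N, N);
                    for (size_t i = ic; i < i_end; ++i) {
                        for (size_t k = 0; k < K; ++k) {
                            float a = A[i * K + k];
                            for (size_t j = jc; j < j_end; ++j)
                                C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    } /* implicit barrier: all tasks have completed here */
}
```

Compiled without OpenMP support the pragmas are ignored and the function runs serially with the same result. Since there is only one parallel region, the oversubscription concern from nested regions does not arise in this shape, though task granularity would still need tuning.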
cc @laurae2
Benchmark on a dual Xeon Gold 6154 vs MKL:
According to the paper [2] Anatomy of High-Performance Many-Threaded Matrix Multiplication (Smith et al.), parallelism should be done around `jc` (dimension `nc`). Note that `nc` is often 4096, so we might need another distribution scheme.
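To make the concern concrete, the sketch below counts how many `jc` iterations are available for a given `N` and `nc`, and shows one possible alternative split. The alternative is an assumption for illustration, not something taken from the paper: the helper names, the thread count, and the micro-kernel width `nr` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of iterations of the jc loop: ceil(N / nc). */
size_t jc_iterations(size_t N, size_t nc)
{
    assert(nc > 0);
    return (N + nc - 1) / nc;
}

/* One possible alternative (an assumption, not from the paper):
   shrink the chunk so every thread gets some columns, rounded up
   to a multiple of a hypothetical micro-kernel width nr. */
size_t jc_chunk_per_thread(size_t N, size_t num_threads, size_t nr)
{
    assert(num_threads > 0 && nr > 0);
    size_t per_thread = (N + num_threads - 1) / num_threads;
    return (per_thread + nr - 1) / nr * nr;
}
```

With `nc = 4096`, a 4096-column matrix gives `jc_iterations(4096, 4096) == 1`, so only one thread would have work at that level. With a hypothetical 36 threads and `nr = 8`, `jc_chunk_per_thread(4096, 36, 8)` gives chunks of 120 columns, i.e. 35 chunks to distribute. Smaller chunks trade some cache blocking efficiency for parallelism, so this would need benchmarking.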