Open pveber opened 4 years ago
In order to make it easier to experiment, I created a branch in the owl repository, see there.
This benchmark looks cool! I am (kinda) aware of this, but nobody really did any benchmarking before to show how slow it is, so I really appreciate this.
Lacaml uses LAPACK whereas Owl uses LAPACKE due to its C layout; some of the overhead comes from this, though it is not very noticeable when the matrices are big. Also, Owl internally uses genarray, which requires conversion between array1 and genarray in the init function, and this also introduces some overhead.
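The genarray/array1 round-trip can be illustrated with the OCaml stdlib alone. This is a minimal sketch of the kind of reshape-and-view conversion described above, not Owl's actual code:

```ocaml
(* Sketch of the genarray <-> array1 conversion Owl performs internally:
   matrices are held as Bigarray.Genarray.t, while the BLAS/LAPACK entry
   points want a one-dimensional view, so each call first reshapes to 1-D
   and then views the result as an Array1. *)
let () =
  let open Bigarray in
  (* a 4x4 matrix held as a genarray *)
  let m = Genarray.create float64 c_layout [| 4; 4 |] in
  Genarray.fill m 1.0;
  (* flattening: reshape to 1-D, then view as Array1 *)
  let flat = array1_of_genarray (reshape m [| 16 |]) in
  assert (Array1.dim flat = 16);
  assert (flat.{0} = 1.0);
  print_endline "flatten ok"
```

Each such conversion is cheap in absolute terms, but on a 4x4 matrix it is comparable to the cost of the multiplication itself.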
We probably need to break down the cost inside the function, to see which operation takes the longest, and then decide how to optimise. Actually, it would be very interesting to know how the performance difference varies as a function of matrix size ... probably at some point, both lacaml and owl reach similar performance.
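A size sweep could be sketched with core_bench's indexed tests, something like the following. All three libraries (core_bench, lacaml, owl) are third-party, so this is a hedged sketch rather than a drop-in addition to the benchmark:

```ocaml
(* Hedged sketch: compare lacaml and owl mat-vec multiplication across
   matrix sizes using core_bench indexed tests. Requires the core,
   core_bench, lacaml and owl opam packages. *)
open Core
open Core_bench

let sizes = [ 4; 16; 64; 256; 1024 ]

let lacaml_tests =
  Bench.Test.create_indexed ~name:"lacaml-mat-vec-mul" ~args:sizes (fun n ->
      Staged.stage
        (let m = Lacaml.D.Mat.random n n and v = Lacaml.D.Vec.random n in
         fun () -> ignore (Lacaml.D.gemv m v)))

let owl_tests =
  Bench.Test.create_indexed ~name:"owl-mat-vec-mul" ~args:sizes (fun n ->
      Staged.stage
        (let m = Owl.Mat.uniform n n and v = Owl.Mat.uniform n 1 in
         fun () -> ignore (Owl.Mat.dot m v)))

let () = Command_unix.run (Bench.make_command [ lacaml_tests; owl_tests ])
```

The expectation is that the constant per-call overhead dominates at n = 4 and becomes negligible once gemm itself dominates at large n.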
Actually the conversion (flattening) also happens when multiplying matrices, and AFAIU this is partly where the difference comes from for small matrices. I pushed a new commit on the branch which defines several partial_dot functions, copies of dot which run respectively up to the check, alloc and flatten steps (the only remaining step is then the actual call to gemm).
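The measurement idea can be sketched with stand-in step functions (these are stubs for illustration, not Owl's internals): each partial_* variant runs a prefix of the pipeline, so the difference between consecutive timings isolates one step's cost.

```ocaml
(* Stand-ins for the successive stages of a dot-product call. *)
let check x = ignore x                          (* argument/shape validation *)
let alloc n = Array.make n 0.0                  (* result allocation *)
let flatten m = Array.concat (Array.to_list m)  (* 2-D -> 1-D copy *)

(* Each partial_* variant stops after one more stage; benchmarking all of
   them attributes the total cost stage by stage. *)
let partial_check m v = check (m, v)
let partial_alloc m v =
  check (m, v);
  ignore (alloc (Array.length m))
let partial_flatten m v =
  check (m, v);
  let r = alloc (Array.length m) in
  ignore (flatten m);
  r  (* the only step left out is the actual gemm call *)
```

For example, the flatten step's cost is (time of partial_flatten) minus (time of partial_alloc).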
Here is the result:
┌───────────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├───────────────────────────┼──────────┼─────────┼────────────┤
│ lacaml-mat-vec-mul-4 │ 220.29ns │ 22.00w │ 37.10% │
│ owl-check-mat-vec-mul-4 │ 50.62ns │ 26.00w │ 8.53% │
│ owl-alloc-mat-vec-mul-4 │ 106.83ns │ 37.00w │ 17.99% │
│ owl-flatten-mat-vec-mul-4 │ 345.95ns │ 107.00w │ 58.26% │
│ owl-mat-vec-mul-4 │ 593.79ns │ 148.00w │ 100.00% │
└───────────────────────────┴──────────┴─────────┴────────────┘
After switching from lacaml to owl on one of my projects, I observed a significant (x2) performance regression. I managed to track it down to the fact that I'm performing a lot of matrix operations on small (typically 4x4) matrices. So it seems that owl introduces a non-negligible overhead for lapack operations. In order to document the problem and make it reproducible, I wrote a tiny benchmark. I also noticed that vector/matrix creation via the init function is notably slower. Is there any chance to reduce that overhead?