owlbarn / owl

Owl - OCaml Scientific Computing @ https://ocaml.xyz
MIT License

Overhead on matrix operations for small matrices #453

Open pveber opened 4 years ago

pveber commented 4 years ago

After switching from lacaml to owl in one of my projects, I observed a significant (x2) performance regression. I managed to track it down to the fact that I perform a lot of matrix operations on small (typically 4x4) matrices, so it seems that owl introduces a non-negligible overhead on LAPACK operations. To document the problem and make it reproducible, I wrote a tiny benchmark. I also noticed that vector/matrix creation via the init function is notably slower.

Is there any chance to reduce that overhead?
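For reference, a micro-benchmark of this shape can be sketched as below. This is a hedged sketch, not the branch's actual code: it assumes `Owl.Mat.uniform`/`Owl.Mat.dot`, `Lacaml.D.Mat.random`/`Lacaml.D.Vec.random`/`Lacaml.D.gemv`, and `Core_bench` for timing.

```ocaml
(* Sketch of a mat-vec micro-benchmark on 4x4 inputs, comparing the
   owl and lacaml code paths.  Library APIs as assumed above. *)
open Core_bench

let n = 4

(* pre-built inputs, so only the multiply is timed *)
let owl_a = Owl.Mat.uniform n n
let owl_x = Owl.Mat.uniform n 1
let lac_a = Lacaml.D.Mat.random n n
let lac_x = Lacaml.D.Vec.random n

let tests = [
  Bench.Test.create ~name:"owl-mat-vec-mul-4" (fun () ->
    ignore (Owl.Mat.dot owl_a owl_x));
  Bench.Test.create ~name:"lacaml-mat-vec-mul-4" (fun () ->
    ignore (Lacaml.D.gemv lac_a lac_x));
]

let () = Bench.bench tests
```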

pveber commented 4 years ago

To make it easier to experiment, I created a branch in the owl repository, see there

ryanrhymes commented 4 years ago

This benchmark looks cool! I was (kinda) aware of this, but nobody had done any benchmarking before to show how slow it is, so I really appreciate this.

Lacaml uses LAPACK whereas Owl uses LAPACKE because of its C-layout arrays; some of the overhead comes from this, though it is not very noticeable when the matrices are big. Also, Owl internally uses genarray, which requires a conversion between array1 and genarray in the init function, and that introduces some overhead as well.

We probably need to break the timing down inside the function, to see which operations take the longest, then decide how to optimise.
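The array1/genarray conversion mentioned above can be illustrated with plain stdlib Bigarray. This is a minimal sketch of the general pattern, not Owl's actual implementation: the `init_matrix` name and the exact sequence of steps are assumptions for illustration.

```ocaml
(* Sketch of the conversion pattern: a flat Array1 is filled by the
   user's init function, then viewed as a genarray and reshaped to 2-D.
   Each wrapping step allocates a fresh bigarray proxy, a fixed cost
   that dominates for 4x4 matrices. *)
let init_matrix n m f =
  let open Bigarray in
  let a = Array1.create float64 c_layout (n * m) in
  for i = 0 to (n * m) - 1 do
    Array1.unsafe_set a i (f i)
  done;
  (* two extra allocations per call: genarray view, then reshape *)
  reshape (genarray_of_array1 a) [| n; m |]

let () =
  let m = init_matrix 4 4 float_of_int in
  (* element (1,2) has flat C-layout index 1*4 + 2 = 6 *)
  assert (Bigarray.Genarray.get m [| 1; 2 |] = 6.)
```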

ryanrhymes commented 4 years ago

Actually, it would be very interesting to know how the performance difference varies as a function of matrix size ... probably at some point, both lacaml and owl reach similar performance.

pveber commented 4 years ago

Actually, the conversion (flattening) also happens when multiplying matrices, and AFAIU this is partly where the difference comes from for small matrices. I pushed a new commit on the branch which defines several partial_dot functions, copies of dot that run respectively up to the check, alloc and flatten steps (the only remaining step after that is the actual call to gemm).

Here is the result:

┌───────────────────────────┬──────────┬─────────┬────────────┐
│ Name                      │ Time/Run │ mWd/Run │ Percentage │
├───────────────────────────┼──────────┼─────────┼────────────┤
│ lacaml-mat-vec-mul-4      │ 220.29ns │  22.00w │     37.10% │
│ owl-check-mat-vec-mul-4   │  50.62ns │  26.00w │      8.53% │
│ owl-alloc-mat-vec-mul-4   │ 106.83ns │  37.00w │     17.99% │
│ owl-flatten-mat-vec-mul-4 │ 345.95ns │ 107.00w │     58.26% │
│ owl-mat-vec-mul-4         │ 593.79ns │ 148.00w │    100.00% │
└───────────────────────────┴──────────┴─────────┴────────────┘
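The measurement idea behind these truncated copies can be sketched with a toy, stdlib-only pipeline. This is illustrative code, not Owl's implementation: the stage names mirror the rows above, and subtracting successive timings attributes cost to each stage.

```ocaml
(* Toy decomposition of a dot pipeline: check -> alloc -> multiply.
   Each partial_dot_* variant stops one step earlier than the full
   [dot], so benchmarking all of them isolates each stage's cost. *)
let check a b = assert (Array.length a.(0) = Array.length b)

let alloc a b = Array.make_matrix (Array.length a) (Array.length b.(0)) 0.

(* stops after the shape check *)
let partial_dot_check a b = check a b

(* stops after allocating the result matrix *)
let partial_dot_alloc a b = check a b; ignore (alloc a b)

(* full pipeline: check, alloc, then a naive triple loop standing in
   for the gemm call *)
let dot a b =
  check a b;
  let c = alloc a b in
  for i = 0 to Array.length a - 1 do
    for j = 0 to Array.length b.(0) - 1 do
      for k = 0 to Array.length b - 1 do
        c.(i).(j) <- c.(i).(j) +. a.(i).(k) *. b.(k).(j)
      done
    done
  done;
  c

let () =
  let a = [| [| 1.; 2. |]; [| 3.; 4. |] |] in
  let id = [| [| 1.; 0. |]; [| 0.; 1. |] |] in
  assert (dot a id = a)  (* multiplying by the identity returns a *)
```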