Native BLAS operations slow?

scalanlp / breeze

Breeze is a numerical processing library for Scala.

www.scalanlp.org

Apache License 2.0


Native BLAS operations slow? #854

Open kwalcock opened 9 months ago

kwalcock commented 9 months ago

So far, every machine that can load the native BLAS support supplied by the netlib transitive dependency (i.e., Linux-aarch64 or Linux-amd64) runs my app about 3x slower than it runs with the Scala/Java implementation. Do others notice the same? I just have some fairly simple matrix multiplications and vector additions. Is there any way to simply disable the native code? The Scala interface is really nice and I'd like to keep using it. To compare performance, I remove the .so files from the blas-3.0.1.jar file so that the native code fails to load, but I can't ship my project with that hack.

dlwh commented 9 months ago

That's surprising to me... I'd hope @luhenry would be open to adding an env variable or something to disable native code.

Is it true even for very large matrices? I'd have thought there was a size at which native is going to win. If there's a threshold, I'm happy to put one into Breeze (which we already do for dot product).
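To make the threshold idea concrete, here is a minimal sketch of size-based dispatch between two multiply kernels. It is not Breeze's or netlib's actual code: `ThresholdDispatch`, `Gemm`, `jvmGemm`, and `dispatch` are hypothetical names, and both the work estimate `m * k * n` and the cutoff value are assumptions that would have to be tuned by measurement.

```scala
// Hypothetical sketch: none of these names exist in Breeze or netlib.
object ThresholdDispatch {
  // A multiply kernel over row-major arrays: (m x k) * (k x n) => (m x n).
  type Gemm = (Array[Double], Array[Double], Int, Int, Int) => Array[Double]

  // Plain JVM triple loop, standing in for the Scala/Java code path.
  val jvmGemm: Gemm = (a, b, m, k, n) => {
    val c = new Array[Double](m * n)
    var i = 0
    while (i < m) {
      var p = 0
      while (p < k) {
        val aip = a(i * k + p)
        var j = 0
        while (j < n) {
          c(i * n + j) += aip * b(p * n + j)
          j += 1
        }
        p += 1
      }
      i += 1
    }
    c
  }

  // Route to `small` when the estimated work is below `threshold`,
  // and to `large` otherwise. The cutoff is an assumption, not a known value.
  def dispatch(small: Gemm, large: Gemm, threshold: Long): Gemm =
    (a, b, m, k, n) =>
      if (m.toLong * k * n < threshold) small(a, b, m, k, n)
      else large(a, b, m, k, n)
}
```

In a real setup, `large` would wrap the native-backed call and `small` the pure JVM one, so that tiny products never pay the cost of crossing into native code.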

luhenry commented 9 months ago

@kwalcock would you have a reproducing case?

Generally, calling into native code is slower for very small matrices (mostly because of the call overhead). That would only be visible if you are doing many, many operations. Happy to look into any case you share! :)
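One rough way to see whether a fixed per-call cost matters is to time the same kernel at several sizes and compare the time per floating-point operation. The sketch below uses only the standard library; `CallTiming`, `nanosPerCall`, and `vecMat` are hypothetical names, and the naive `vecMat` loop is a stand-in for whichever code path (native or JVM) is being measured. A dedicated harness such as JMH would give more trustworthy numbers.

```scala
// Hypothetical timing sketch; not part of Breeze or netlib.
object CallTiming {
  // Average wall-clock nanoseconds per call of `op`, measured after a
  // warm-up phase so the JIT has had a chance to compile the hot path.
  def nanosPerCall(warmup: Int, reps: Int)(op: () => Any): Double = {
    var sink: Any = null
    var i = 0
    while (i < warmup) { sink = op(); i += 1 }
    val start = System.nanoTime()
    i = 0
    while (i < reps) { sink = op(); i += 1 }
    val elapsed = System.nanoTime() - start
    if (sink == null) println("unexpected null result") // keeps the result observable
    elapsed.toDouble / reps
  }

  // Naive (1 x n) * (n x n) product over a row-major matrix; a stand-in kernel.
  def vecMat(v: Array[Double], m: Array[Double], n: Int): Array[Double] = {
    val out = new Array[Double](n)
    var p = 0
    while (p < n) {
      val vp = v(p)
      var j = 0
      while (j < n) { out(j) += vp * m(p * n + j); j += 1 }
      p += 1
    }
    out
  }

  def main(args: Array[String]): Unit =
    for (n <- Seq(8, 64, 512)) {
      val v = Array.fill(n)(1.0)
      val m = Array.fill(n * n)(1.0)
      val ns = nanosPerCall(warmup = 2000, reps = 2000)(() => vecMat(v, m, n))
      // 2 * n * n floating-point operations per call (one multiply, one add each).
      println(f"n=$n%4d  $ns%12.1f ns/call  ${ns / (2.0 * n * n)}%8.3f ns/flop")
    }
}
```

If the time per operation rises sharply as `n` shrinks, a fixed per-call cost is dominating at those sizes.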

dlwh commented 9 months ago

I'll add that "lots of tiny matmuls/matvecs" is imho a valid use case to optimize for, either at the netlib level or the Breeze level.

kwalcock commented 9 months ago

I suspect the threshold would be different for everyone. In our case, we could run the program twice, once each way, and then switch native on or off accordingly. In the data I was working with, a typical problem has four multiplications of a 57 x 768 matrix by a 768 x 768 matrix and then 10,000 multiplications of a 1 x 1536 vector by a 1536 x 1536 matrix. That is then repeated for an arbitrarily large number of problems. I don't know whether that is small, medium, or large.
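For a sense of scale, the shapes above can be turned into operation counts. This is only arithmetic on the stated dimensions: `WorkloadSize`, `flops`, and `rhsBytes` are hypothetical helper names, and the byte count assumes 8-byte doubles, which the thread does not actually state.

```scala
// Back-of-the-envelope sizing for the shapes quoted above.
object WorkloadSize {
  // Floating-point operations for an (m x k) * (k x n) product:
  // one multiply and one add per inner-loop step.
  def flops(m: Long, k: Long, n: Long): Long = 2L * m * k * n

  // Bytes occupied by the right-hand (k x n) matrix, ASSUMING 8-byte doubles.
  def rhsBytes(k: Long, n: Long): Long = 8L * k * n

  def main(args: Array[String]): Unit = {
    val matMat = 4L * flops(57, 768, 768)       // 268,959,744
    val vecMat = 10000L * flops(1, 1536, 1536)  // 47,185,920,000
    println(s"four 57x768 by 768x768 products:  $matMat flops")
    println(s"10,000 1x1536 by 1536x1536 products: $vecMat flops")
    println(s"1536x1536 matrix of doubles: ${rhsBytes(1536, 1536)} bytes") // 18,874,368
  }
}
```

Under these assumptions, the 10,000 row-vector products account for roughly 175 times as much arithmetic as the four larger products. Each of them performs about 4.7 million operations while touching an 18 MiB matrix.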

dlwh commented 9 months ago

That's definitely big enough that I would have thought native would win out.
