haskell / vector

An efficient implementation of Int-indexed arrays (both mutable and immutable), with a powerful loop optimisation framework.

Question about the performance of fused vector operations #259

Open Ziharrk opened 4 years ago

Ziharrk commented 4 years ago

Hi, I have profiled some of my code that uses Data.Vector.Unboxed and, to my confusion, the biggest cost centre is the (>>=) on Identity from Data.Vector.Fusion.Util. In fact, it is individually responsible for 18% of my execution time. Together with its children it is responsible for 87% of the time, which is expected because my program basically only uses Vectors. I am not familiar with the stream fusion internals of Data.Vector.

Is this performance normal?

One detail about my code: I am mostly using map/fold, and I need to store a lot of intermediate result vectors in a data structure.
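Roughly, the shape of my code is the following (a simplified sketch, not the actual program):

```haskell
import qualified Data.Vector.Unboxed as VU

-- Simplified sketch: each step maps over a vector, and all the
-- intermediate vectors are kept in a data structure (here just a list).
steps :: VU.Vector Double -> [VU.Vector Double]
steps v0 = take 10 (iterate (VU.map (* 2)) v0)

-- Each stored intermediate is then folded separately.
total :: VU.Vector Double -> Double
total = sum . map (VU.foldl' (+) 0) . steps
```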

cartazio commented 4 years ago

Hello Kai/Ziharrk: if you share the code and how you're profiling, we can better understand what's afoot.

Bind on the Identity monad is a no-op, but it still sequences computation. If you're reusing intermediate vectors, referring to them more than once, you won't really benefit from stream fusion.

Stream-fusion-based optimization really shines when each intermediate vector is used only once, i.e. by the next fusible step. Otherwise the allocation avoidance can't really happen.
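For example (a minimal sketch):

```haskell
import qualified Data.Vector.Unboxed as VU

-- Fuses: the result of 'VU.map' is consumed exactly once by 'VU.sum',
-- so the intermediate vector is never allocated.
fused :: VU.Vector Int -> Int
fused = VU.sum . VU.map (+ 1)

-- Doesn't fuse: 'w' is consumed twice by disjoint computations,
-- so it must be materialised.
unfused :: VU.Vector Int -> Int
unfused v = VU.sum w + VU.product w
  where w = VU.map (+ 1) v
```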

Have you looked at the profiling options in the manual for your version of GHC? You may need to mess around with those to get more accurate profiling. Internally, all vector code is monadic; it's just that the pure stuff uses the Identity monad :). Identity bind should be optimized away, so I'd be wondering what GHC settings you're using. Vector code really, really benefits from -O2.

Ziharrk commented 4 years ago

Hi cartazio, thanks for your answer. I am using stack for my project and used stack install --profiling --ghc-options=-fprof-auto which also did a recompilation of Vector, but I am unsure about the optimization level. I will check that tomorrow.

My full code depends on some external C library, but I can create a minimal example that does not call any other libraries.

Ziharrk commented 4 years ago

I've uploaded my simplified code and the profiling results at https://github.com/Ziharrk/graph-test. I built it with stack build --profile --ghc-options="-fprof-auto -O2"
and executed it with stack exec -- graph-test +RTS -p.

I've also cleared my local stack packages first.

The results are basically the same as with my original program.

cartazio commented 4 years ago

Ok. So what this is really about is that fusion doesn't happen when there's nontrivial fan-in/fan-out. E.g., a dot product is fine, because each vector is used once. But matrix multiplication of two n-by-n matrices uses each row/column vector n times (whether via inner products or outer products).
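Concretely, a minimal sketch of the two shapes (representing matrices as boxed vectors of unboxed rows, with the second matrix pre-transposed):

```haskell
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as VU

-- Fuses nicely: each input vector is consumed exactly once.
dot :: VU.Vector Double -> VU.Vector Double -> Double
dot xs ys = VU.sum (VU.zipWith (*) xs ys)

-- Naive matrix multiply, with the second matrix given transposed:
-- each row of 'a' and each row of 'bT' is re-traversed n times,
-- so fusion cannot save the repeated reads.
matMul :: V.Vector (VU.Vector Double) -> V.Vector (VU.Vector Double)
       -> V.Vector (VU.Vector Double)
matMul a bT = V.map (\row -> VU.convert (V.map (dot row) bT)) a
```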

More broadly: fusion optimization (stream-flavored or any applicable research system I'm aware of) does not really provide a useful optimization when the vectors within an algorithm are used more than once by disjoint subcomputations.

In such cases you will benefit from internally writing an imperative algorithm, with a runST wrapper and an unsafeFreeze/freeze at the end.
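The shape of that pattern, as a minimal sketch (a running prefix sum, just for illustration):

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as VU
import qualified Data.Vector.Unboxed.Mutable as VUM

-- Fill a mutable vector imperatively inside runST, then
-- unsafeFreeze it once at the end.
prefixSums :: VU.Vector Double -> VU.Vector Double
prefixSums v = runST $ do
  let n = VU.length v
  out <- VUM.new n
  let go !acc !i
        | i >= n    = pure ()
        | otherwise = do
            let acc' = acc + VU.unsafeIndex v i
            VUM.unsafeWrite out i acc'
            go acc' (i + 1)
  go 0 0
  VU.unsafeFreeze out
```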

https://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf is the paper I generally recommend folks read.

It's worth noting that for matrix-vector and matrix-matrix products, fusion rewriting will make your computation slower. Efficient implementations take care to carefully reuse work, in terms of both time and space, for good memory locality. A naive fusion-flavored matrix multiply could actually wind up recomputing each row/column n times!

cartazio commented 4 years ago

This question does come up with exactly these algorithms very often. I still should look at what the profiling says. Maybe there's a way to tickle the rewrite rules so they don't trigger these blowups, but I suspect that would require exposing some static-analysis information to rewrite-rule syntax. Which might be a good idea.

Ziharrk commented 4 years ago

Thanks so much for your answer! I will try your suggestions :)

Am I supposed to close this issue now or what's the policy on this repository?

cartazio commented 4 years ago

It'd be great if you could share your follow-up experiments on this ticket, so it can help provide info for future folks and more ammo for some future reference doc. Questions like this do come up every few months.

Magalame commented 4 years ago

@cartazio the paper focuses on cache use but does not seem to describe the kernel that takes care of the m_r × n_r multiplication. It seems like this gist (https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0) does (and describes how to use SIMD, for example), if I understand properly. I'll try to combine the two in Haskell and see what comes out.
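For concreteness, a scalar sketch (no SIMD) of the micro-kernel idea; the mr/nr parameters and the packing layout here are just assumptions for illustration:

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Monad (forM_)
import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as VU
import qualified Data.Vector.Unboxed.Mutable as VUM

-- BLIS-style micro-kernel, scalar version: compute one mr-by-nr block
-- of C while streaming once through the packed panels.
-- Assumed layout: 'a' is mr rows x k cols, 'b' is k rows x nr cols,
-- both row-major; the result block is row-major too.
microKernel :: Int -> Int -> Int
            -> VU.Vector Double -> VU.Vector Double -> VU.Vector Double
microKernel mr nr k a b = runST $ do
  c <- VUM.replicate (mr * nr) 0
  let step !p
        | p >= k    = pure ()
        | otherwise = do
            forM_ [0 .. mr - 1] $ \i ->
              forM_ [0 .. nr - 1] $ \j -> do
                let aip = VU.unsafeIndex a (i * k + p)
                    bpj = VU.unsafeIndex b (p * nr + j)
                VUM.unsafeModify c (+ aip * bpj) (i * nr + j)
            step (p + 1)
  step 0
  VU.unsafeFreeze c
```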

cartazio commented 4 years ago

Wonderful!

I'm slowly starting to think about this sort of stuff again, so please feel totally welcome to email me or message me on IRC on Freenode if you like.

Magalame commented 4 years ago

Will do!