related-sciences / gwas-analysis

GWAS data analysis experiments
Apache License 2.0

Add small representative numpy vs breeze benchmarks #4

Closed eric-czech closed 4 years ago

eric-czech commented 4 years ago

Breeze won't support matrices with more elements than the 32-bit signed int max (~2.1B), but it would still be interesting to see whether there are big differences in col/row sum operations on tall, skinny-ish ~2B element matrices compared to numpy. Those sums are essentially all that's necessary for call rate, AF, heterozygosity, and HWE filtering in QC steps, so the performance of those operations is pretty critical.
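For context, here is a minimal numpy sketch of how those QC statistics reduce to sums along an axis. The encoding (a variants x samples `int8` matrix holding alt-allele counts 0/1/2, with -1 for a missing call) and the name `qc_stats` are assumptions made for illustration, not something taken from the notebooks. HWE is left out, but it would be computed from the same per-variant genotype counts.

```python
import numpy as np

def qc_stats(calls):
    """Per-variant call rate, alt allele frequency, and heterozygosity.

    calls: (variants, samples) integer array of alt-allele counts (0, 1, 2),
    with -1 marking a missing call. This encoding is an assumption.
    """
    called = calls >= 0
    n_called = called.sum(axis=1)              # row sum of a boolean mask
    n_het = (calls == 1).sum(axis=1)           # heterozygous calls per variant
    n_hom_alt = (calls == 2).sum(axis=1)       # homozygous-alt calls per variant

    call_rate = n_called / calls.shape[1]
    alt_count = n_het + 2 * n_hom_alt
    # Variants with no called samples yield NaN rather than raising a warning.
    with np.errstate(divide="ignore", invalid="ignore"):
        af = alt_count / (2 * n_called)
        het = n_het / n_called
    return call_rate, af, het

# Tiny usage example: 2 variants x 4 samples
example = np.array([[0, 1, 2, -1],
                    [1, 1, 1, 1]], dtype=np.int8)
call_rate, af, het = qc_stats(example)
```

Every statistic here is an element-wise comparison followed by a sum along an axis, which is the operation pattern the benchmarks would need to cover.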

eric-czech commented 4 years ago

Notebooks: breeze | numpy

These only test a few operations, but they're important ones. Notably, everything is faster in numpy: random matrix generation takes ~3x longer in breeze, sums along an axis take ~2x longer, and sums along an axis after an element-wise transformation take >10x longer (yikes).
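A rough sketch of what the numpy side of such a benchmark could look like is below. It is not the notebook code: the helper names, the default matrix shape, and the choice of `x > 0.5` as the element-wise transformation are all assumptions, and the default shape is far smaller than the ~2B element matrices discussed above.

```python
import time
import numpy as np

def time_op(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def run_benchmarks(n_rows=100_000, n_cols=100, seed=0):
    """Time the three operation types on a tall, skinny random matrix."""
    rng = np.random.default_rng(seed)
    x = rng.random((n_rows, n_cols))
    return {
        # Random matrix generation
        "random_generation": time_op(lambda: rng.random((n_rows, n_cols))),
        # Sums along each axis
        "col_sum": time_op(lambda: x.sum(axis=0)),
        "row_sum": time_op(lambda: x.sum(axis=1)),
        # Sum along an axis after an element-wise transformation
        # (the threshold comparison is a stand-in chosen for illustration)
        "sum_after_transform": time_op(lambda: (x > 0.5).sum(axis=0)),
    }

if __name__ == "__main__":
    for name, seconds in run_benchmarks().items():
        print(f"{name}: {seconds:.4f}s")
```

Taking the best of a few repeats reduces noise from warm-up and allocation, though a fair comparison against breeze would also need JVM warm-up handled on that side.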

ravwojdyla commented 3 years ago

@eric-czech we should probably move these into a public issue (maybe in the paper repo), wdyt? Also, it would be interesting to see the perf difference between numpy (and/or Dask Array) and Hail's BlockMatrix.

eric-czech commented 3 years ago

> @eric-czech we should probably move these into a public issue (maybe in the paper repo), wdyt?

I thought it was public here but I opened https://github.com/pystatgen/sgkit-publication/issues/3 to frame it in the context of the paper claims. I personally think it would be great to be able to highlight the relative merits of numpy over breeze for the paper, but I'm not sure how long breeze will hang around as a part of Spark. @hammer may have some thoughts on the topic.

ravwojdyla commented 3 years ago

Ah, you are totally right @eric-czech, I forgot this was a public repo. But still +1 to the newly opened issue (thanks).