sgkit-dev / sgkit-publication

Sgkit publication repository

Greater than k-fold speedup in sgkit with k threads #58

Closed by jeromekelleher 11 months ago

jeromekelleher commented 11 months ago

As noted in #57 there is a problem with sgkit's performance numbers with multiple threads. Basically we're doing better than we should:

[Image: tmp]

We're setting up a threaded dask worker for benchmarking here: https://github.com/pystatgen/sgkit-publication/blob/69bdb18667971d16c0e45bbfd799576daf7bc1a7/src/collect_data.py#L129

Any idea what could be causing this @benjeffery? One explanation is that our single-thread baseline data is artificially slow for some reason. Otherwise, Dask must somehow be using more threads than we tell it to.

jeromekelleher commented 11 months ago

The savvy and bcftools numbers are doing what you'd expect, which makes me think we're probably measuring and computing the stat correctly.
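
For reference, a minimal sketch of the check being discussed: the speedup with k threads is the single-thread wall time divided by the k-thread wall time, and anything noticeably above k is suspect. The function names, the 5% tolerance, and the timings are all made up for illustration; this is not the code or data used in this repository.

```python
def speedup_table(wall_times):
    """Map thread count -> (speedup, efficiency) relative to the 1-thread run.

    `wall_times` is a dict of {threads: seconds}; it must contain a key 1.
    """
    baseline = wall_times[1]
    table = {}
    for threads, seconds in sorted(wall_times.items()):
        speedup = baseline / seconds
        # Efficiency above 1.0 means a better-than-k-fold speedup.
        table[threads] = (speedup, speedup / threads)
    return table


def superlinear(wall_times, tolerance=0.05):
    """Return thread counts whose speedup exceeds k by more than `tolerance`."""
    return [
        threads
        for threads, (_, efficiency) in speedup_table(wall_times).items()
        if efficiency > 1 + tolerance
    ]


# Hypothetical timings in seconds, not measurements from this repository.
example = {1: 100.0, 2: 52.0, 4: 20.0, 8: 9.0}
print(superlinear(example))  # [4, 8]
```

If a tool shows up in that list, the usual suspects are the two raised above: a single-thread baseline that is too slow, or more threads in use than requested.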

jeromekelleher commented 11 months ago

I'm getting slightly different results when running against new data. I'm suspicious about the data currently committed; I think we want to run it all again.

In particular, I'm seeing savvy running roughly ten-fold faster, much more in keeping with prior expectations.

jeromekelleher commented 11 months ago

Closing this. It looks like the single-thread data we were comparing against was incomplete or something. After rerunning, things are looking much more like what you'd expect.