jeromekelleher closed this issue 11 months ago.
The savvy and bcftools numbers are doing what you'd expect, which makes me think we're probably measuring and computing the stat correctly.
I'm getting slightly different results when running against new data. I'm suspicious about the data currently committed; I think we should run it all again.
In particular, I'm seeing savvy running roughly ten-fold faster, which is much more in keeping with prior expectations.
Closing this - it looks like the single-thread data we were comparing against was incomplete or something. After rerunning, things look much more like what you'd expect.
As noted in #57, there is a problem with sgkit's performance numbers when using multiple threads. Basically, we're doing better than we should.
We're setting up a threaded dask worker for benchmarking here: https://github.com/pystatgen/sgkit-publication/blob/69bdb18667971d16c0e45bbfd799576daf7bc1a7/src/collect_data.py#L129
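For context, a minimal sketch of the kind of thread pinning I mean is below. This is not the actual collect_data.py code; the function and toy array are purely illustrative. The idea is that passing `scheduler="threads"` and `num_workers` to `compute()` should cap the threaded scheduler's pool at a fixed size.

```python
import dask.array as da

# Minimal sketch (illustrative, not the collect_data.py code): run a toy
# computation on the local threaded scheduler with an explicit pool size,
# so the benchmark should only ever use `num_threads` Dask worker threads.
def run_toy_benchmark(num_threads):
    x = da.random.random((20_000, 1_000), chunks=(1_000, 1_000))
    # scheduler="threads" selects the local threaded scheduler;
    # num_workers fixes the size of its thread pool.
    return x.mean().compute(scheduler="threads", num_workers=num_threads)
```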
Any idea what could be causing this, @benjeffery? One explanation is that our single-thread baseline data is artificially slow for some reason. Otherwise, Dask must somehow be using more threads than we tell it to.
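One way to check the second hypothesis would be something like the sketch below (hypothetical, not in the repo; `record_thread` and `seen_threads` are made-up names): record the thread ids that actually execute tasks and confirm that only `num_workers` distinct threads show up. Note this only sees Dask task threads; native libraries such as BLAS can spawn their own threads regardless of the Dask setting, which wouldn't appear here but could still inflate the multi-thread numbers.

```python
import threading
import dask.array as da

# Hypothetical diagnostic (names are illustrative): count the distinct
# threads that actually execute Dask tasks for a given num_workers setting.
seen_threads = set()

def record_thread(block):
    if block.size:  # skip the zero-size meta call dask may make up front
        # set.add is atomic under the GIL, so this is safe across pool threads.
        seen_threads.add(threading.get_ident())
    return block

x = da.random.random((8_000, 8_000), chunks=(1_000, 1_000))
x.map_blocks(record_thread).sum().compute(scheduler="threads", num_workers=1)
print("distinct task threads observed:", len(seen_threads))
```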