Add 'matching' method option to identity_by_state

timothymillar commented 2 weeks ago

[x] Fixes #1227
[x] Tests added
[x] User visible changes (including notable bug fixes) are documented in changelog.rst

jeromekelleher commented 2 weeks ago

Requiring a single chunk in the samples dimension is a bit restrictive - perhaps we could add a note on how that could be generalised?

timothymillar commented 2 weeks ago

Thanks @jeromekelleher, I can have a look into chunking in the samples dimension before I merge.

timothymillar commented 1 week ago

Do you want to have another look at this @jeromekelleher? It now supports chunking of the samples dimension with quite a bit more code. I would have preferred to avoid having two gufuncs but doing so gives an almost 2x speed up when there is no chunking in the samples dimension (most common case).

Some rough benchmarks show how the 'matching' approach scales much better with the number of alleles. Memory ussage is also much better. The 'frequencies' method is roughly an order of magnitude faster in the biallelic case:

matching_perf

(Other params where: n_variant=10_000, n_sample=100, n_ploidy=4, missing_pct=0.01 and chunks of 5000 variants * 50 samples).

sgkit-dev / sgkit

Add 'matching' method option to identity_by_state #1229