sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

Add 'matching' method option to identity_by_state #1229

Closed timothymillar closed 5 days ago

timothymillar commented 2 weeks ago
jeromekelleher commented 2 weeks ago

Requiring a single chunk in the samples dimension is a bit restrictive - perhaps we could add a note on how that could be generalised?

timothymillar commented 2 weeks ago

Thanks @jeromekelleher, I can have a look into chunking in the samples dimension before I merge.

timothymillar commented 1 week ago

Do you want to have another look at this @jeromekelleher? It now supports chunking of the samples dimension, at the cost of quite a bit more code. I would have preferred to avoid having two gufuncs, but keeping both gives an almost 2x speed-up when there is no chunking in the samples dimension (the most common case).

Some rough benchmarks show that the 'matching' approach scales much better with the number of alleles. Memory usage is also much better. However, the 'frequencies' method is roughly an order of magnitude faster in the biallelic case:

[Image: matching_perf]

(Other params were: n_variant=10_000, n_sample=100, n_ploidy=4, missing_pct=0.01 and chunks of 5000 variants * 50 samples.)
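
For readers unfamiliar with the two methods, here is a minimal NumPy sketch of the idea behind each. It is not the sgkit implementation (which uses gufuncs over dask chunks), and it does not try to reproduce the performance or memory characteristics discussed above. The function names `ibs_frequencies` and `ibs_matching` are hypothetical, and the sketch assumes genotype calls are an integer array of shape (variants, samples, ploidy) with -1 marking a missing allele. It is only meant to show that counting matching allele pairs directly and multiplying per-sample allele frequencies give the same pairwise IBS probability.

```python
import numpy as np


def ibs_frequencies(gt, n_allele):
    """IBS from per-sample allele frequencies (illustrative only)."""
    n_variant, n_sample, _ = gt.shape
    # Per-sample allele counts, shape (variants, samples, alleles).
    counts = np.zeros((n_variant, n_sample, n_allele))
    for a in range(n_allele):
        counts[:, :, a] = (gt == a).sum(axis=-1)
    # Number of non-missing alleles per call, shape (variants, samples, 1).
    called = counts.sum(axis=-1, keepdims=True)
    freq = np.divide(counts, called, out=np.zeros_like(counts), where=called > 0)
    # Sum over variants and alleles of f_i(a) * f_j(a) for each sample pair.
    num = np.einsum("via,vja->ij", freq, freq)
    # Number of variants at which both samples have at least one called allele.
    present = (called[:, :, 0] > 0).astype(float)
    den = present.T @ present
    return num / den


def ibs_matching(gt):
    """IBS by comparing alleles between samples directly (illustrative only)."""
    valid = gt >= 0
    # Compare every allele of sample i with every allele of sample j.
    # Broadcast shape: (variants, samples, samples, ploidy, ploidy).
    match = gt[:, :, None, :, None] == gt[:, None, :, None, :]
    both = valid[:, :, None, :, None] & valid[:, None, :, None, :]
    n_match = (match & both).sum(axis=(-1, -2))
    n_pairs = both.sum(axis=(-1, -2))
    per_variant = np.divide(
        n_match, n_pairs, out=np.zeros(n_match.shape), where=n_pairs > 0
    )
    # Average over variants at which the pair has at least one comparable allele.
    return per_variant.sum(axis=0) / (n_pairs > 0).sum(axis=0)


# One variant, two diploid samples with calls 0/1 and 0/0.
gt = np.array([[[0, 1], [0, 0]]])
print(ibs_matching(gt))
print(ibs_frequencies(gt, n_allele=2))
```

Note that the frequencies sketch allocates an array with one slot per allele for every call, while the matching sketch never needs to know the number of alleles. Both functions assume every pair of samples shares at least one variant with called alleles; otherwise the final division yields NaN.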