Closed timothymillar closed 5 days ago
Requiring a single chunk in the samples dimension is a bit restrictive - perhaps we could add a note on how that could be generalised?
Thanks @jeromekelleher, I can have a look into chunking in the samples dimension before I merge.
Do you want to have another look at this @jeromekelleher? It now supports chunking of the samples dimension, at the cost of quite a bit more code. I would have preferred to avoid having two gufuncs, but doing so gives an almost 2x speedup when there is no chunking in the samples dimension (the most common case).
Some rough benchmarks show that the 'matching' approach scales much better with the number of alleles. Memory usage is also much better. The 'frequencies' method is roughly an order of magnitude faster in the biallelic case:
(Other params were: n_variant=10_000, n_sample=100, n_ploidy=4, missing_pct=0.01, and chunks of 5000 variants * 50 samples).
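For anyone wanting to try something similar, here is a rough sketch of how input with those parameters could be generated and split into chunks. It is not the actual benchmark script: `simulate_calls` and `chunk_slices` are hypothetical helpers, and the use of -1 for missing alleles, missingness applied per allele, and uniform allele sampling are all assumptions.

```python
import numpy as np

# Parameters quoted in the comment above.
n_variant, n_sample, n_ploidy = 10_000, 100, 4
missing_pct = 0.01
chunk_variants, chunk_samples = 5_000, 50


def simulate_calls(n_allele, seed=0):
    """Hypothetical helper: uniform random allele calls, -1 marks missing (assumed)."""
    rng = np.random.default_rng(seed)
    calls = rng.integers(
        0, n_allele, size=(n_variant, n_sample, n_ploidy), dtype=np.int8
    )
    # Assumption: each allele is independently missing with probability missing_pct.
    missing = rng.random(calls.shape) < missing_pct
    calls[missing] = -1
    return calls


def chunk_slices(shape):
    """Hypothetical helper: yield (variant, sample) slices for each chunk."""
    for v in range(0, shape[0], chunk_variants):
        for s in range(0, shape[1], chunk_samples):
            yield slice(v, v + chunk_variants), slice(s, s + chunk_samples)


# Biallelic case; raise n_allele to look at scaling with the number of alleles.
calls = simulate_calls(n_allele=2)
chunks = [calls[v, s] for v, s in chunk_slices(calls.shape)]
```

With these settings the array splits into four chunks (two along variants, two along samples), so any method run on it has to cope with chunking in the samples dimension.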