malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License

Use xarray datasets for more return values #472

Open alimanfoo opened 8 months ago

alimanfoo commented 8 months ago

Currently we have a few situations where code could be simplified by returning xarray datasets instead of arrays or tuples of arrays.

A particular case motivating this issue is functions which require SNP allele counts. E.g., when computing biallelic SNP calls, we first need to compute allele counts, then use the counts to select sites. This results in two calls to snp_calls() under the hood: the first to get the allele counts, the second to build the output dataset.

It would be simpler if we modified the snp_allele_counts() function to return a SNP calls dataset with the allele counts added as an additional variable, rather than just the allele counts array. This would avoid multiple calls to snp_calls() internally.
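As a rough sketch of the idea, the following builds a toy dataset with numpy and xarray, attaches allele counts as an extra variable, and then selects biallelic sites from that same dataset. The variable and dimension names (`call_genotype`, `variant_position`, `variant_allele_count`, `sample_id`) and the helper functions are assumptions made for illustration, not the package's actual implementation.

```python
import numpy as np
import xarray as xr


def toy_snp_calls() -> xr.Dataset:
    """Build a tiny stand-in for a SNP calls dataset (names are illustrative)."""
    # 3 variants x 2 samples x ploidy 2; values are allele indices.
    gt = np.array(
        [
            [[0, 0], [0, 1]],
            [[1, 1], [1, 2]],
            [[0, 0], [0, 0]],
        ],
        dtype="i1",
    )
    return xr.Dataset(
        data_vars={
            "variant_position": ("variants", np.array([10, 20, 30])),
            "call_genotype": (("variants", "samples", "ploidy"), gt),
        },
        coords={"sample_id": ("samples", np.array(["S1", "S2"]))},
    )


def with_allele_counts(ds: xr.Dataset, n_alleles: int = 4) -> xr.Dataset:
    """Return the dataset with a hypothetical `variant_allele_count` variable added."""
    gt = ds["call_genotype"].values
    # Count how many times each allele index occurs at each variant.
    ac = np.stack(
        [(gt == allele).sum(axis=(1, 2)) for allele in range(n_alleles)], axis=1
    )
    return ds.assign(variant_allele_count=(("variants", "alleles"), ac))


ds = with_allele_counts(toy_snp_calls())

# Select biallelic sites from the same dataset, without a second data access.
is_biallelic = (ds["variant_allele_count"] > 0).sum(dim="alleles") == 2
ds_biallelic = ds.isel(variants=is_biallelic.values)
```

Because the counts travel with the genotypes and positions, the site selection and the output dataset come from one object rather than from two separate loads.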

Similarly, biallelic_diplotypes() could return a SNP calls dataset with the diplotype array added as a variable. This would avoid having to return a tuple of diplotypes and sample identifiers, because the sample identifiers, if needed, could be accessed from the dataset.
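A minimal sketch of what that could look like for callers, using toy data. It assumes diplotypes are encoded as the number of alt alleles per call, and the variable name `call_diplotype` is hypothetical.

```python
import numpy as np
import xarray as xr

# Toy biallelic genotypes: 2 variants x 3 samples x ploidy 2 (0 = ref, 1 = alt).
gt = np.array(
    [
        [[0, 0], [0, 1], [1, 1]],
        [[0, 1], [0, 0], [0, 1]],
    ],
    dtype="i1",
)
ds = xr.Dataset(
    data_vars={"call_genotype": (("variants", "samples", "ploidy"), gt)},
    coords={"sample_id": ("samples", np.array(["S1", "S2", "S3"]))},
)

# Encode each call as the number of alt alleles carried (0, 1 or 2) and
# store it under a hypothetical variable name alongside the genotypes.
ds = ds.assign(call_diplotype=ds["call_genotype"].sum(dim="ploidy"))

# Callers read both pieces from one object instead of unpacking a tuple.
diplotypes = ds["call_diplotype"].values
sample_ids = ds["sample_id"].values.tolist()
```

The sample identifiers stay aligned with the `samples` dimension of the diplotype array, so any subsetting of the dataset keeps the two in step.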

Other cases where it might also be better to return a dataset include biallelic_snp_pairwise_distances() and haplotype_pairwise_distances().