Closed by szhan 1 year ago
Thanks, @benjeffery !
Just for my understanding, the function should return xarray.Dataset, so that zarr can act upon it?
I'd just like to compare call_genotype of ds1 and ds2, so it makes sense to return ds1 and ds2 remapped with only the common sites. Should we just do all of that within remap_genotypes? Or is that too much going on for a single function?
Since sgkit xarray datasets have the variable variant_contig, the function should check both variant_contig and variant_position.
I'm thinking that get_matching_indices should return xarray.DataArray as well. It should help analyse big data sets, right? @benjeffery
I've modified get_matching_indices to return numpy.ndarray instead of list.
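For illustration, here is a minimal sketch of matching sites on both contig and position and returning the result as a numpy.ndarray. It is not the code in this PR: the name get_matching_indices_sketch and its four-array signature are hypothetical, and it assumes that contig indices mean the same contig in both datasets and that each (contig, position) pair occurs at most once per dataset.

```python
import numpy as np


def get_matching_indices_sketch(contig1, pos1, contig2, pos2):
    """Return an (n, 2) array of index pairs for sites present in both inputs.

    Hypothetical helper: a site matches only when its contig AND its
    position agree. Column 0 indexes the first dataset, column 1 the second.
    """
    # Map each (contig, position) key of the second dataset to its row index.
    lookup = {(int(c), int(p)): j for j, (c, p) in enumerate(zip(contig2, pos2))}
    pairs = []
    for i, (c, p) in enumerate(zip(contig1, pos1)):
        j = lookup.get((int(c), int(p)))
        if j is not None:
            pairs.append((i, j))
    # reshape keeps the (n, 2) shape even when there are no common sites.
    return np.array(pairs, dtype=np.int64).reshape(-1, 2)


idx = get_matching_indices_sketch(
    contig1=[0, 0, 1], pos1=[10, 20, 10],
    contig2=[0, 1, 1], pos2=[20, 10, 30],
)
```

The two columns could then be used to subset each dataset along its variants dimension so that only the common sites remain.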
The "compatible" ds1 and ds2 should have the same allele lists.
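To make "the same allele lists" concrete, here is a rough sketch of re-coding the genotypes at one site so that they index into a reference allele list. The function name remap_alleles_sketch is hypothetical, and the handling of alleles absent from the reference list (set to missing, -1) is an assumption made for the example, not necessarily what make_compatible_genotypes does.

```python
import numpy as np


def remap_alleles_sketch(alleles_ref, alleles_other, genotypes_other):
    """Re-code genotypes at one site to use the allele indices of alleles_ref.

    Hypothetical helper. Missing calls (negative codes) stay -1, and alleles
    not found in alleles_ref are also set to -1 (an assumption of this sketch).
    """
    index_in_ref = {allele: i for i, allele in enumerate(alleles_ref)}
    # mapping[k] is the new code for old allele index k.
    mapping = np.array(
        [index_in_ref.get(allele, -1) for allele in alleles_other], dtype=np.int64
    )
    g = np.asarray(genotypes_other)
    out = np.full(g.shape, -1, dtype=np.int64)
    called = g >= 0
    out[called] = mapping[g[called]]
    return out


# Same two alleles, listed in opposite order in the two datasets.
remapped = remap_alleles_sketch(
    alleles_ref=["A", "C"],
    alleles_other=["C", "A"],
    genotypes_other=[[0, 1], [1, 1], [-1, 0]],
)
```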
What I'd like to do is multi-way VCF comparisons, i.e., imputed genotypes from lshmm (ds_lshmm), imputed genotypes from BEAGLE (ds_beagle), and ground-truth genotypes (ds_truth). I think the easiest way is to just make ds_lshmm and ds_beagle each compatible with ds_truth by calling make_compatible_genotypes twice.
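As a sketch of what the comparison could look like once both imputed datasets have been made compatible with the truth, the example below computes the fraction of matching allele calls. The function allele_concordance is hypothetical, the arrays stand in for call_genotype values already restricted to common sites with shared allele lists, and the toy values are made up for the example.

```python
import numpy as np


def allele_concordance(truth, imputed):
    """Fraction of allele calls that agree, ignoring missing calls (< 0).

    Hypothetical helper; both arrays must already cover the same sites,
    samples, and allele coding.
    """
    truth = np.asarray(truth)
    imputed = np.asarray(imputed)
    if truth.shape != imputed.shape:
        raise ValueError("genotype arrays must have the same shape")
    called = (truth >= 0) & (imputed >= 0)
    if not called.any():
        return float("nan")
    return float((truth[called] == imputed[called]).mean())


# Toy stand-ins for the genotypes of ds_truth, ds_lshmm, and ds_beagle.
gt_truth = [[0, 1], [1, 1]]
gt_lshmm = [[0, 1], [1, 0]]
gt_beagle = [[0, 1], [-1, 1]]

conc_lshmm = allele_concordance(gt_truth, gt_lshmm)
conc_beagle = allele_concordance(gt_truth, gt_beagle)
```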
For now, let's assume that only ACGT are allowed, to keep things simple.
Also, I still need to add tests for make_compatible_genotypes.
There are still more tests that could be added, but the current tests are good enough, I think. Will continue adding more tests in a separate issue.
Addresses #94