szhan / tsimpute

Genome-wide genotype imputation using tree sequences.
MIT License

Functions to facilitate VCF comparison using sgkit #95

Closed szhan closed 1 year ago

szhan commented 1 year ago

Addresses #94

szhan commented 1 year ago

Thanks, @benjeffery !

Just for my understanding, should the function return an xarray.Dataset so that zarr can act upon it?

I'd just like to compare call_genotype of ds1 and ds2, so it makes sense to return ds1 and ds2 remapped with only the common sites.

szhan commented 1 year ago

Should we just do that all within remap_genotypes? Or is that too much going on for a single function?

szhan commented 1 year ago

Since sgkit xarray datasets have the variable variant_contig, the function should check both variant_contig and variant_position.
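
To illustrate matching on both contig and position, here is a minimal sketch using plain NumPy arrays in place of the dataset variables. The function name `sketch_matching_indices` and its signature are hypothetical and are not the code in this PR. It assumes that both datasets number their contigs identically and that each (contig, position) pair occurs at most once per dataset.

```python
import numpy as np


def sketch_matching_indices(contig1, pos1, contig2, pos2):
    """Return an (n, 2) array of index pairs for sites shared by two datasets.

    A site counts as shared only if both its contig and its position match.
    Hypothetical helper; assumes unique (contig, position) pairs per dataset.
    """
    # Map each (contig, position) key in the second dataset to its index.
    lookup = {
        (int(c), int(p)): j for j, (c, p) in enumerate(zip(contig2, pos2))
    }
    # Collect (index in dataset 1, index in dataset 2) for every shared key.
    pairs = [
        (i, lookup[(int(c), int(p))])
        for i, (c, p) in enumerate(zip(contig1, pos1))
        if (int(c), int(p)) in lookup
    ]
    # reshape keeps the (n, 2) shape even when there are no shared sites.
    return np.array(pairs, dtype=np.int64).reshape(-1, 2)


# Position 10 exists in both datasets, but on different contigs for the
# first site, so only two sites are shared.
contig1 = np.array([0, 0, 1])
pos1 = np.array([10, 20, 10])
contig2 = np.array([0, 1, 1])
pos2 = np.array([20, 10, 30])
idx = sketch_matching_indices(contig1, pos1, contig2, pos2)
```

Matching on position alone would wrongly pair the first site of the first dataset with the second site of the second dataset.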

szhan commented 1 year ago

I'm thinking that get_matching_indices should return an xarray.DataArray as well. That should help when analysing big data sets, right? @benjeffery

szhan commented 1 year ago

I've modified get_matching_indices to return a numpy.ndarray instead of a list.

szhan commented 1 year ago

The "compatible" ds1 and ds2 should have the same allele lists.
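
As a rough sketch of what sharing an allele list means for a single site, the hypothetical helper below re-expresses genotype indices from one dataset in terms of the other dataset's allele order. The name `sketch_remap_site` and the convention that -1 marks a missing call are assumptions for illustration, not the behaviour of the functions in this PR.

```python
import numpy as np


def sketch_remap_site(alleles_ref, alleles_other, genotypes_other):
    """Re-express one site's genotype indices in terms of alleles_ref.

    alleles_ref / alleles_other: non-empty lists of allele strings.
    genotypes_other: integer array of indices into alleles_other;
    -1 is assumed to mark a missing call and is passed through unchanged.
    """
    # Shared allele list: reference order first, then alleles seen only
    # in the other dataset.
    merged = list(alleles_ref) + [
        a for a in alleles_other if a not in alleles_ref
    ]
    # Lookup table from old allele index to new allele index.
    index_map = np.array(
        [merged.index(a) for a in alleles_other], dtype=np.int64
    )
    genotypes_other = np.asarray(genotypes_other)
    # Translate called alleles; keep missing calls as -1.
    remapped = np.where(genotypes_other >= 0, index_map[genotypes_other], -1)
    return merged, remapped


merged, remapped = sketch_remap_site(
    ["A", "C"], ["C", "A", "G"], np.array([[0, 1], [2, -1]])
)
```

Here `merged` becomes `["A", "C", "G"]`, and an index of 0 in the second dataset ("C") is rewritten as 1.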

szhan commented 1 year ago

What I'd like to do is multi-way VCF comparisons, i.e., imputed genotypes from lshmm (ds_lshmm), imputed genotypes from BEAGLE (ds_beagle), and ground-truth genotypes (ds_truth). I think the easiest way is to just make ds_lshmm and ds_beagle compatible with ds_truth by calling make_compatible_genotypes twice.
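
Once both imputed datasets have been made compatible with the truth, the comparison itself could look roughly like the sketch below. It works on plain genotype arrays of shape (variants, samples, ploidy) that are assumed to already share sites and allele coding with the truth. `sketch_concordance` is a hypothetical helper, -1 is assumed to mark a missing call, and alleles are compared position by position, which assumes phased genotypes.

```python
import numpy as np


def sketch_concordance(gt_truth, gt_imputed):
    """Fraction of called truth alleles that the imputed genotypes match."""
    gt_truth = np.asarray(gt_truth)
    gt_imputed = np.asarray(gt_imputed)
    # Ignore allele calls that are missing in the truth set.
    called = gt_truth >= 0
    return float(np.mean(gt_imputed[called] == gt_truth[called]))


# Toy arrays with 2 variants, 2 samples, ploidy 2.
gt_truth = np.array([[[0, 1], [1, 1]], [[0, 0], [1, -1]]])
gt_lshmm = np.array([[[0, 1], [1, 1]], [[0, 1], [1, 0]]])
gt_beagle = np.array([[[0, 0], [1, 0]], [[0, 0], [1, 1]]])

# One comparison per imputation method, each against the same truth.
results = {
    "lshmm": sketch_concordance(gt_truth, gt_lshmm),
    "beagle": sketch_concordance(gt_truth, gt_beagle),
}
```

Using the truth as the common reference keeps the two comparisons independent of each other, at the cost of dropping any site that is absent from the truth.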

szhan commented 1 year ago

For now, let's assume that only ACGT are allowed, to keep things simple.

szhan commented 1 year ago

Also, I still need to add tests for make_compatible_genotypes.

szhan commented 1 year ago

There are still more tests that could be added, but the current tests are good enough, I think. Will continue adding more tests in a separate issue.