tskit-dev / tsinfer

Infer a tree sequence from genetic variation data.
GNU General Public License v3.0

"Ancestral state" or "ancestral allele"? #971

Closed: hyanwong closed this issue 2 months ago

hyanwong commented 2 months ago

When we create a VariantData instance, we specify ancestral_allele as an array of strings. However, when we return the sites_ancestral_allele array, it is a numpy array of indexes into the alleles list. It's a bit confusing to use "ancestral_allele" to mean a string in one context and a numerical index in another.

When devising the SampleData class, we took care to describe the ancestral strings as "ancestral states", and used "ancestral allele" to refer to an index into the alleles list (there is some discussion about this on GitHub somewhere, but I can't dig it up). Should we therefore rename the second argument of VariantData(...) to ancestral_state? This would match the tskit terminology, which is nice (but note that the VCF info fields tend to use AA as an abbreviation for "Ancestral Allele", referring to a string, so perhaps we can't win).
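To make the two meanings concrete, here is a minimal sketch in plain Python. It does not call tsinfer: the data and the helper names `states_to_indexes` and `indexes_to_states` are hypothetical, and plain lists stand in for the numpy array that tsinfer returns. It only shows how an "ancestral state" (a string) relates to an "ancestral allele" (an index into that site's alleles list).

```python
# Hypothetical stand-in data; this is NOT the tsinfer API.
# One tuple of alleles per site, e.g. as read from a VCF-like source.
sites_alleles = [
    ("A", "T"),
    ("G", "C"),
    ("T", "A", "G"),
]

# "Ancestral state": the ancestral allele of each site as a string.
ancestral_state = ["A", "C", "G"]


def states_to_indexes(alleles_per_site, states):
    """Map each ancestral state string to its index in that site's allele list."""
    return [
        alleles.index(state)
        for alleles, state in zip(alleles_per_site, states)
    ]


def indexes_to_states(alleles_per_site, indexes):
    """Inverse mapping: look up the allele string at each ancestral index."""
    return [alleles[i] for alleles, i in zip(alleles_per_site, indexes)]


# "Ancestral allele" in the index sense: a position in the alleles list.
ancestral_allele_index = states_to_indexes(sites_alleles, ancestral_state)
print(ancestral_allele_index)  # [0, 1, 2]

# Round trip back to strings.
print(indexes_to_states(sites_alleles, ancestral_allele_index))  # ['A', 'C', 'G']
```

Both lists describe the same sites, which is why using a single name for the two representations is easy to misread.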

benjeffery commented 2 months ago

I don't have a strong opinion here. I guess tsinfer should be consistent though.

hyanwong commented 2 months ago

Fixed in #963