tskit-dev / tskit

Population-scale genomics
MIT License

Number of variable sites (not just num_sites) #2899

Open hyanwong opened 5 months ago

hyanwong commented 5 months ago

Just a quick thought: would it be helpful to cache the number of variable sites in a tree sequence, as well as the total number of sites? I am starting to encounter cases where sites are defined but have no associated mutations, so something like ts.num_variable_sites would seem sensible. I guess it might get hairy when there are mutations but no variation, however (e.g. if the mutations are reverted, or do not change the state).

jeromekelleher commented 5 months ago

It's not trivial, for the reasons you outline. Counting sites with zero mutations is easy to do, though, and we use that somewhere else.