tskit-dev / tskit

Method to remove silent mutations #1252

Open jeromekelleher opened 3 years ago

jeromekelleher commented 3 years ago

As discussed in an msprime issue (https://github.com/tskit-dev/msprime/pull/1548#issuecomment-801165185) it would be useful to have a method to remove silent mutations.

Snags:

  1. Which mutation do we keep?
  2. What do we do with metadata? (options: ignore entirely, only consider mutations silent if their metadata is identical, ...)

There is also the possibility of adding this to the canonicalise operation, but on reflection I think maybe not (mostly because of the metadata question).

petrelharp commented 3 years ago

> Which mutation do we keep?

Any mutation where derived_state != previous state.

But - I'm not sure we even want to provide this method? The only semi-legit use case I can think of is if someone simulates from some strange model where there's lots and lots of silent mutations, and wants to remove them for efficiency. For sure people might think they want it in other situations, but I'm not convinced.

jeromekelleher commented 3 years ago

> Which mutation do we keep? Any mutation where derived_state != previous state.

Suppose we have a chain of mutations A -> A -> ... -> A. Do we keep the first or the last one?

The method is pretty low-priority for me too, just opening this issue as a way to track the discussion.

petrelharp commented 3 years ago

> Suppose we have a chain of mutations A -> A -> ... -> A. Do we keep the first or the last one?

If the first A is the ancestral state, then we keep none of them! If it's T -> A -> A -> ... then we keep only the first one.
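
To make the rule above concrete, here is a minimal pure-Python sketch for a single site. It is not tskit's API: `Mut`, `NULL` and `remove_silent` are hypothetical names invented for illustration, mutations are assumed to be listed with parents before children (as in a mutation table), and metadata is ignored entirely (the first of the options listed in the opening comment). A mutation is dropped when its derived state equals the state it replaces, and each surviving mutation is re-pointed at its nearest surviving ancestor.

```python
from typing import List, NamedTuple

NULL = -1  # hypothetical stand-in for "no parent mutation"


class Mut(NamedTuple):
    """Illustrative mutation record for one site (not a tskit class)."""
    parent: int         # index of the parent mutation in the list, or NULL
    derived_state: str


def remove_silent(ancestral_state: str, mutations: List[Mut]) -> List[Mut]:
    """Drop mutations whose derived state equals the state they replace.

    Assumes every parent appears before its children in `mutations`.
    Metadata is not considered at all in this sketch.
    """
    kept: List[Mut] = []
    # For each original mutation: index (in `kept`) of the nearest surviving
    # mutation at or above it, or NULL if none survives.
    nearest_kept: List[int] = []
    for mut in mutations:
        if mut.parent == NULL:
            previous_state = ancestral_state
            kept_parent = NULL
        else:
            # A silent parent has the same state as its own parent, so reading
            # the original parent's derived state is still correct.
            previous_state = mutations[mut.parent].derived_state
            kept_parent = nearest_kept[mut.parent]
        if mut.derived_state == previous_state:
            # Silent: drop it and pass the surviving ancestor down the chain.
            nearest_kept.append(kept_parent)
        else:
            nearest_kept.append(len(kept))
            kept.append(Mut(kept_parent, mut.derived_state))
    return kept


# Ancestral state A, chain A -> A -> A: nothing is kept.
all_silent = remove_silent("A", [Mut(NULL, "A"), Mut(0, "A"), Mut(1, "A")])

# Ancestral state T, chain T -> A -> A -> A: only the first mutation is kept.
first_only = remove_silent("T", [Mut(NULL, "A"), Mut(0, "A"), Mut(1, "A")])
```

Under this rule the "first or last" question does not arise: in a run of identical states only the mutation that actually changes the state survives, and a later non-silent mutation is re-parented onto it.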