nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
https://clades.nextstrain.org
MIT License

Allow deletions and insertions in virus properties input files #1126

Open ammaraziz opened 1 year ago

ammaraziz commented 1 year ago

Virus properties only support labeling of substitutions. Some mutations of interest, such as those linked to antiviral resistance or those in the N gene, are deletions.

My use case for virus properties might be outside the scope of what was envisioned. Essentially, I have a set of genetic mutations (SNPs, indels) that are linked to some phenotype (antiviral resistance, N gene mutations affecting RATs) that I am interested in. I would like to use Nextclade to identify these mutations.

Thanks!

corneliusroemer commented 1 year ago

Thanks for the suggestion. That's a reasonable extension. One limitation we have at the moment is that we don't have a "private deletions" feature yet.

Labeled mutations are a subset of private mutations, so to keep with the logic we'd first need to support private deletions. Not at all unreasonable and actually something I've been thinking about as well.

One challenge with deletions is that if you keep them as single bases, there can be overwhelmingly many of them. If you make them ranges, artefacts that shift an indel by a single base can make a large difference.

I'm leaning towards using ranges nonetheless.

To be symmetric, it would make sense to add private insertions.
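
To illustrate the single-base versus range trade-off, here is a minimal Python sketch. It is not Nextclade code: the function name `collapse_to_ranges` and all coordinates are hypothetical, chosen only to show how two deletions that differ by a one-base shift share most of their deleted positions but share no range.

```python
from typing import Iterable, List, Tuple


def collapse_to_ranges(positions: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse deleted positions into inclusive (start, end) ranges."""
    ranges: List[Tuple[int, int]] = []
    for pos in sorted(set(positions)):
        if ranges and pos == ranges[-1][1] + 1:
            # Position is adjacent to the last range: extend it by one base.
            ranges[-1] = (ranges[-1][0], pos)
        else:
            # Gap found: start a new single-base range.
            ranges.append((pos, pos))
    return ranges


# Hypothetical deleted positions in two sequences; the second deletion is
# the same length but shifted by one base (e.g. an alignment artefact).
seq_a = range(100, 109)  # positions 100..108
seq_b = range(101, 110)  # positions 101..109

ranges_a = collapse_to_ranges(seq_a)
ranges_b = collapse_to_ranges(seq_b)

# Per-base comparison: most positions still agree.
shared_bases = set(seq_a) & set(seq_b)

# Per-range comparison: exact matching finds nothing in common.
shared_ranges = set(ranges_a) & set(ranges_b)

print(len(seq_a), "single-base deletions collapse to", ranges_a)
print("shared bases:", len(shared_bases), "shared ranges:", len(shared_ranges))
```

Ranges keep the output compact (one entry instead of nine here), at the cost that exact range matching is sensitive to small boundary shifts.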

corneliusroemer commented 1 year ago

Having read your ubio post @ammaraziz, I think you are suggesting two things:

  1. Extend the concept of labeled private mutations (substitutions) to labeled private indels. For this, we first need to introduce the concept/feature of private indels, then split the private indels into "reversions", "labeled", and "neither of the two", just as we currently do with substitutions.

  2. Extend the concept of labeled mutations beyond private mutations (private mutations are those that differ with respect to the nearest neighbour sequence on the reference tree). This makes sense when you are looking at broader trends and are less concerned with whether an individual sequence is of good quality (private mutations originated as a QC metric, and the labeled/reversion feature is an extension that still comes from the QC perspective).
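
As a rough sketch of what point 1 could look like for deletions, the Python below splits deletion ranges into the three groups. This is an assumption-laden illustration, not how Nextclade implements anything: the function `split_private_deletions`, the representation of deletions as inclusive `(start, end)` tuples, the exact-match rule, the reading of a "reversion" as a deletion present in the neighbour but absent in the query, and all coordinates and label names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

DelRange = Tuple[int, int]  # inclusive (start, end), hypothetical representation


def split_private_deletions(
    query: Set[DelRange],
    neighbour: Set[DelRange],
    labels: Dict[DelRange, List[str]],
) -> Dict[str, object]:
    """Split deletions that differ from the nearest neighbour into three groups."""
    # Private: deleted in the query but not in the nearest neighbour.
    private = query - neighbour
    # Assumed meaning of "reversion": the neighbour has the deletion,
    # the query does not.
    reversions = sorted(neighbour - query)
    # Labeled: private deletions that exactly match an entry in the label table.
    labeled = {r: labels[r] for r in sorted(private) if r in labels}
    # Neither of the two: remaining private deletions.
    unlabeled = sorted(r for r in private if r not in labels)
    return {"reversions": reversions, "labeled": labeled, "unlabeled": unlabeled}


# Hypothetical inputs.
query = {(100, 108), (300, 300), (500, 502)}
neighbour = {(500, 502), (900, 905)}
labels = {(100, 108): ["hypothetical-label"]}

result = split_private_deletions(query, neighbour, labels)
print(result)
```

Because the sketch matches ranges exactly, a deletion shifted by one base would fall into the unlabeled group; a real implementation might need a tolerance or overlap rule.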

We already calculate some custom global metrics (not restricted to private mutations), such as:

Reporting the presence of certain mutations (irrespective of whether they are private or not) for antiviral resistance, RAT escape, etc. would make sense.
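
A minimal Python sketch of such a report, under stated assumptions: the function `report_mutations_of_interest`, the `sub:`/`del:` string notation, and every mutation and phenotype name below are hypothetical placeholders, not Nextclade's input format or real resistance markers. The idea is simply to look up all mutations relative to the reference in a user-supplied table, ignoring whether they are private.

```python
from typing import Dict, Iterable, List


def report_mutations_of_interest(
    observed: Iterable[str],
    table: Dict[str, str],
) -> Dict[str, List[str]]:
    """Group observed mutations by the phenotype they are linked to in the table."""
    report: Dict[str, List[str]] = {}
    for mutation in sorted(set(observed)):
        phenotype = table.get(mutation)
        if phenotype is not None:
            report.setdefault(phenotype, []).append(mutation)
    return report


# Hypothetical table of mutations of interest (substitutions and deletions).
table = {
    "del:100-108": "hypothetical-phenotype-1",
    "sub:C456G": "hypothetical-phenotype-2",
    "sub:G789A": "hypothetical-phenotype-2",
}

# Hypothetical mutations of one sequence relative to the reference,
# private or not.
observed = ["sub:A123T", "del:100-108", "sub:C456G"]

report = report_mutations_of_interest(observed, table)
print(report)
```

Mutations absent from the table (here `sub:A123T`) are simply not reported, and table entries that are not observed (here `sub:G789A`) do not appear either.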