lenaschimmel / sc2rf

SARS-Cov-2 Recombinant Finder for fasta sequences
MIT License

ENH: Differentiate between clade defining mutations and optional mutations #15

Open corneliusroemer opened 2 years ago

corneliusroemer commented 2 years ago

If I understand your script correctly, you treat all mutations that are above the user-specified threshold identically.

There's room for improvement there.

It would make sense to use two kinds of mutation types for each clade:

  1. Defining mutations that should be present in (almost) all sequences of a clade, so maybe all those mutations present in >95% of the clade's sequences. If these are absent, it means there's a problem either with sequence quality or something else. Absence should count strongly against the clade.
  2. Common mutations that sometimes occur, but whose absence does not mean much. Rather, the presence of these mutations increases the probability of a sequence belonging to the clade.

Do you know what I mean? One threshold does not suffice for both concepts.
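To make the two-category idea concrete, here is a minimal sketch. It is not how sc2rf works today: the function names, the 0.20 lower cutoff, and the placeholder mutation names are all assumptions for illustration. Only the roughly 95% cutoff for defining mutations comes from the suggestion above.

```python
from typing import Dict, List, Set, Tuple

DEFINING_MIN = 0.95  # cutoff suggested above for "defining" mutations
COMMON_MIN = 0.20    # hypothetical lower cutoff for "common" mutations


def split_mutations(
    freqs: Dict[str, float],
    defining_min: float = DEFINING_MIN,
    common_min: float = COMMON_MIN,
) -> Tuple[Set[str], Set[str]]:
    """Split a clade's mutations into defining and common sets by frequency."""
    defining = {m for m, f in freqs.items() if f >= defining_min}
    common = {m for m, f in freqs.items() if common_min <= f < defining_min}
    return defining, common


def score_sequence(
    observed: Set[str], defining: Set[str], common: Set[str]
) -> Dict[str, List[str]]:
    """Report evidence against the clade (missing defining mutations)
    separately from evidence for it (common mutations that are present)."""
    return {
        "missing_defining": sorted(defining - observed),
        "supporting_common": sorted(common & observed),
    }


# Placeholder mutation names and made-up frequencies, for illustration only.
freqs = {"m1": 0.99, "m2": 0.97, "m3": 0.40, "m4": 0.05}
defining, common = split_mutations(freqs)
result = score_sequence({"m1", "m3"}, defining, common)
```

Keeping the two lists separate lets a caller penalise a missing defining mutation heavily while treating an absent common mutation as neutral.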

I'll think a bit more about recombinant detection myself - maybe there are further improvements possible. This is an amazing tool already, though!

lenaschimmel commented 2 years ago

Yeah, I absolutely get what you mean. I know this is not ideal, and having two thresholds would already be a big improvement. Maybe I will change it that way soon.

On the other hand, I have a (still very vague) concept of probability computations in my head, that would be even more powerful and need no hard thresholds at all. It would also affect the way that breakpoints (and intermissions, if they will still exist) are handled and the way the output is displayed. Maybe that's more like version 2.0 of this tool, nothing for the near future.

I'll keep thinking about it!

PS: That probability stuff might be a lot of hard work, but since I worked on this kind of probability computation in my previous project Dystonse, that doesn't scare me any more.

corneliusroemer commented 2 years ago

I think I know what you mean, something like max likelihood and/or naive Bayes could be applicable here.
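A rough sketch of what a naive Bayes style score could look like, assuming per-clade mutation frequencies are available and that mutations are treated as independent (which is the "naive" part). The function names, the `eps` clamp, and the example numbers are hypothetical and not taken from sc2rf.

```python
import math
from typing import Dict, Set, Tuple


def log_likelihood(
    observed: Set[str],
    clade_freqs: Dict[str, float],
    all_sites: Set[str],
    eps: float = 0.01,
) -> float:
    """Log-likelihood of the observed mutation set under one clade.

    Each mutation is modelled as an independent Bernoulli variable whose
    probability is its frequency in the clade, clamped to [eps, 1 - eps]
    so that a single unexpected call never yields log(0).
    """
    total = 0.0
    for m in all_sites:
        p = min(max(clade_freqs.get(m, 0.0), eps), 1.0 - eps)
        total += math.log(p if m in observed else 1.0 - p)
    return total


def most_likely_clade(
    observed: Set[str], clades: Dict[str, Dict[str, float]], eps: float = 0.01
) -> Tuple[str, Dict[str, float]]:
    """Return the clade with the highest log-likelihood, plus all scores."""
    all_sites = set().union(*(f.keys() for f in clades.values()))
    scores = {
        name: log_likelihood(observed, f, all_sites, eps)
        for name, f in clades.items()
    }
    return max(scores, key=scores.get), scores


# Made-up clades and frequencies, for illustration only.
clades = {
    "X": {"m1": 0.99, "m2": 0.50},
    "Y": {"m3": 0.98, "m2": 0.10},
}
best, scores = most_likely_clade({"m1", "m2"}, clades)
```

With this kind of score no hard threshold is needed: a missing high-frequency mutation costs a lot, while a missing 50% mutation costs little. Detecting recombinants would still require comparing scores across segments of the genome, which this sketch does not attempt.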

maciekboni commented 2 years ago

Hi both - I can walk you through over a Zoom call what the Delta_(m,n,2) statistic gets you and how it's constructed. It's non-parametric and you won't need to set thresholds. Also, a table of the p-values is pre-built (this is the computationally expensive part), so you just look them up as you need them.