virus-evolution / gofasta

How does closest measure IUPAC codes? #32

tseemann closed this issue 2 years ago

tseemann commented 2 years ago

The closest command measures "raw genetic distance".

I assume this ignores deletions?

How does it count IUPAC codes? eg. N vs A, or N vs R, or A vs R ?

Any help appreciated.

benjamincjackson commented 2 years ago

Hi Torsten,

IUPAC codes are treated as the set of nucleotides that they represent; the raw distance is calculated as the number of nucleotide differences per site, like so:

           certainly different
-----------------------------------------
(certainly different + certainly the same)

A vs S: counts +1 in both the numerator and the denominator
A vs W: doesn't count anywhere
A vs A: +1 in the denominator only, obviously
W vs W: doesn't count anywhere

Deletions are ignored as I believe is standard when calculating nucleotide distances. (In the code they are treated just like Ns.)

I will add better documentation and probably some vignettes for usage, starting this week. This thing is being actively developed and I am very open to feature suggestions or pull requests, etc., too.