matsengrp / cft

Clonal family tree

Preserve ambiguous nucleotides #306

Closed eharkins closed 4 years ago

eharkins commented 4 years ago

Raxml-ng, the current standard ancestral sequence reconstruction program in CFT, infers ambiguous nucleotides. Partis can only take the nucleotides A, C, G, T, and N, so we convert all ambiguous nucleotides to N.
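As a rough illustration of that conversion (a minimal sketch, not the code CFT actually uses), the snippet below replaces every IUPAC ambiguity code with N. The function name `mask_ambiguous` and the choice to also map lowercase codes to an uppercase N are assumptions made for the example.

```python
# IUPAC nucleotide ambiguity codes other than N (each stands for 2 or 3 bases).
AMBIGUOUS_CODES = "RYSWKMBDHV"

# Translation table mapping each ambiguity code (upper or lower case) to "N".
# Mapping lowercase codes to an uppercase "N" is an assumption of this sketch.
_ALL_CODES = AMBIGUOUS_CODES + AMBIGUOUS_CODES.lower()
_TO_N = str.maketrans(_ALL_CODES, "N" * len(_ALL_CODES))


def mask_ambiguous(seq):
    """Return seq with every ambiguity code replaced by N.

    A, C, G, T, N and gap characters (-) are left untouched.
    (Hypothetical helper name, used only for illustration.)
    """
    return seq.translate(_TO_N)
```

For example, `mask_ambiguous("ACGTRY-N")` leaves the unambiguous bases, the gap, and the existing N alone and turns only `R` and `Y` into `N`.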

We checked to make sure we weren't losing too much information by doing this as follows:

We averaged the rate of inferred ambiguous nucleotides (the number of inferred ambiguous NTs, excluding - and N, divided by the total number of inferred NTs) across the 3512 clonal families for which we have used raxml-ng's ancestral reconstruction so far. I excluded N and - because many of the input sequences from partis already contain them. The result was:

0.012240963729100954 aka ~1.2%
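The calculation described above could be sketched roughly as follows. This is not the CFT implementation itself: the function names are hypothetical, and the sketch assumes that the denominator counts every character of the inferred sequences (including - and N) and that the rate is computed per family before averaging.

```python
# IUPAC ambiguity codes counted as "ambiguous"; N and - are deliberately excluded.
AMBIGUOUS_CODES = set("RYSWKMBDHV")


def ambiguous_rate(inferred_seqs):
    """Fraction of characters in one family's inferred sequences that are
    ambiguity codes (assumption: denominator includes every character)."""
    total = sum(len(seq) for seq in inferred_seqs)
    if total == 0:
        return 0.0
    ambiguous = sum(
        1 for seq in inferred_seqs for base in seq.upper() if base in AMBIGUOUS_CODES
    )
    return ambiguous / total


def mean_ambiguous_rate(families):
    """Average the per-family rate over a non-empty list of families, where
    each family is a list of inferred ancestral sequences (strings)."""
    rates = [ambiguous_rate(seqs) for seqs in families]
    return sum(rates) / len(rates)
```

With two toy families, `[["ACGR", "ACGT"], ["NN-A"]]`, the first has one ambiguous character out of eight and the second has none, so the per-family rates are 0.125 and 0.0.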

Here is how this was implemented.

eharkins commented 4 years ago

@mmshipley said:

3500 clonal families is a lot; I think for now let's not worry about it. When you've processed substantially more clonal families we can always recalculate that stat, but it seems like it doesn't occur too often, which is good.

so I'm closing this issue for now.