donovan-h-parks closed this issue 8 years ago
Hello Donovan!
Is J an ambiguous base? I haven't heard of it: http://www.bioinformatics.org/sms/iupac.html
Guess it depends on who you ask: http://www.insdc.org/files/feature_table.html#7.4.3
Also, it appears in a non-trivial number of genomes from GenBank. :)
Hey @dparks1134 --
Thanks for your patience with this. I have implemented this, and as a diagnostic I have the following table:
   'A' 'R' 'N' 'D' 'C' 'Q' 'E' 'G' 'H' 'I' 'L' 'K' 'M' 'F' 'P' 'S' 'T' 'W' 'Y' 'V'
B {0.; 0.; 1.; 1.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.}
J {0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 1.; 1.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.}
Z {0.; 0.; 0.; 0.; 0.; 1.; 1.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.}
This is how I interpret the ambiguity codes. Does that look right to you?
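For anyone who wants to check the table by hand, here is a minimal Python sketch of the same interpretation. It is not the pplacer code (pplacer is written in OCaml); the names `AMINO_ACIDS`, `AMBIGUITY`, and `indicator_vector` are made up for illustration, and the column order is simply copied from the header row above.

```python
# Column order copied from the header row of the diagnostic table.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Ambiguity codes and the residues each one may stand for,
# as interpreted in this thread.
AMBIGUITY = {
    "B": "ND",  # Asn or Asp
    "J": "IL",  # Ile or Leu
    "Z": "QE",  # Gln or Glu
}


def indicator_vector(code):
    """Return a 0/1 vector marking the residues compatible with `code`.

    Unambiguous residues give a single 1.0; characters that are neither
    a residue nor a key of AMBIGUITY give all zeros in this sketch.
    """
    allowed = set(AMBIGUITY.get(code, code))
    return [1.0 if aa in allowed else 0.0 for aa in AMINO_ACIDS]


if __name__ == "__main__":
    for code in "BJZ":
        print(code, indicator_vector(code))
```

Each of the three rows should come out with exactly two entries set to 1.0, in the N/D, I/L, and Q/E columns respectively.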
I should also note that, as you can see in https://github.com/matsen/pplacer/commit/be77c7d07c24b0e7280dd5b7fb8eafab1d0a1544, I was for some reason previously interpreting B as a synonym for N and Z as a synonym for Q. So this change will make a minor difference in analyses where those letters appear.
Could you pull the new branch and give it a spin?
Thanks! Table looks good to me. I'm a bit swamped at the moment, but should be able to give this a spin in the next few weeks.
No worries! Just let me know if it looks good and I'll merge.
Can you send me the binaries for this new release? We don't have a build environment for pplacer.
We compiled the latest code and it appears to work great. I'd say make it official!
https://github.com/matsen/pplacer/releases/tag/v1.1.alpha18 <- here's the new release.
pplacer currently does not handle ambiguous bases. I appreciate that from an ML perspective fully handling such characters is challenging. However, I am wondering whether ambiguous bases could simply be treated as unknowns, with a warning generated. This would seem preferable to raising an exception that prevents such sequences from being inserted into a tree:
Such situations are extremely problematic when processing large data sets where quality control over the input sequences can be challenging.
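To make the request concrete, the fallback could look roughly like the Python sketch below. This is only an illustration of the idea, not pplacer code: `site_vector` and `UNKNOWN_CHARS` are hypothetical names, the 20-letter amino-acid alphabet is an assumption, and the all-ones vector is the usual convention for a site that is compatible with every state.

```python
import warnings

# Assumed alphabet for illustration (20 standard amino acids).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# Characters assumed to already mean "unknown", so they trigger no warning.
UNKNOWN_CHARS = set("-?X")


def site_vector(char):
    """Return a per-site 0/1 vector for one sequence character.

    Recognized residues get a single 1.0. Anything else is treated as
    unknown (all 1.0), with a warning for unexpected characters rather
    than an exception.
    """
    if char in set(AMINO_ACIDS):
        return [1.0 if aa == char else 0.0 for aa in AMINO_ACIDS]
    if char not in UNKNOWN_CHARS:
        warnings.warn(f"treating unrecognized character {char!r} as unknown")
    return [1.0] * len(AMINO_ACIDS)
```

With something like this, a stray `J` in one input sequence would produce a warning and an uninformative site, and the remaining sequences would still be processed.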