matsen / pplacer

Phylogenetic placement and downstream analysis
http://matsen.fredhutch.org/pplacer/
GNU General Public License v3.0

Handling of ambiguous amino acids #346

Closed: donovan-h-parks closed this issue 8 years ago

donovan-h-parks commented 8 years ago

pplacer currently does not handle ambiguous bases. I appreciate that fully handling such characters is challenging from an ML perspective. However, I am wondering whether ambiguous bases could simply be treated as unknowns, with a warning generated. This would seem preferable to raising a fatal exception that prevents such sequences from being inserted into a tree:

Uncaught exception: Failure("J is not a known base in GCA_000389905.1_ASM38990v1_protein")
Fatal error: exception Failure("J is not a known base in GCA_000389905.1_ASM38990v1_protein")
Uncaught exception: Sys_error("./bacteria/chunk0/storage/tree/concatenated.pplacer.json: No such file or directory")
Fatal error: exception Sys_error("./bacteria/chunk0/storage/tree/concatenated.pplacer.json: No such file or directory")

Such situations are extremely problematic when processing large data sets where quality control over the input sequences can be challenging.
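
The requested behavior (warn and treat the character as unknown instead of failing) might be sketched as follows. This is a minimal Python illustration, not pplacer's own code, which is written in OCaml. The function name `likelihood_vector`, the residue ordering, and the choice of `-`, `?`, and `X` as characters that are silently treated as unknown are all assumptions made for the example.

```python
import warnings

# Assumed ordering of the 20 standard amino acids (illustrative only).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Characters assumed to mean "unknown" without needing a warning.
SILENT_UNKNOWNS = "-?X"


def likelihood_vector(symbol, seq_name):
    """Return a 0/1 vector over AMINO_ACIDS for one alignment character.

    A known residue gets a single 1. Anything else is treated as fully
    unknown (all 1s); unrecognized characters also emit a warning
    instead of raising an exception.
    """
    if symbol in AMINO_ACIDS:
        return [1.0 if aa == symbol else 0.0 for aa in AMINO_ACIDS]
    if symbol not in SILENT_UNKNOWNS:
        warnings.warn(
            f"{symbol} is not a known base in {seq_name}; treating it as unknown"
        )
    return [1.0] * len(AMINO_ACIDS)
```

With this approach a sequence containing `J` would still be placed, at the cost of that column contributing no information for that sequence.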

matsen commented 8 years ago

Hello Donovan!

Is J an ambiguous base? I haven't heard of it: http://www.bioinformatics.org/sms/iupac.html

donovan-h-parks commented 8 years ago

Guess it depends on who you ask: http://www.insdc.org/files/feature_table.html#7.4.3

Also, it appears in a non-trivial number of genomes from GenBank. :)

matsen commented 8 years ago

Hey @dparks1134 --

Thanks for your patience with this. I have implemented this, and as a diagnostic I have the following table:

  'A' 'R' 'N' 'D' 'C' 'Q' 'E' 'G' 'H' 'I' 'L' 'K' 'M' 'F' 'P' 'S' 'T' 'W' 'Y' 'V'
B {0.; 0.; 1.; 1.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.}
J {0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 1.; 1.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.}
Z {0.; 0.; 0.; 0.; 0.; 1.; 1.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.}

This is how I interpret the ambiguity codes. Does that look right to you?

I should also note that, as you can see in https://github.com/matsen/pplacer/commit/be77c7d07c24b0e7280dd5b7fb8eafab1d0a1544, for some reason I was previously interpreting B as a synonym for N and Z as a synonym for Q. So making this change will make a minor difference in some folks' analyses in which those letters appear.
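
For readers who want to check the table by hand, the interpretation above can be reproduced with a short Python sketch. It is an illustration only and not the code in the linked commit: the names `AMBIGUITY`, `OLD_SYNONYMS`, `indicator`, and `encode` are hypothetical, and the residue order is copied from the table header.

```python
# Residue order taken from the header row of the diagnostic table.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Ambiguity codes as interpreted in the table: each code allows two residues.
AMBIGUITY = {"B": "ND", "J": "IL", "Z": "QE"}

# The earlier interpretation described in the thread: B as N, Z as Q.
OLD_SYNONYMS = {"B": "N", "Z": "Q"}


def indicator(residues):
    """Return a 0/1 vector with a 1 for every residue in `residues`."""
    return [1.0 if aa in residues else 0.0 for aa in AMINO_ACIDS]


def encode(symbol, table=AMBIGUITY):
    """Encode one character, expanding ambiguity codes through `table`."""
    residues = table.get(symbol, symbol)
    if any(r not in AMINO_ACIDS for r in residues):
        raise ValueError(f"{symbol} is not a known base")
    return indicator(residues)


if __name__ == "__main__":
    for code in "BJZ":
        print(code, encode(code))
```

Calling `encode("B", OLD_SYNONYMS)` gives a vector with a single 1 at N, which shows how the earlier behavior differs from the two-residue rows in the table.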

Could you pull the new branch and give it a spin?

donovan-h-parks commented 8 years ago

Thanks! Table looks good to me. I'm a bit swamped at the moment, but should be able to give this a spin in the next few weeks.

matsen commented 8 years ago

No worries! Just let me know if it looks good and I'll merge.

donovan-h-parks commented 8 years ago

Can you send me the binaries for this new release? We don't have a build environment for pplacer.

donovan-h-parks commented 8 years ago

We compiled the latest code and it appears to work great. I'd say make it official!

matsen commented 8 years ago

https://github.com/matsen/pplacer/releases/tag/v1.1.alpha18 <- here's the new release.