smdabdoub / kraken-biom

Create BIOM-format tables (http://biom-format.org) from Kraken output (http://ccb.jhu.edu/software/kraken/, https://github.com/DerrickWood/kraken).
MIT License
47 stars 15 forks source link

Taxonomic hierarchy incorrect for certain cases #4

Closed ressy closed 6 years ago

ressy commented 7 years ago

We've found that in some edge cases, the taxonomic hierarchy kraken-biom assigns to a given ID is incorrect. It looks like the assumption that a given rank in the Kraken report always falls under the most recent higher rank (for example, for a given "S" entry for species, the closest "P" for phylum in the previous lines in the file) is not always true.

This is with kraken-biom 1.0.1 under Python 3.5 (in Anaconda) on Linux.

Here's a chunk from an example Kraken report, where I'm specifically looking at 5693 (Trypanosoma cruzi) near the bottom.

...
  0.00  1       0       P       3041            Chlorophyta
  0.00  1       1       C       75966             Trebouxiophyceae
  0.00  2       0       -       556282        Jakobida
  0.00  2       0       G       221723          Seculamonas
  0.00  2       2       S       221724            Seculamonas ecuadoriensis
  0.00  1       0       -       33682         Euglenozoa
  0.00  1       0       O       5653            Kinetoplastida
  0.00  1       0       F       5654              Trypanosomatidae
  0.00  1       0       G       5690                Trypanosoma
  0.00  1       0       -       47570                 Schizotrypanum
  0.00  1       0       S       5693                    Trypanosoma cruzi
  0.00  1       1       -       353153                    Trypanosoma cruzi strain CL Brener
...

Looking from ID 5693 on up, in terms of indentation in the last column: Kraken shows taxa up through "O", then an un-ranked taxon (Euglenozoa), and then nothing for a very long time until Eukaryota many lines above (not shown). The phylum Chlorophyta and class Trebouxiophyceae do not actually contain Trypanosoma cruzi; they're just the closest previous phylum and class shown above that species in the file. But kraken-biom's output gives this Consensus Lineage for ID 5693:

k__Eukaryota; p__Chlorophyta; c__Trebouxiophyceae; o__Kinetoplastida; f__Trypanosomatidae; g__Trypanosoma; s__cruzi

The NCBI Taxonomy Browser seems to match what I saw in Kraken, with Kingdom=Eukaryota; Unranked=Euglenozoa; Order=Kinetoplastida; Family=Trypanosomatidae; etc. (No explicit phylum or class listed.)

I can't say for sure because the Kraken documentation doesn't go into detail, but it looks to me like it's the indentation for the scientific name that corresponds to the hierarchy and to what rank sits above a given entry, and not necessarily the rank of the previous taxa listed. So in my case even though the previous "P" in the report file is Chlorophyta, that group doesn't actually include ID 5693, so we shouldn't have a phylum or class assigned.

smdabdoub commented 7 years ago

Thank you for reporting this issue. I primarily deal with bacteria and some archea, which usually have more standard taxonomy assignments than other organisms.

If could send me a more complete example, i.e. including the root, it would be helpful in reworking the parsing code and to update the test suite. My email address is listed under my GitHub profile. Also feel free to send along any additional examples you might have handy.

I'll keep this issue updated with the progress and any additional questions I might have.

ressy commented 7 years ago

Thanks Shareef! I'll send you a couple of full kraken reports so you can see what I mean.

I have a rough draft of a function to parse these weird assignments the way I expect; if I can get that tidied up and tested properly I'll send a PR too.

ressy commented 7 years ago

There, my change in c1ac609 handles this issue for a new test case. How does that look? Using the tax_lvl_depth mutable default argument for parse_tax_lvl() is a little weird, but let me know if that looks OK to you and I could do a pull request.

It's not much of a change in terms of number of lines, but I know it's a big change from how you currently parse the taxonomy information. I have a bunch of existing biom files from previous runs with many classifications, so I should be able to do a before-and-after comparison and spot check what changes to make absolutely sure this looks to be doing the right thing. On the other hand, if you make a different implementation, I'll be ready to try it out in the same way. (And in any case thanks for having your code laid out clearly and documented-- that helped a ton with troubleshooting things here!)

ressy commented 7 years ago

OK, I did a comparison with the kraken-biom output for one set of (very large) kraken reports. Out of 11,935 taxa present in the output biom file, 722 of them have changes somewhere in the hierarchy text. Here are the side-by-side differences for those cases as a TSV file with these columns:

smdabdoub commented 7 years ago

That's great, thanks for all the work you've put in! From a quick glance the code looks pretty reasonable. I'll take a more detailed look and see if I want to make any changes.

I had actually been planning to make a move in this direction since it was going to be difficult to implement support for subspecies/strains/etc with my original approach. So once we get these changes finalized it shouldn't be too much more work to finally remove that TODO.

Also, I'm glad the documentation and layout was a help. This is exactly the kind of contribution I was hoping to enable.

ressy commented 7 years ago

OK, sounds good! Glad this might end up helping with a pending feature, too.

smdabdoub commented 6 years ago

Hey @ressy! Just wanted to ping you on this issue. I finally got some time to sit down and look at things, and I even found some cases in my own data that were showing up incorrectly without this fix. So if you'd like to issue a pull request, I'd be happy to incorporate your code. There have been a couple minor additions in the meantime, but they shouldn't conflict much or at all with your edits. If you're short on time I can integrate the changes from the commit you shared. Thanks again!

ressy commented 6 years ago

Sure thing, just opened a pull request. I don't think any of your recent changes will conflict.