Closed ressy closed 6 years ago
Thank you for reporting this issue. I primarily deal with bacteria and some archea, which usually have more standard taxonomy assignments than other organisms.
If could send me a more complete example, i.e. including the root, it would be helpful in reworking the parsing code and to update the test suite. My email address is listed under my GitHub profile. Also feel free to send along any additional examples you might have handy.
I'll keep this issue updated with the progress and any additional questions I might have.
Thanks Shareef! I'll send you a couple of full kraken reports so you can see what I mean.
I have a rough draft of a function to parse these weird assignments the way I expect; if I can get that tidied up and tested properly I'll send a PR too.
There, my change in c1ac609 handles this issue for a new test case. How does that look? Using the tax_lvl_depth mutable default argument for parse_tax_lvl()
is a little weird, but let me know if that looks OK to you and I could do a pull request.
It's not much of a change in terms of number of lines, but I know it's a big change from how you currently parse the taxonomy information. I have a bunch of existing biom files from previous runs with many classifications, so I should be able to do a before-and-after comparison and spot check what changes to make absolutely sure this looks to be doing the right thing. On the other hand, if you make a different implementation, I'll be ready to try it out in the same way. (And in any case thanks for having your code laid out clearly and documented-- that helped a ton with troubleshooting things here!)
OK, I did a comparison with the kraken-biom output for one set of (very large) kraken reports. Out of 11,935 taxa present in the output biom file, 722 of them have changes somewhere in the hierarchy text. Here are the side-by-side differences for those cases as a TSV file with these columns:
That's great, thanks for all the work you've put in! From a quick glance the code looks pretty reasonable. I'll take a more detailed look and see if I want to make any changes.
I had actually been planning to make a move in this direction since it was going to be difficult to implement support for subspecies/strains/etc with my original approach. So once we get these changes finalized it shouldn't be too much more work to finally remove that TODO.
Also, I'm glad the documentation and layout was a help. This is exactly the kind of contribution I was hoping to enable.
OK, sounds good! Glad this might end up helping with a pending feature, too.
Hey @ressy! Just wanted to ping you on this issue. I finally got some time to sit down and look at things, and I even found some cases in my own data that were showing up incorrectly without this fix. So if you'd like to issue a pull request, I'd be happy to incorporate your code. There have been a couple minor additions in the meantime, but they shouldn't conflict much or at all with your edits. If you're short on time I can integrate the changes from the commit you shared. Thanks again!
Sure thing, just opened a pull request. I don't think any of your recent changes will conflict.
We've found that in some edge cases, the taxonomic hierarchy kraken-biom assigns to a given ID is incorrect. It looks like the assumption that a given rank in the Kraken report always falls under the most recent higher rank (for example, for a given "S" entry for species, the closest "P" for phylum in the previous lines in the file) is not always true.
This is with kraken-biom 1.0.1 under Python 3.5 (in Anaconda) on Linux.
Here's a chunk from an example Kraken report, where I'm specifically looking at 5693 (Trypanosoma cruzi) near the bottom.
Looking from ID 5693 on up, in terms of indentation in the last column: Kraken shows taxa up through "O", then an un-ranked taxon (Euglenozoa), and then nothing for a very long time until Eukaryota many lines above (not shown). The phylum Chlorophyta and class Trebouxiophyceae do not actually contain Trypanosoma cruzi; they're just the closest previous phylum and class shown above that species in the file. But kraken-biom's output gives this Consensus Lineage for ID 5693:
The NCBI Taxonomy Browser seems to match what I saw in Kraken, with Kingdom=Eukaryota; Unranked=Euglenozoa; Order=Kinetoplastida; Family=Trypanosomatidae; etc. (No explicit phylum or class listed.)
I can't say for sure because the Kraken documentation doesn't go into detail, but it looks to me like it's the indentation for the scientific name that corresponds to the hierarchy and to what rank sits above a given entry, and not necessarily the rank of the previous taxa listed. So in my case even though the previous "P" in the report file is Chlorophyta, that group doesn't actually include ID 5693, so we shouldn't have a phylum or class assigned.