husonlab / megan-ce

MEGAN Community Edition
GNU General Public License v3.0

Indexing issue with Biom files? #7

Open andrewjmc opened 5 years ago

andrewjmc commented 5 years ago

I am importing Biom files into MEGAN that were made from Kraken reports using kraken-biom (https://github.com/smdabdoub/kraken-biom).

I have noticed that in one case counts are not assigned to the correct OTU and a species is missing.

In my Kraken report, I have the following lines:

25.56  211687  **3**       G       10509           Mastadenovirus
25.56  **211678**  124056  S       129951            Human mastadenovirus C

These lines indicate three reads assigned directly to the Mastadenovirus genus and 211,678 reads (including the subspecies level) assigned to Human mastadenovirus C. kraken-biom correctly makes the .biom file, with data:

...[2095,0,3.0],[2096,0,211678.0]...

And I confirm that the 2095th and 2096th (0-offset) elements of rows are:

...
{"id": "10509", "metadata": {"taxonomy": ["k__Viruses", "p__", "c__", "o__", "f__Adenoviridae", "g__Mastadenovirus", "s__"]}},
{"id": "129951", "metadata": {"taxonomy": ["k__Viruses", "p__", "c__", "o__", "f__Adenoviridae", "g__Mastadenovirus", "s__Human mastadenovirus C"]}}
...
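For anyone who wants to repeat this check, here is a minimal Python sketch of how I understand the lookup: each sparse data entry is [row index, column index, value], and the row index points into rows. It assumes the file is BIOM 1.0 JSON with a sparse matrix. The function name, the sample id "sample1", and the two-row stand-in table are made up for illustration (in the real file these rows sit at indices 2095 and 2096).

```python
import json


def sparse_counts_by_row(biom, sample_index=0):
    """Map each row id to its count for one sample of a sparse BIOM 1.0 table."""
    rows = biom["rows"]
    counts = {}
    # Each sparse entry is [row_index, column_index, value], all 0-offset.
    for row_idx, col_idx, value in biom["data"]:
        if col_idx == sample_index:
            counts[rows[row_idx]["id"]] = value
    return counts


# Hypothetical stand-in table holding only the two rows quoted above.
example = {
    "matrix_type": "sparse",
    "rows": [
        {"id": "10509", "metadata": {"taxonomy": [
            "k__Viruses", "p__", "c__", "o__",
            "f__Adenoviridae", "g__Mastadenovirus", "s__"]}},
        {"id": "129951", "metadata": {"taxonomy": [
            "k__Viruses", "p__", "c__", "o__",
            "f__Adenoviridae", "g__Mastadenovirus",
            "s__Human mastadenovirus C"]}},
    ],
    "columns": [{"id": "sample1", "metadata": None}],  # assumed sample id
    "data": [[0, 0, 3.0], [1, 0, 211678.0]],
}

counts = sparse_counts_by_row(example)
print(counts)

# For a real file, assuming it is BIOM 1.0 JSON:
#     with open("exemplar_biom.txt") as fh:
#         counts = sparse_counts_by_row(json.load(fh))
```

Keying the result by row id rather than by position makes it easy to compare directly against the taxon IDs in the Kraken report.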

However, MEGAN6 6.12.5 assigns 211,687 reads to Mastadenovirus and, intriguingly, I cannot even uncollapse Mastadenovirus to reveal Human mastadenovirus C.

Nonetheless, Neisseria sicca comes out fine:

30.54  252935  6038    G       482                   Neisseria
 22.06  182723  182723  S       490                     Neisseria sicca

182,723 reads to the species and 6,038 to the genus. This is again correctly recorded in the Biom:

[2,0,6038.0],[3,0,182723.0]

Where elements 2 and 3 (0-offset) are indeed the pair we want:

{"id": "482", "metadata": {"taxonomy": ["k__Bacteria", "p__Proteobacteria", "c__Betaproteobacteria", "o__Neisseriales", "f__Neisseriaceae", "g__Neisseria", "s__"]}},{"id": "490", "metadata": {"taxonomy": ["k__Bacteria", "p__Proteobacteria", "c__Betaproteobacteria", "o__Neisseriales", "f__Neisseriaceae", "g__Neisseria", "s__sicca"]}}
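To make the report columns explicit, here is a rough Python sketch that parses the four report lines quoted in this issue. It assumes the standard Kraken report layout: percentage of reads in the clade, reads in the clade, reads assigned directly to the taxon, rank code, NCBI taxon ID, and name. The function name is my own, and I have dropped the ** emphasis markers from the quoted lines. In both examples the value in the .biom matches the direct count for the genus row (3 and 6038) and the clade count for the species row (211678 and 182723).

```python
def parse_kraken_report_line(line):
    """Split one Kraken report line into its six standard columns."""
    # Split on runs of whitespace at most 5 times so that names containing
    # spaces (e.g. "Human mastadenovirus C") stay in one piece.
    percent, clade_reads, direct_reads, rank, taxid, name = line.split(None, 5)
    return {
        "percent": float(percent),
        "clade_reads": int(clade_reads),    # reads in this taxon and below
        "direct_reads": int(direct_reads),  # reads assigned to this taxon itself
        "rank": rank,
        "taxid": taxid,
        "name": name.strip(),
    }


# The report lines quoted in this issue, without the ** emphasis markers.
report_lines = [
    "25.56  211687  3       G       10509           Mastadenovirus",
    "25.56  211678  124056  S       129951            Human mastadenovirus C",
    "30.54  252935  6038    G       482                   Neisseria",
    " 22.06  182723  182723  S       490                     Neisseria sicca",
]

parsed = [parse_kraken_report_line(line) for line in report_lines]
for rec in parsed:
    # Reads that belong to descendants rather than to the taxon itself.
    below = rec["clade_reads"] - rec["direct_reads"]
    print(rec["taxid"], rec["name"], rec["direct_reads"], below)
```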

I have attached the file in case this helps understand the problem. I have also confirmed the assignments are correct when I read the biom file into R with the biomformat package.

exemplar_biom.txt

Many thanks,

Andrew