mourisl / centrifuger

Classifier for metagenomic sequences using FM-index with run-block compressed BWT.
MIT License
50 stars 4 forks source link

GTDB.r220 taxID vs scientific name #20

Open shannonmargaret opened 3 hours ago

shannonmargaret commented 3 hours ago

I am trying to figure out how to merge the read counts in the outputs from centrifuger-kreport. I’ve separated the report by taxonomic level and now I’m trying to merge the counts across samples. I believe these are the column IDs for the kreport format.

  1. root_relAbund
  2. root_fragCount
  3. direct_fragCount
  4. rank_code
  5. taxID
  6. scientific_name

I thought I could use the scientific_name, but I’m realizing that these are not unique and there are multiple taxIDs per scientific name. I am trying to decide if it would be reasonable take the sum of root_fragCounts for all rows with the same scientific name. For example, here are the first 10 matches of Bacillota_I using the gtdb.r220 index.

The first taxID recruits most of the fragments but there are many fragments mapping to the other taxIDs with the same scientific name.

0.34 64338 0 P 10316302 Bacillota_I 0.01 1877 0 P 10315291 Bacillota_I 0.01 1609 0 P 10316628 Bacillota_I 0.00 870 0 P 10055357 Bacillota_I 0.00 857 0 P 10077854 Bacillota_I 0.00 853 0 P 10085662 Bacillota_I 0.00 587 0 P 10315035 Bacillota_I 0.00 273 0 P 10257521 Bacillota_I 0.00 239 0 P 10221036 Bacillota_I 0.00 239 0 P 10223213 Bacillota_I

Thank you for your help!

mourisl commented 2 hours ago

Thank you for finding this issue! In GTDB, the scientific name should be unique and show up only once at each taxonomy rank in the taxonomy tree. I have found the bug and fixed the code, and am rebuilding the index now. Meanwhile, taking sum based on the scientific name is a good solution. I think this issue will also affect Centrifuger's accuracy for multi-classified reads.