GTDB.r220 taxID vs scientific name

I am trying to figure out how to merge the read counts in the outputs from centrifuger-kreport. I’ve separated the report by taxonomic level and now I’m trying to merge the counts across samples. I believe these are the column IDs for the kreport format.

root_relAbund
root_fragCount
direct_fragCount
rank_code
taxID
scientific_name

I thought I could use the scientific_name, but I’m realizing that these are not unique: there are multiple taxIDs per scientific name. I am trying to decide whether it would be reasonable to take the sum of root_fragCount for all rows with the same scientific name. For example, here are the first 10 matches of Bacillota_I using the gtdb.r220 index.

The first taxID recruits most of the fragments but there are many fragments mapping to the other taxIDs with the same scientific name.

0.34 64338 0 P 10316302 Bacillota_I
0.01 1877 0 P 10315291 Bacillota_I
0.01 1609 0 P 10316628 Bacillota_I
0.00 870 0 P 10055357 Bacillota_I
0.00 857 0 P 10077854 Bacillota_I
0.00 853 0 P 10085662 Bacillota_I
0.00 587 0 P 10315035 Bacillota_I
0.00 273 0 P 10257521 Bacillota_I
0.00 239 0 P 10221036 Bacillota_I
0.00 239 0 P 10223213 Bacillota_I
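
To make the idea concrete, here is a minimal Python sketch of what I have in mind: sum root_fragCount over all taxIDs that share a scientific name at one rank, then line the totals up across samples. The function names (`sum_by_name`, `merge_samples`) and the second sample are made up for illustration, and the column order is only my guess from the list above. Summing is only safe if the same-named taxIDs are separate, non-nested nodes, which is exactly what I am unsure about.

```python
from collections import defaultdict

# Column order as I believe it to be (an assumption, not taken from the docs).
COLUMNS = ["root_relAbund", "root_fragCount", "direct_fragCount",
           "rank_code", "taxID", "scientific_name"]


def sum_by_name(lines, rank_code="P"):
    """Sum root_fragCount over every taxID sharing a scientific name at one rank."""
    totals = defaultdict(int)
    for line in lines:
        # At most 5 splits, so a name containing spaces stays in one field.
        fields = line.strip().split(None, 5)
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        if row["rank_code"] != rank_code:
            continue
        totals[row["scientific_name"].strip()] += int(row["root_fragCount"])
    return dict(totals)


def merge_samples(per_sample):
    """Turn {sample: {name: count}} into {name: {sample: count}}, filling gaps with 0."""
    names = sorted({name for counts in per_sample.values() for name in counts})
    return {name: {sample: counts.get(name, 0)
                   for sample, counts in per_sample.items()}
            for name in names}


# The ten phylum-level rows quoted above.
sample_a = """\
0.34 64338 0 P 10316302 Bacillota_I
0.01 1877 0 P 10315291 Bacillota_I
0.01 1609 0 P 10316628 Bacillota_I
0.00 870 0 P 10055357 Bacillota_I
0.00 857 0 P 10077854 Bacillota_I
0.00 853 0 P 10085662 Bacillota_I
0.00 587 0 P 10315035 Bacillota_I
0.00 273 0 P 10257521 Bacillota_I
0.00 239 0 P 10221036 Bacillota_I
0.00 239 0 P 10223213 Bacillota_I
"""

totals_a = sum_by_name(sample_a.splitlines())
print(totals_a)  # {'Bacillota_I': 71742}
```

With this, the ten rows collapse to a single Bacillota_I count of 71742 for the sample, and `merge_samples({"A": totals_a, "B": totals_b})` would give one row per name with one count per sample.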

Thank you for your help!

mourisl / centrifuger
