Open shannonmargaret opened 3 hours ago
Thank you for finding this issue! In GTDB, the scientific name should be unique and show up only once at each taxonomy rank in the taxonomy tree. I have found the bug and fixed the code, and am rebuilding the index now. Meanwhile, taking sum based on the scientific name is a good solution. I think this issue will also affect Centrifuger's accuracy for multi-classified reads.
I am trying to figure out how to merge the read counts in the outputs from centrifuger-kreport. I’ve separated the report by taxonomic level and now I’m trying to merge the counts across samples. I believe these are the column IDs for the kreport format.
I thought I could use the scientific_name, but I’m realizing that these are not unique and there are multiple taxIDs per scientific name. I am trying to decide if it would be reasonable take the sum of root_fragCounts for all rows with the same scientific name. For example, here are the first 10 matches of Bacillota_I using the gtdb.r220 index.
The first taxID recruits most of the fragments but there are many fragments mapping to the other taxIDs with the same scientific name.
0.34 64338 0 P 10316302 Bacillota_I 0.01 1877 0 P 10315291 Bacillota_I 0.01 1609 0 P 10316628 Bacillota_I 0.00 870 0 P 10055357 Bacillota_I 0.00 857 0 P 10077854 Bacillota_I 0.00 853 0 P 10085662 Bacillota_I 0.00 587 0 P 10315035 Bacillota_I 0.00 273 0 P 10257521 Bacillota_I 0.00 239 0 P 10221036 Bacillota_I 0.00 239 0 P 10223213 Bacillota_I
Thank you for your help!