Open Midnighter opened 3 weeks ago
thanks for your effort on this :) it seems to not err any longer, which is a great improvement! i tried w a pretty small/ and mostly manufactured test case built from subsetting and modifying one of the test files in the pr:
#mpa_vOct22_CHOCOPhlAnSGB_202212
#/usr/local/bin/metaphlan --nproc 6 --input_type fastq 2612.merged.fastq.gz --bowtie2out 2612_se_metaphlan4-db.metaphlan.bowtie2out.txt --bowtie2db metaphlan4 --index mpa_vOct22_CHOCOPhlAnSGB_202212 --biom 2612_se_metaphlan4-db.metaphlan.biom --output_file 2612_se_metaphlan4-db.metaphlan_profile.txt
#707440 reads processed
#SampleID Metaphlan_Analysis
#clade_name NCBI_tax_id relative_abundance additional_species
k__Bacteria 2 89.92809
k__Archaea 2157 10.07191
k__Bacteria|p__Bacteroidetes 2|976 43.03622
k__Bacteria|p__Actinobacteria 2|201174 29.04669
k__Bacteria|p__Firmicutes 2|1239 27.90168
k__Bacteria|p__Firmicutes|c__Clostridia 2|1239|186801 95.1462
k__Bacteria|p__Firmicutes|c__CFGB9120 2|1239| 2.26587
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified 2|1239| 1.82756
k__Bacteria|p__Firmicutes|c__Bacilli 2|1239|91061 0.90652
and i got:
taxonomy_id count
2 89928090
2157 10071910
976 43036220
201174 29046690
1239 27901680
186801 95146200
91061 906520
i notice there is no entry for 0, which i was expecting in the result. given what i saw quickly in the comments/ code w this pr, i had expected that these two entries in particular would have wound up there:
k__Bacteria|p__Firmicutes|c__CFGB9120 2|1239| 2.26587
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified 2|1239| 1.82756
was that not the intention, or is it a bug? given the coercion to 'counts' by multiplying by a large int, this loss of data seems particularly bad. it seems to me a person could not reliably recreate the relative abundances they got from metaphlan for the properly classified entities without at min retaining the unclassified and un-ncbi-mappable ones as unclassified.
also, as i was looking at this i wondered.. for the two entries just above here, is it possible/ reasonable to assign them to 1239 (Firmicutes) rather than 0 (unclassified)? i didnt immediately think of a reason to not do that, but maybe i missed something.
Thank you for taking such a thorough look at it. It indeed sounds like the implementation is still buggy.
And assigning to the next higher rank that has an ID rather than to unclassified makes so much sense! Thank you for the suggestion ☺️
There was a bug with operator precedence in pandas. Need to always use parentheses...
I thought about your idea regarding ranks some more and now discovered that both the previous implementation and the one that you propose are wrong. The relative abundance at each rank sums to 100%. In your example,
k__Bacteria|p__Bacteroidetes 2|976 43.03622
k__Bacteria|p__Actinobacteria 2|201174 29.04669
k__Bacteria|p__Firmicutes 2|1239 27.90168
Firmicutes comprise ~28% of all bacteria. At the next lower rank we have
k__Bacteria|p__Firmicutes|c__Clostridia 2|1239|186801 95.1462
k__Bacteria|p__Firmicutes|c__CFGB9120 2|1239| 2.26587
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified 2|1239| 1.82756
k__Bacteria|p__Firmicutes|c__Bacilli 2|1239|91061 0.90652
which sum up to ~100% again. It would be wrong to add those abundances to 1239, as that will break the assumption that each rank sums to 100%. It is equally wrong to add them to the unclassified node. Instead, they should just be dropped from the profile.
I see what you mean... And I suppose the two entities in the example are already counted towards 1239 now I think about it. Though removing entries means a rank won't sum to 100 any longer either, it'll just be below 100 rather than above. In the case they're removed, we're likely creating inconsistency in relative abundances from one taxonomic rank to another and also making it impossible to recreate the relative abundances originally produced by metaphlan. Maybe that's the best short term solution anyhow, but it seems a thing to explain carefully to people.
I wonder what other options we have? Can we build a taxonomy from SGBs or something? Or other ideas not for now, but for a more complete solution later?
Yeah, not great either way. I guess, one could think of one unclassified grouping per rank, but I don't know what identifier to assign to each.
@d-callan would you be up for testing if this works as you would expect, please? You can install this directly with