Fix metaphlan profiles with duplicate entries by aggregating them

Midnighter commented 3 weeks ago

[x] fix #140
[x] description of feature/fix
[x] tests added/passed
[ ] add an entry to the changelog

@d-callan would you be up for testing if this works as you would expect, please? You can install this directly with

pip install https://github.com/taxprofiler/taxpasta/archive/fix-metaphlan.zip

d-callan commented 3 weeks ago

thanks for your effort on this :) it seems to not err any longer, which is a great improvement! i tried w a pretty small/ and mostly manufactured test case built from subsetting and modifying one of the test files in the pr:

#mpa_vOct22_CHOCOPhlAnSGB_202212
#/usr/local/bin/metaphlan --nproc 6 --input_type fastq 2612.merged.fastq.gz --bowtie2out 2612_se_metaphlan4-db.metaphlan.bowtie2out.txt --bowtie2db metaphlan4 --index mpa_vOct22_CHOCOPhlAnSGB_202212 --biom 2612_se_metaphlan4-db.metaphlan.biom --output_file 2612_se_metaphlan4-db.metaphlan_profile.txt
#707440 reads processed
#SampleID       Metaphlan_Analysis
#clade_name     NCBI_tax_id     relative_abundance      additional_species
k__Bacteria     2       89.92809        
k__Archaea      2157    10.07191        
k__Bacteria|p__Bacteroidetes    2|976   43.03622        
k__Bacteria|p__Actinobacteria   2|201174        29.04669        
k__Bacteria|p__Firmicutes       2|1239  27.90168        
k__Bacteria|p__Firmicutes|c__Clostridia 2|1239|186801   95.1462 
k__Bacteria|p__Firmicutes|c__CFGB9120   2|1239| 2.26587 
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified    2|1239| 1.82756 
k__Bacteria|p__Firmicutes|c__Bacilli    2|1239|91061    0.90652

and i got:

taxonomy_id     count
2       89928090
2157    10071910
976     43036220
201174  29046690
1239    27901680
186801  95146200
91061   906520

i notice there is no entry for 0, which i was expecting in the result. given what i saw quickly in the comments/ code w this pr, i had expected that these two entries in particular would have wound up there:

k__Bacteria|p__Firmicutes|c__CFGB9120   2|1239| 2.26587 
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified    2|1239| 1.82756

was that not the intention, or is it a bug? given the coercion to 'counts' by multiplying by a large int, this loss of data seems particularly bad. it seems to me a person could not reliably recreate the relative abundances they got from metaphlan for the properly classified entities without at min retaining the unclassified and un-ncbi-mappable ones as unclassified.

also, as i was looking at this i wondered.. for the two entries just above here, is it possible/ reasonable to assign them to 1239 (Firmicutes) rather than 0 (unclassified)? i didnt immediately think of a reason to not do that, but maybe i missed something.

Midnighter commented 3 weeks ago

Thank you for taking such a thorough look at it. It indeed sounds like the implementation is still buggy.

And assigning to the next higher rank that has an ID rather than to unclassified makes so much sense! Thank you for the suggestion ☺️

Midnighter commented 3 weeks ago

There was a bug with operator precedence in pandas. Need to always use parentheses...

Midnighter commented 3 weeks ago

I thought about your idea regarding ranks some more and now discovered that both the previous implementation and the one that you propose are wrong. The relative abundance at each rank sums to 100%. In your example,

k__Bacteria|p__Bacteroidetes    2|976   43.03622        
k__Bacteria|p__Actinobacteria   2|201174        29.04669        
k__Bacteria|p__Firmicutes       2|1239  27.90168

Firmicutes comprise ~28% of all bacteria. At the next lower rank we have

k__Bacteria|p__Firmicutes|c__Clostridia 2|1239|186801   95.1462 
k__Bacteria|p__Firmicutes|c__CFGB9120   2|1239| 2.26587 
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified    2|1239| 1.82756 
k__Bacteria|p__Firmicutes|c__Bacilli    2|1239|91061    0.90652

which sum up to ~100% again. It would be wrong to add those abundances to 1239, as that will break the assumption that each rank sums to 100%. It is equally wrong to add them to the unclassified node. Instead, they should just be dropped from the profile.

d-callan commented 3 weeks ago

I see what you mean... And I suppose the two entities in the example are already counted towards 1239 now I think about it. Though removing entries means a rank won't sum to 100 any longer either, it'll just be below 100 rather than above. In the case they're removed, we're likely creating inconsistency in relative abundances from one taxonomic rank to another and also making it impossible to recreate the relative abundances originally produced by metaphlan. Maybe that's the best short term solution anyhow, but it seems a thing to explain carefully to people.

I wonder what other options we have? Can we build a taxonomy from SGBs or something? Or other ideas not for now, but for a more complete solution later?

Midnighter commented 3 weeks ago

Yeah, not great either way. I guess, one could think of one unclassified grouping per rank, but I don't know what identifier to assign to each.

taxprofiler / taxpasta

Fix metaphlan profiles with duplicate entries by aggregating them #153