taxprofiler / taxpasta

TAXnomic Profile Aggregation and STAndardisation
https://taxpasta.readthedocs.io/
Apache License 2.0

[BUG] MetaPhlAn 4 output with duplicate clade tax id is not supported #140

Open MajoroMask opened 1 year ago

MajoroMask commented 1 year ago

Is there an existing issue for this?

Problem description

As the title says. This report is forwarded from https://github.com/nf-core/taxprofiler/issues/396.

The MetaPhlAn 4 output files I'm using are in these gists, in case they help: 2612_se_metaphlan4-db.metaphlan_profile.txt and 2613_se_metaphlan4-db.metaphlan_profile.txt

I think the duplicated tax ID (shown below) caused the error.

cat 2612_se_metaphlan4-db.metaphlan_profile.txt | grep '165179'
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_A        2|976|200643|171549|171552|838|165179       15.15712
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_C        2|976|200643|171549|171552|838|165179       3.48391
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_B        2|976|200643|171549|171552|838|165179       1.31197
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_F        2|976|200643|171549|171552|838|165179       0.34791
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_A|t__SGB1626     2|976|200643|171549|171552|838|165179|      15.15712
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_C|t__SGB1644     2|976|200643|171549|171552|838|165179|      3.48391 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_sp_TF12_30,k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_sp_AM23_5
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_B|t__SGB1613     2|976|200643|171549|171552|838|165179|      1.31197
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_F|t__SGB1614     2|976|200643|171549|171552|838|165179|      0.34791

I also ran taxpasta standardise on both MetaPhlAn 4 output files. taxpasta runs, but the result may be problematic.

taxpasta standardise -p metaphlan -o standard_2612.tsv 2612_se_metaphlan4-db.metaphlan_profile.txt
[02:43:32] WARNING  Combining 122 entries with unclassified taxa in the profile.             metaphlan_profile_standardisation_service.py:94
           INFO     Write result to 'standard_2612.tsv'.

In the resulting 'standard_2612.tsv' I got 4 entries with the same tax ID and different counts:

cat standard_2612.tsv | grep '^165179\b'
165179  15157120
165179  3483910
165179  1311970
165179  347910
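
A quick way to confirm which identifiers are repeated in a standardised profile is to count them. The sketch below uses only the Python standard library and an inline copy of the four rows above. It assumes the file is tab-separated with a header row naming the columns `taxonomy_id` and `count`; the function name `duplicated_ids` is made up for this illustration.

```python
import csv
import io
from collections import Counter

# Inline copy of the rows shown above. The header names are an assumption
# about the standardised TSV layout.
STANDARD_TSV = (
    "taxonomy_id\tcount\n"
    "165179\t15157120\n"
    "165179\t3483910\n"
    "165179\t1311970\n"
    "165179\t347910\n"
)


def duplicated_ids(handle):
    """Return {taxonomy_id: occurrences} for identifiers seen more than once."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter(row["taxonomy_id"] for row in reader)
    return {tax_id: n for tax_id, n in counts.items() if n > 1}


duplicates = duplicated_ids(io.StringIO(STANDARD_TSV))
print(duplicates)
```

To check a real file, pass `open("standard_2612.tsv", newline="")` instead of the `io.StringIO` object.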

Code sample

Code run:

taxpasta merge \
    -p metaphlan -o metaphlan_metaphlan4-db.tsv --add-name --add-rank --add-lineage --add-id-lineage --add-rank-lineage \
    --taxonomy taxdump \
    2612_se_metaphlan4-db.metaphlan_profile.txt 2613_se_metaphlan4-db.metaphlan_profile.txt

Traceback:

Traceback is too long, see this gist

At the end it says:

ValueError: Index has duplicate keys: CategoricalIndex([165179], categories=[0, 
2, 468, 469, ..., 2003188, 2082587, 2292893, 2887326], ordered=False, 
dtype='category', name='taxonomy_id')

Environment

I'm running taxpasta in a local Docker container, which runs quay.io/biocontainers/taxpasta:0.6.1--pyhdfd78af_0

Anything else?

No response

Midnighter commented 1 year ago

Thank you for the detailed report. It is somewhat curious that the names of the species are distinguished by a letter suffix, but the numeric identifier is the same... And no identifiers for the strains at all. I will look into it but I'm actually not sure what the correct solution should be.

Midnighter commented 10 months ago

@MajoroMask, the only solution that I can see immediately is to sum up the relative abundances for the same taxon identifier. This would mean all relative abundances for the species are added together, while the strains would be added to unclassified as there is no identifier. Not really ideal.
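
The idea can be sketched in a few lines of plain Python. This is only an illustration of the aggregation, not taxpasta's actual code: it takes the last element of the `clade_taxid` lineage as the identifier, treats an empty element as unclassified, and sums per identifier. The rows are the Prevotella rows from the report; using `0` for unclassified and the helper name `aggregate` are assumptions.

```python
from collections import defaultdict

PREFIX = (
    "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|"
    "f__Prevotellaceae|g__Prevotella|"
)
IDS = "2|976|200643|171549|171552|838|165179"

# (clade_name, clade_taxid, relative_abundance), copied from the report.
ROWS = [
    (PREFIX + "s__Prevotella_copri_clade_A", IDS, 15.15712),
    (PREFIX + "s__Prevotella_copri_clade_C", IDS, 3.48391),
    (PREFIX + "s__Prevotella_copri_clade_B", IDS, 1.31197),
    (PREFIX + "s__Prevotella_copri_clade_F", IDS, 0.34791),
    (PREFIX + "s__Prevotella_copri_clade_A|t__SGB1626", IDS + "|", 15.15712),
    (PREFIX + "s__Prevotella_copri_clade_C|t__SGB1644", IDS + "|", 3.48391),
    (PREFIX + "s__Prevotella_copri_clade_B|t__SGB1613", IDS + "|", 1.31197),
    (PREFIX + "s__Prevotella_copri_clade_F|t__SGB1614", IDS + "|", 0.34791),
]

UNCLASSIFIED = 0  # assumption: identifier used for unclassified entries


def aggregate(rows):
    """Sum relative abundances per identifier (last element of the lineage)."""
    totals = defaultdict(float)
    for _clade_name, lineage, abundance in rows:
        last = lineage.split("|")[-1]
        # Strain rows end in "|", so their last element is empty.
        tax_id = int(last) if last else UNCLASSIFIED
        totals[tax_id] += abundance
    return dict(totals)


totals = aggregate(ROWS)
for tax_id, abundance in sorted(totals.items()):
    print(tax_id, round(abundance, 5))
```

With these rows, the four species-level clades collapse into a single entry for 165179, and the four strain rows end up under the unclassified identifier.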

Can you think of a better solution?

luozhy88 commented 9 months ago

I have the same error with MetaPhlAn. How can I solve it? I downloaded the database from http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/bowtie2_indexes/mpa_vJan21_CHOCOPhlAnSGB_202103_bt2.tar

MajoroMask commented 8 months ago

> @MajoroMask, the only solution that I can see immediately, is to sum up the relative abundances for the same taxon identifier. This would mean all relative abundances for the species are added together, while the strains would be added to unclassified as there is no identifier. Not really ideal.
>
> Can you think of a better solution?

@Midnighter I have no idea... Can the authors of MetaPhlAn 4 be reached? Maybe they have a solution for generating an ID in the output.

d-callan commented 4 months ago

Another example, if it helps:

k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15143     2|1239|186801|186802|||1898207| 0.02366
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15159     2|1239|186801|186802|||1898207| 0.02308

Midnighter commented 4 months ago

@d-callan thank you for the additional data. Do you have any thoughts on the following? I'm not clear on how to solve this at the moment.

> only solution that I can see immediately, is to sum up the relative abundances for the same taxon identifier. This would mean all relative abundances for the species are added together, while the strains would be added to unclassified as there is no identifier. Not really ideal.

d-callan commented 4 months ago

@Midnighter I think that's as reasonable as anything. If I wanted strains from the bioBakery tools, I'd look to StrainPhlAn rather than MetaPhlAn. A warning would possibly be good, or maybe making the behavior configurable.
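
A rough sketch of what a warning plus configurable behavior could look like, in plain Python. Everything here is hypothetical: `resolve_duplicates` and its `on_duplicate` parameter are names invented for the illustration and are not part of taxpasta.

```python
import warnings
from collections import Counter


def resolve_duplicates(records, on_duplicate="sum"):
    """Combine (tax_id, abundance) pairs that share an identifier.

    on_duplicate="sum" adds the abundances and emits a warning;
    on_duplicate="error" raises instead. Both names are hypothetical.
    """
    if on_duplicate not in {"sum", "error"}:
        raise ValueError(f"Unknown policy: {on_duplicate!r}")
    records = list(records)
    occurrences = Counter(tax_id for tax_id, _ in records)
    duplicated = sorted(t for t, n in occurrences.items() if n > 1)
    if duplicated and on_duplicate == "error":
        raise ValueError(f"Duplicate taxonomy identifiers: {duplicated}")
    if duplicated:
        warnings.warn(
            f"Summing abundances for duplicate taxonomy identifiers: {duplicated}"
        )
    totals = {}
    for tax_id, abundance in records:
        totals[tax_id] = totals.get(tax_id, 0.0) + abundance
    return totals


# Species-level Prevotella rows from the original report.
RECORDS = [
    (165179, 15.15712),
    (165179, 3.48391),
    (165179, 1.31197),
    (165179, 0.34791),
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    totals = resolve_duplicates(RECORDS)

print(totals)
print([str(w.message) for w in caught])
```

Defaulting to `"sum"` with a warning keeps existing pipelines running, while `"error"` lets users who care about the distinction fail fast.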

d-callan commented 4 months ago

@Midnighter I'm also wondering if you have a sense of what this would take in terms of effort? I am very interested in getting this working and would be willing to put effort into it if you wanted.

Midnighter commented 4 months ago

I think the code change is minimal, 2-3 lines. It will need an extra test case or so.

d-callan commented 4 months ago

This one, I think, is fun:

k__Bacteria|p__Firmicutes|c__Negativicutes      2|1239|909932   5.20485
k__Bacteria|p__Actinobacteria|c__Actinomycetia  2|201174|1760   2.05981
k__Bacteria|p__Firmicutes|c__CFGB2834   2|1239| 0.94398
k__Bacteria|p__Proteobacteria|c__Betaproteobacteria     2|1224|28216    0.81827
k__Bacteria|p__Actinobacteria|c__Coriobacteriia 2|201174|84998  0.46979
k__Bacteria|p__Firmicutes|c__CFGB1227   2|1239| 0.404
k__Bacteria|p__Firmicutes|c__CFGB3038   2|1239| 0.18149
k__Bacteria|p__Firmicutes|c__CFGB3054   2|1239| 0.16661
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified    2|1239| 0.12308
k__Bacteria|p__Firmicutes|c__CFGB2906   2|1239| 0.03655
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria    2|1224|1236     0.02883
k__Bacteria|p__Firmicutes|c__CFGB1765   2|1239| 0.02468
k__Bacteria|p__Candidatus_Melainabacteria|c__Candidatus_Melainabacteria_unclassified    2|1798710|      0.00509
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales     2|976|200643|171549     57.0883
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|o__Verrucomicrobiales        2|74201|203494|48461    7.59373
k__Bacteria|p__Firmicutes|c__Negativicutes|o__Veillonellales    2|1239|909932|1843489   5.20485
k__Bacteria|p__Firmicutes|c__CFGB2834|o__OFGB2834       2|1239||        0.94398
k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales  2|1224|28216|80840      0.81827
k__Bacteria|p__Firmicutes|c__CFGB1227|o__OFGB1227       2|1239||        0.404
k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales     2|201174|84998|84999    0.38515
k__Bacteria|p__Firmicutes|c__CFGB3038|o__OFGB3038       2|1239||        0.18149
k__Bacteria|p__Firmicutes|c__CFGB3054|o__OFGB3054       2|1239||        0.16661
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified|o__Firmicutes_unclassified 2|1239||        0.12308
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriaceae      2|1239|186801|186802|186806     1.31747
k__Bacteria|p__Firmicutes|c__CFGB2834|o__OFGB2834|f__FGB2834    2|1239|||       0.94398
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Clostridiaceae      2|1239|186801|186802|31979      0.85
k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|f__Sutterellaceae        2|1224|28216|80840|995019       0.81827
k__Bacteria|p__Firmicutes|c__CFGB1227|o__OFGB1227|f__FGB1227    2|1239|||       0.404
k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae        2|201174|84998|84999|84107      0.38515
k__Bacteria|p__Firmicutes|c__CFGB3038|o__OFGB3038|f__FGB3038    2|1239|||       0.18149

harper357 commented 3 months ago

Sorry for randomly jumping in here, but I have used MetaPhlAn a fair bit. The clade tax ID values come from NCBI, but the taxa/clade names come from their own clustering/GTDB.

I believe that the authors even kind of discourage using the tax ids.

I don't know if this would cause problems when merging across profilers, but you could add/use the last section of the clade name.

Ex:

k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15143     2|1239|186801|186802|||1898207| 0.02366
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15159     2|1239|186801|186802|||1898207| 0.02308 

becomes


1898207_SGB15143 0.02366
1898207_SGB15159 0.02308 
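
A minimal sketch of that key construction in plain Python, assuming the identifier is the last non-empty element of the `clade_taxid` column and the suffix is the `t__` label at the end of the clade name. The helper name `composite_key` is made up for the example.

```python
def composite_key(clade_name, clade_taxid):
    """Join the last non-empty NCBI id with the SGB label of a strain row."""
    ids = [part for part in clade_taxid.split("|") if part]
    leaf = clade_name.split("|")[-1]
    if not ids:
        raise ValueError("no taxonomy identifier in lineage")
    if not leaf.startswith("t__"):
        raise ValueError("expected a strain-level (t__) clade")
    return f"{ids[-1]}_{leaf[len('t__'):]}"


CLADE = (
    "k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|"
    "f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|"
    "s__Clostridiales_bacterium|"
)
TAXIDS = "2|1239|186801|186802|||1898207|"

key_a = composite_key(CLADE + "t__SGB15143", TAXIDS)
key_b = composite_key(CLADE + "t__SGB15159", TAXIDS)
print(key_a, key_b)
```

Note that the resulting keys are strings rather than integers, so they would not pass a validation step that expects numeric identifiers.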

Midnighter commented 3 months ago

Hi @harper357,

No need to apologize, more information is always welcome. Thank you for the explanation also, I was not aware how MetaPhlAn handles this.

Unfortunately, even though such a change looks simple from the outside, it would change taxpasta's internal logic a lot. There's not only the validation part which assumes integers, but also the whole integration with an existing taxonomy. Basically, we only maintain the identifiers and if users desire, we add back names and lineages using the identifiers to get information from a taxonomy. @harper357 do you know if they publish their taxonomy in a format that can be read by taxopy?

harper357 commented 3 months ago

@Midnighter I am not completely sure what format taxopy expects. MetaPhlAn 4's second column contains the NCBI TaxIDs. Are you saying that the first column needs to be in a different format?

Midnighter commented 3 months ago

We use taxopy to load taxonomies in taxdump format. That means, we normally drop all information from individual profiles except taxon identifiers and their relative abundances. If a user wishes to output names, ranks, or lineages, we retrieve that from the taxonomy.
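
To illustrate why only the identifier matters, here is a toy sketch of such a lookup in plain Python. It does not use taxopy or its API. It parses a tiny in-memory excerpt in taxdump layout (fields separated by `\t|\t`, lines ending in `\t|`) and resolves an identifier to its name and lineage. The excerpt is an assumption assembled for the example: the identifiers follow the lineage from the report, while the ranks and the direct parent links to a root node `1` are simplified.

```python
def dmp_line(*fields):
    """Format fields the way nodes.dmp and names.dmp separate them."""
    return "\t|\t".join(fields) + "\t|"


# Toy excerpt: (tax_id, parent_id, rank, scientific name). Simplified, not real NCBI data.
TOY = [
    ("1", "1", "no rank", "root"),
    ("2", "1", "kingdom", "Bacteria"),
    ("976", "2", "phylum", "Bacteroidetes"),
    ("200643", "976", "class", "Bacteroidia"),
    ("171549", "200643", "order", "Bacteroidales"),
    ("171552", "171549", "family", "Prevotellaceae"),
    ("838", "171552", "genus", "Prevotella"),
    ("165179", "838", "species", "Prevotella copri"),
]
NODES_DMP = "\n".join(dmp_line(t, p, r) for t, p, r, _ in TOY)
NAMES_DMP = "\n".join(dmp_line(t, n, "", "scientific name") for t, _, _, n in TOY)


def parse_dmp(text):
    """Split taxdump-style text into lists of fields."""
    rows = []
    for line in text.splitlines():
        if line.endswith("\t|"):
            line = line[:-2]
        rows.append(line.split("\t|\t"))
    return rows


def load(nodes_text, names_text):
    """Build parent, rank, and scientific-name lookups keyed by integer id."""
    parents, ranks, names = {}, {}, {}
    for tax_id, parent, rank, *_ in parse_dmp(nodes_text):
        parents[int(tax_id)] = int(parent)
        ranks[int(tax_id)] = rank
    for tax_id, name, _unique, name_class, *_ in parse_dmp(names_text):
        if name_class == "scientific name":
            names[int(tax_id)] = name
    return parents, ranks, names


def lineage(tax_id, parents):
    """Walk from the given identifier up to the root (leaf first)."""
    path = [tax_id]
    while parents[path[-1]] != path[-1]:
        path.append(parents[path[-1]])
    return path


parents, ranks, names = load(NODES_DMP, NAMES_DMP)
path = lineage(165179, parents)
print(names[165179], ranks[165179])
print(";".join(names[t] for t in reversed(path)))
```

Because the lookup is keyed on the numeric identifier alone, a profile that reports several clades under 165179 cannot get distinct names back from the taxonomy.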

There are two things that concern me with MetaPhlAn then. 1) You say that they use NCBI identifiers, but actually use a custom clustering. I don't know if that will practically make a big difference, but it's nonetheless misleading if true. 2) If they have their own clustering, it is straightforward to create the taxdump output, which will also assign unique identifiers that can be used.

I realize that that will not happen soon, so we still need a solution right now. While I like your suggestion @harper357, it does have big consequences for how taxpasta is built. Need to think about that. It would also mean that the way we use taxonomies would not work for MetaPhlAn.

d-callan commented 3 months ago

I'll put this here in case it proves a helpful reference http://segatalab.cibio.unitn.it/data/Pasolli_et_al.html

Also, I'll comment that mapping MetaPhlAn outputs to the NCBI taxonomy seems a reasonable use case nonetheless and makes sense to support, even if imperfectly.

d-callan commented 3 months ago

Having thought about it a bit since yesterday, maybe we need this to actually be two issues? One for supporting MetaPhlAn on the NCBI taxonomy using the solution previously suggested by @Midnighter, with any warnings/flags necessary. That sounded like it would be easy enough to do as an interim solution and is probably worth supporting anyhow. And then a second issue for supporting MetaPhlAn using a taxonomy built on SGBs. That would be the more complete solution.