Open MajoroMask opened 1 year ago
Thank you for the detailed report. It is somewhat curious that the names of the species are distinguished by a letter suffix, but the numeric identifier is the same... And no identifiers for the strains at all. I will look into it but I'm actually not sure what the correct solution should be.
@MajoroMask, the only solution that I can see immediately, is to sum up the relative abundances for the same taxon identifier. This would mean all relative abundances for the species are added together, while the strains would be added to unclassified as there is no identifier. Not really ideal.
Can you think of a better solution?
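To make the proposal concrete, here is a minimal Python sketch of it. This is not taxpasta's actual code. It assumes each profile row holds a clade name, a `|`-separated ID lineage, and a relative abundance, separated by whitespace, as in the rows posted in this thread. The function name `sum_by_taxid` and the `unclassified` label are hypothetical.

```python
from collections import defaultdict

UNCLASSIFIED = "unclassified"  # hypothetical label for rows without an own ID


def sum_by_taxid(lines):
    """Sum relative abundances per taxon identifier.

    The last field of the ID lineage is taken as the row's own identifier.
    Rows where that field is empty (e.g. strain rows) go to UNCLASSIFIED.
    """
    totals = defaultdict(float)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        _clade_name, lineage_ids, abundance = line.split()[:3]
        tax_id = lineage_ids.split("|")[-1]
        totals[tax_id or UNCLASSIFIED] += float(abundance)
    return dict(totals)


# Rows copied from this thread: two strain rows and one class row.
rows = [
    "k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15143 2|1239|186801|186802|||1898207| 0.02366",
    "k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15159 2|1239|186801|186802|||1898207| 0.02308",
    "k__Bacteria|p__Firmicutes|c__Negativicutes 2|1239|909932 5.20485",
]
print(sum_by_taxid(rows))
```

Both strain rows end with an empty ID field, so they end up in the unclassified bucket, which is the "not really ideal" part.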
I have the same error with MetaPhlAn. How can I solve it? I downloaded the database from http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/bowtie2_indexes/mpa_vJan21_CHOCOPhlAnSGB_202103_bt2.tar
@Midnighter I have no idea... Can the authors of MetaPhlAn 4 be reached? Maybe they have a solution for generating an ID in the output.
Another example, if it helps:
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15143 2|1239|186801|186802|||1898207| 0.02366
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15159 2|1239|186801|186802|||1898207| 0.02308
@d-callan thank you for the additional data. Do you have any thoughts on the following? I'm not clear on how to solve this at the moment.
only solution that I can see immediately, is to sum up the relative abundances for the same taxon identifier. This would mean all relative abundances for the species are added together, while the strains would be added to unclassified as there is no identifier. Not really ideal.
@Midnighter I think that's as reasonable as anything.. if I want strains from biobakery tools I'd look to strainphlan rather than metaphlan. It's possible a warning would be good, or maybe making the behavior configurable.
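For what it's worth, a warning plus a configurable behavior could look roughly like the sketch below. This is only an illustration in plain Python; `check_duplicates` and its `on_duplicate` option are made-up names, not an existing taxpasta option.

```python
import warnings
from collections import Counter


def check_duplicates(tax_ids, on_duplicate="warn"):
    """Return the identifiers that occur more than once.

    on_duplicate is a hypothetical switch: "warn" emits a warning,
    "error" raises a ValueError, and "ignore" stays silent.
    """
    if on_duplicate not in {"warn", "error", "ignore"}:
        raise ValueError(f"unknown on_duplicate mode: {on_duplicate!r}")
    duplicated = sorted(t for t, n in Counter(tax_ids).items() if n > 1)
    if duplicated:
        message = (
            f"{len(duplicated)} taxon identifier(s) occur more than once: "
            + ", ".join(duplicated)
        )
        if on_duplicate == "error":
            raise ValueError(message)
        if on_duplicate == "warn":
            warnings.warn(message)
    return duplicated


# Illustrative input: the species ID from this thread appearing twice.
check_duplicates(["1898207", "1898207", "909932"], on_duplicate="warn")
```

The caller could then decide whether to sum the duplicated rows or stop, depending on the mode.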
@Midnighter I'm also wondering if you have a sense of what this would take in terms of effort? I am very interested in getting this working, and would be willing to put effort into it if you wanted.
I think the code change is minimal: 2-3 lines. It will need an extra test case or so.
This one, I think, is fun:
k__Bacteria|p__Firmicutes|c__Negativicutes 2|1239|909932 5.20485
k__Bacteria|p__Actinobacteria|c__Actinomycetia 2|201174|1760 2.05981
k__Bacteria|p__Firmicutes|c__CFGB2834 2|1239| 0.94398
k__Bacteria|p__Proteobacteria|c__Betaproteobacteria 2|1224|28216 0.81827
k__Bacteria|p__Actinobacteria|c__Coriobacteriia 2|201174|84998 0.46979
k__Bacteria|p__Firmicutes|c__CFGB1227 2|1239| 0.404
k__Bacteria|p__Firmicutes|c__CFGB3038 2|1239| 0.18149
k__Bacteria|p__Firmicutes|c__CFGB3054 2|1239| 0.16661
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified 2|1239| 0.12308
k__Bacteria|p__Firmicutes|c__CFGB2906 2|1239| 0.03655
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria 2|1224|1236 0.02883
k__Bacteria|p__Firmicutes|c__CFGB1765 2|1239| 0.02468
k__Bacteria|p__Candidatus_Melainabacteria|c__Candidatus_Melainabacteria_unclassified 2|1798710| 0.00509
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales 2|976|200643|171549 57.0883
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|o__Verrucomicrobiales 2|74201|203494|48461 7.59373
k__Bacteria|p__Firmicutes|c__Negativicutes|o__Veillonellales 2|1239|909932|1843489 5.20485
k__Bacteria|p__Firmicutes|c__CFGB2834|o__OFGB2834 2|1239|| 0.94398
k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales 2|1224|28216|80840 0.81827
k__Bacteria|p__Firmicutes|c__CFGB1227|o__OFGB1227 2|1239|| 0.404
k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales 2|201174|84998|84999 0.38515
k__Bacteria|p__Firmicutes|c__CFGB3038|o__OFGB3038 2|1239|| 0.18149
k__Bacteria|p__Firmicutes|c__CFGB3054|o__OFGB3054 2|1239|| 0.16661
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified|o__Firmicutes_unclassified 2|1239|| 0.12308
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriaceae 2|1239|186801|186802|186806 1.31747
k__Bacteria|p__Firmicutes|c__CFGB2834|o__OFGB2834|f__FGB2834 2|1239||| 0.94398
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Clostridiaceae 2|1239|186801|186802|31979 0.85
k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|f__Sutterellaceae 2|1224|28216|80840|995019 0.81827
k__Bacteria|p__Firmicutes|c__CFGB1227|o__OFGB1227|f__FGB1227 2|1239||| 0.404
k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae 2|201174|84998|84999|84107 0.38515
k__Bacteria|p__Firmicutes|c__CFGB3038|o__OFGB3038|f__FGB3038 2|1239||| 0.18149
Sorry for randomly jumping in here, but I have used MetaPhlAn a fair bit. The clade tax ID values come from NCBI, but the taxa/clade names come from their own clustering/GTDB.
I believe that the authors even kind of discourage using the tax ids.
I don't know if this would cause problems when merging across profilers, but you could add/use the last section of the clade name.
Ex:
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15143 2|1239|186801|186802|||1898207| 0.02366
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15159 2|1239|186801|186802|||1898207| 0.02308
becomes
1898207_SGB15143 0.02366
1898207_SGB15159 0.02308
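A rough Python sketch of that transformation, in case it helps. It is only an illustration: `composite_id` is a made-up helper, and falling back to the nearest non-empty ID in the lineage is my assumption about how to pick the numeric part when the row's own ID is empty.

```python
def composite_id(clade_name, lineage_ids):
    """Build '<tax id>_<last clade name section>' for one profile row.

    The numeric part is the nearest non-empty ID in the lineage (an
    assumption); the label is the last clade name section without its
    rank prefix such as 't__'.
    """
    label = clade_name.split("|")[-1].split("__", 1)[-1]
    tax_id = next((i for i in reversed(lineage_ids.split("|")) if i), None)
    if tax_id is None:
        return label  # no numeric ID anywhere in the lineage
    return f"{tax_id}_{label}"


clade = (
    "k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales"
    "|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified"
    "|s__Clostridiales_bacterium|t__SGB15143"
)
print(composite_id(clade, "2|1239|186801|186802|||1898207|"))  # 1898207_SGB15143
```

Note that the resulting identifiers are strings, not integers.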
Hi @harper357,
No need to apologize, more information is always welcome. Thank you for the explanation also, I was not aware how MetaPhlAn handles this.
Unfortunately, even though such a change looks simple from the outside, it would change taxpasta's internal logic a lot. There's not only the validation part which assumes integers, but also the whole integration with an existing taxonomy. Basically, we only maintain the identifiers and if users desire, we add back names and lineages using the identifiers to get information from a taxonomy. @harper357 do you know if they publish their taxonomy in a format that can be read by taxopy?
@Midnighter I am not completely sure what the format for taxopy is. MetaPhlAn 4's second column is the NCBI TaxIDs. Are you talking about the first column that needs to be in a different format?
We use taxopy to load taxonomies in taxdump format. That means, we normally drop all information from individual profiles except taxon identifiers and their relative abundances. If a user wishes to output names, ranks, or lineages, we retrieve that from the taxonomy.
There are two things that concern me with MetaPhlAn then. 1) You say that they use NCBI identifiers, but actually use a custom clustering. I don't know if that will make a big difference in practice, but it's nonetheless misleading if true. 2) If they have their own clustering, it should be straightforward to create the taxdump output, which would also assign unique identifiers that can be used.
I realize that that will not happen soon, so we still need a solution right now. While I like your suggestion @harper357, it does have big consequences for how taxpasta is built. Need to think about that. It would also mean that the way we use taxonomies would not work for MetaPhlAn.
I'll put this here in case it proves a helpful reference http://segatalab.cibio.unitn.it/data/Pasolli_et_al.html
Also, I'll comment that mapping metaphlan outputs to ncbi taxonomy seems a reasonable use case nonetheless and makes sense to support even if imperfectly.
Having thought about it a bit since yesterday, maybe we need this to actually be two issues? One for supporting MetaPhlAn on the NCBI taxonomy using the solution previously suggested by @Midnighter, with any warnings/flags necessary. That sounded like it'd be easy enough to do to serve as an interim solution and is probably worth supporting anyhow. And then a second issue for supporting MetaPhlAn using a taxonomy built on SGBs. That'd be the more complete solution.
Problem description
As in the title. This report is forwarded from https://github.com/nf-core/taxprofiler/issues/396.
The MetaPhlAn 4 outputs I'm using are in these gists, if that helps: 2612_se_metaphlan4-db.metaphlan_profile.txt and 2613_se_metaphlan4-db.metaphlan_profile.txt
I think the duplicated tax ID (as shown below) caused the error.
I also ran taxpasta standardise on both MetaPhlAn 4 output files. taxpasta works, but the result may have a problem. In the result 'standard_2612.tsv' I got 4 entries with the same tax ID and different counts:
Code sample
Code run:
Traceback:
Traceback is too long, see this gist
At the end it says:
Environment
I'm running taxpasta in a local Docker container, which runs quay.io/biocontainers/taxpasta:0.6.1--pyhdfd78af_0