taxprofiler / taxpasta

TAXnomic Profile Aggregation and STAndardisation
https://taxpasta.readthedocs.io/
Apache License 2.0

[BUG] Metaphlan4 run error in Microba datasets #143

Closed minhtrung1997 closed 6 months ago

minhtrung1997 commented 8 months ago

Is there an existing issue for this?

Problem description

I tried to run this command in order to get the taxpasta file:

taxpasta standardize -p metaphlan -o <wkdir>/Microba_session1-nomerge/metaphlan/metaphlan4-db/ani100_cLOW_stTrue_r5_pe_null_metaphlan4-db.metaphlan_profile.txt.taxpasta.tsv --add-lineage --taxonomy /home/ktest/pipeline_env/database/NCBI_taxdump-230711/taxdump/ <wkdir>/Microba_session1-nomerge/metaphlan/metaphlan4-db/ani100_cLOW_stTrue_r5_pe_null_metaphlan4-db.metaphlan_profile.txt

It produces this error: (image)

Code sample

Code run:

Traceback:

Environment

### Package Information

| Package  | Version |
|:---------|--------:|
| taxpasta | 0.6.1   |

### Dependency Information

| Package                      | Version     |
|:-----------------------------|------------:|
| bash-kernel                  | **missing** |
| biom-format                  | 2.1.15      |
| depinfo~                     | **missing** |
| jupyter                      | **missing** |
| mkdocs-awesome-pages-plugin~ | **missing** |
| mkdocs-exclude~              | **missing** |
| mkdocs-material~             | **missing** |
| mkdocstrings[python]~        | **missing** |
| numpy~                       | **missing** |
| odfpy                        | **missing** |
| openpyxl                     | **missing** |
| pandas~                      | **missing** |
| pandera~                     | **missing** |
| pre-commit                   | **missing** |
| pyarrow                      | 14.0.2      |
| rich                         | 13.7.0      |
| tabulate~                    | **missing** |
| taxopy~                      | **missing** |
| tox~                         | **missing** |
| typer~                       | **missing** |

### Build Tools Information

| Package    | Version |
|:-----------|--------:|
| pip        | 23.3.2  |
| setuptools | 69.0.3  |
| wheel      | 0.42.0  |

### Platform Information

|         |                          |
|:--------|-------------------------:|
| Linux   | 5.15.0-91-generic-x86_64 |
| CPython | 3.12.1                   |

Anything else?

Error dataframe: (image)

jfy133 commented 8 months ago

@Midnighter is this a summing to 100% issue again?

Midnighter commented 8 months ago

Yes, it is 😬

jfy133 commented 8 months ago

Freaking Maths Shakes fist

minhtrung1997 commented 7 months ago

Currently, I can circumvent this with this notebook: extract_metaphlan_profile.ipynb.txt. Its output is very similar to taxpasta's. Please check it and see whether it can help update taxpasta.

Midnighter commented 7 months ago

Thanks for the effort @minhtrung1997, but parsing the file is not really the issue. We perform a number of validations on profiles passed to taxpasta, and one of them checks whether the relative abundances per rank add up to 100%. Those checks regularly fail due to floating point arithmetic and truncated outputs when profilers write to a text file. (Typically, they only write 6 decimal places which is insufficient when we have tens of millions of reads.)
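To illustrate the truncation effect described above, here is a minimal sketch. It is not taxpasta's validation code: the helper names `write_profile` and `read_total` are made up for this example, and the three equally abundant taxa are invented data.

```python
# Illustrative only: NOT taxpasta's implementation.
# `write_profile` and `read_total` are hypothetical helper names.


def write_profile(percentages, decimals=6):
    """Mimic a profiler writing abundances as fixed-precision text."""
    return [f"{p:.{decimals}f}" for p in percentages]


def read_total(fields):
    """Parse the text fields back into floats and sum them."""
    return sum(float(field) for field in fields)


# Three equally abundant taxa: exactly one third of the community each.
exact = [100 / 3] * 3
fields = write_profile(exact)
total = read_total(fields)

print(fields)             # ['33.333333', '33.333333', '33.333333']
print(total == 100.0)     # False: a strict equality check fails
print(abs(total - 100.0)) # small deviation, on the order of 1e-6
```

Each written value loses everything after the sixth decimal place, so the parsed values can no longer add up to exactly 100, even though the underlying abundances did.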

We could add an option to skip the validation of compositionality.

jfy133 commented 7 months ago

I think an option to skip the compositionality check might be a good idea, given how often that particular check comes up. Maybe something generic like --leniant?

Midnighter commented 7 months ago

I think I'd use a lenient mode for turning off a whole bunch of validations, and for this single one maybe allow defining the acceptable absolute deviation from 100%? Something like --compositionality-threshold 0.1 for up to 10% off?
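A rough sketch of what such a threshold check could look like. This is only an assumption about the proposal, not an existing taxpasta option: the function name `is_compositional` is hypothetical, and `threshold` is read here as a fraction of the total, so 0.1 accepts sums between 90 and 110.

```python
import math


def is_compositional(percentages, threshold=0.1):
    """Return True if the values sum to 100 within a tolerance.

    Hypothetical helper: `threshold` is interpreted as a fraction of
    the total, so 0.1 accepts a sum that is up to 10 points off 100.
    """
    total = math.fsum(percentages)
    return abs(total - 100.0) <= threshold * 100.0


# Truncated output (sum is just below 100) passes with a tolerance...
print(is_compositional([33.333333] * 3))                 # True
# ...but fails when no deviation at all is allowed.
print(is_compositional([33.333333] * 3, threshold=0.0))  # False
# A profile that is 20 points short is still rejected.
print(is_compositional([50.0, 30.0]))                    # False
```

A much smaller default would likely be enough to absorb rounding in text output while still catching profiles that are genuinely incomplete.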

jfy133 commented 7 months ago

Sounds good to me!