taxprofiler / taxpasta

TAXnomic Profile Aggregation and STAndardisation
https://taxpasta.readthedocs.io/
Apache License 2.0

[BUG] (I think): "Unrecognized file type extension '.biom'" #138

Closed emilyvansyoc closed 10 months ago

emilyvansyoc commented 1 year ago

Is there an existing issue for this?

Problem description

Thanks for writing such a handy and helpful software!

I am trying to convert krakenUniq and kraken2 reports to a biom table. Kraken-biom was always my go-to but it doesn't support krakenUniq.

I ran the following under a bioconda install: pip install 'taxpasta[biom]'

taxpasta --install-completion (by the way, I didn't see this described in the README, only in the taxpasta help page; I'm not sure whether it needs to be run)

Then I restarted the shell and ran `taxpasta --show-completion`, which printed:

_taxpasta_completion() {
    local IFS=$' '
    COMPREPLY=( $( env COMP_WORDS="${COMP_WORDS[*]}" \
                   COMP_CWORD=$COMP_CWORD \
                   _TAXPASTA_COMPLETE=complete_bash $1 ) )
    return 0
}

complete -o default -F _taxpasta_completion taxpasta

Then I ran `taxpasta standardize -p krakenuniq -o test.biom KUNIQREPORT.txt`

It gives the following error:

[12:44:38] CRITICAL Unrecognized file type extension '.biom'.    standardise.py:68
           CRITICAL Please rename the output or set the '--output-format' explicitly.

Code sample

Code run:

conda install -c bioconda taxpasta
pip install 'taxpasta[biom]'
taxpasta --install-completion
# exit and restart terminal
taxpasta --show-completion
taxpasta standardize -p krakenuniq -o test.biom KUNIQREPORT.txt

Traceback:

[12:44:38] CRITICAL Unrecognized file type extension '.biom'.                                                                          standardise.py:68
           CRITICAL Please rename the output or set the '--output-format' explicitly.       

Environment

### note that I'm running this on an institutional cluster
$ depinfo --markdown taxpasta

### Package Information
| Package  | Version |
|:---------|--------:|
| taxpasta |   0.6.0 |

### Dependency Information
| Package                      |     Version |
|:-----------------------------|------------:|
| bash-kernel                  | **missing** |
| biom-format                  |      2.1.15 |
| depinfo~                     | **missing** |
| jupyter                      | **missing** |
| mkdocs-awesome-pages-plugin~ | **missing** |
| mkdocs-exclude~              | **missing** |
| mkdocs-material~             | **missing** |
| mkdocstrings[python]~        | **missing** |
| numpy~                       | **missing** |
| odfpy                        | **missing** |
| openpyxl                     | **missing** |
| pandas~                      | **missing** |
| pandera~                     | **missing** |
| pre-commit                   | **missing** |
| pyarrow                      |      11.0.0 |
| rich                         |      13.5.1 |
| tabulate~                    | **missing** |
| taxopy~                      | **missing** |
| tox~                         | **missing** |
| typer~                       | **missing** |

### Build Tools Information
| Package    | Version |
|:-----------|--------:|
| pip        |  23.2.1 |
| setuptools |  68.0.0 |
| wheel      |  0.38.4 |

### Platform Information
|         |                                     |
|:--------|------------------------------------:|
| Linux   | 4.18.0-477.21.1.el8_8.x86_64-x86_64 |
| CPython |                              3.9.17 |

Anything else?

No response

jfy133 commented 1 year ago

I can confirm the bug with one of the taxprofiler KrakenUniq outputs, and also with KMCP output, so I think it's general @Midnighter

Midnighter commented 1 year ago

The reason for this is that so far, we don't support BIOM output for the standardize command. If you look at the help text it's not among the supported formats. The reason for that is that BIOM is always meant for multiple samples. Of course, one could create a BIOM file with a single column for only one sample, but I don't quite see the point of it.

Do you have a need for it to be in BIOM format?

emilyvansyoc commented 1 year ago

Thanks so much for your rapid reply!

Unfortunately I can't get a BIOM table out of merge either:

taxpasta merge -p krakenuniq -o test.biom kuniq.txt kuniq1.txt

[09:48:05] CRITICAL The desired file format 'BIOM' is currently not available. Please pip install 'taxpasta[biom]' to support it.    merge.py:334

pip install 'taxpasta[biom]'

Requirement already satisfied: taxpasta[biom] in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (0.6.0)
Requirement already satisfied: depinfo~=2.2 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from taxpasta[biom]) (2.2.0)
Requirement already satisfied: numpy~=1.20 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from taxpasta[biom]) (1.24.3)
Requirement already satisfied: pandas~=1.4 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from taxpasta[biom]) (1.5.3)
Requirement already satisfied: pandera~=0.14 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from taxpasta[biom]) (0.17.1)
Requirement already satisfied: taxopy~=0.10 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from taxpasta[biom]) (0.12.0)
Requirement already satisfied: typer~=0.6 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from taxpasta[biom]) (0.9.0)
Requirement already satisfied: biom-format in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from taxpasta[biom]) (2.1.15)
Requirement already satisfied: python-dateutil>=2.8.1 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from pandas~=1.4->taxpasta[biom]) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from pandas~=1.4->taxpasta[biom]) (2023.3.post1)
Requirement already satisfied: multimethod in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from pandera~=0.14->taxpasta[biom]) (1.9.1)
Requirement already satisfied: packaging>=20.0 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from pandera~=0.14->taxpasta[biom]) (23.1)
Requirement already satisfied: pydantic in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from pandera~=0.14->taxpasta[biom]) (2.4.2)
Requirement already satisfied: typeguard>=3.0.2 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from pandera~=0.14->taxpasta[biom]) (4.1.5)
Requirement already satisfied: typing-inspect>=0.6.0 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from pandera~=0.14->taxpasta[biom]) (0.9.0)
Requirement already satisfied: wrapt in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from pandera~=0.14->taxpasta[biom]) (1.15.0)
Requirement already satisfied: click<9.0.0,>=7.1.1 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from typer~=0.6->taxpasta[biom]) (8.1.7)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from typer~=0.6->taxpasta[biom]) (4.7.1)
Requirement already satisfied: scipy>=1.3.1 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from biom-format->taxpasta[biom]) (1.8.1)
Requirement already satisfied: h5py in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from biom-format->taxpasta[biom]) (3.7.0)
Requirement already satisfied: six>=1.5 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from python-dateutil>=2.8.1->pandas~=1.4->taxpasta[biom]) (1.16.0)
Requirement already satisfied: importlib-metadata>=3.6 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from typeguard>=3.0.2->pandera~=0.14->taxpasta[biom]) (6.8.0)
Requirement already satisfied: mypy-extensions>=0.3.0 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from typing-inspect>=0.6.0->pandera~=0.14->taxpasta[biom]) (1.0.0)
Requirement already satisfied: annotated-types>=0.4.0 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from pydantic->pandera~=0.14->taxpasta[biom]) (0.5.0)
Requirement already satisfied: pydantic-core==2.10.1 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from pydantic->pandera~=0.14->taxpasta[biom]) (2.10.1)
Requirement already satisfied: zipp>=0.5 in /storage/work/epb5360/conda_envs/myenv/lib/python3.9/site-packages (from importlib-metadata>=3.6->typeguard>=3.0.2->pandera~=0.14->taxpasta[biom]) (3.16.2)

taxpasta merge -p krakenuniq -o test.biom kuniq.txt kuniq1.txt

[09:50:00] CRITICAL The desired file format 'BIOM' is currently not available. Please pip install 'taxpasta[biom]' to support it.    merge.py:334

Midnighter commented 1 year ago

Hmm, that pip install command didn't actually install anything. I'll look into that. In the meantime, you can fix it by running

pip install biom-format
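
When pip reports biom-format as already installed but taxpasta still says BIOM is unavailable, one thing worth checking is whether the package can actually be imported by the interpreter that runs taxpasta. The following is only a diagnostic sketch: it assumes the problem is an import failure (for example a broken compiled dependency such as h5py, which appears in the pip output above), which this thread does not confirm. `check_import` is a hypothetical helper name, not part of taxpasta.

```python
import importlib
import sys


def check_import(module_name):
    """Try to import a module; return (ok, detail) where detail is its path or the error."""
    try:
        module = importlib.import_module(module_name)
    except Exception as error:  # ImportError, or a failure inside a compiled dependency
        return False, f"{type(error).__name__}: {error}"
    return True, getattr(module, "__file__", None) or "(built-in)"


if __name__ == "__main__":
    # Shows which Python is in use; it should live in the same conda environment as taxpasta.
    print("interpreter:", sys.executable)
    # 'biom' is the import name of the biom-format distribution.
    for name in ("taxpasta", "biom", "h5py"):
        ok, detail = check_import(name)
        print(f"{name:10s} {'ok' if ok else 'FAILED'}  {detail}")
```

Run it with the `python` from the activated environment. A `FAILED` line for `biom` or `h5py` would point to a broken installation rather than a missing one.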

emilyvansyoc commented 1 year ago

I fixed the issue by installing taxpasta into a new, empty conda environment:

conda create -n taxpasta

conda activate taxpasta

conda install -c bioconda taxpasta

pip install 'taxpasta[biom]'

And that worked:

$ taxpasta merge -p krakenuniq -o test.biom kuniq.txt kuniq1.txt
[09:39:59] INFO Write result to 'test.biom'.    merge.py:453
(/storage/work/epb5360/conda_envs/taxpasta)

However, it seems like only the taxID is carried through to the BIOM table, i.e., there is no taxonomy string? And it writes only an HDF5-format BIOM table, which is not compatible with downstream programs like phyloseq... Any way you could add a --fmt flag for the option to convert to human-readable JSON BIOM format, and add a flag or separate argument to parse out the taxonomy? For the way I do microbiome analyses, the taxpasta package is not usable without taxonomy.

emilyvansyoc commented 1 year ago

One more comment here... it's not clear if the counts are collapsed down to the lowest taxonomy level. Kraken and KrakenUniq (and I'm assuming Bracken) are really hard to parse to a usable format because they give counts for all taxonomic levels. The kraken-biom function collapses this by the lowest taxonomic class that is assigned. KrakenUniq is super annoying (functional? TBD) because it assigns a gazillion taxonomic levels past the D,P,O,C,F,G,S assignments. Is Taxpasta collapsing down to the subspecies, species, or genus level, or is it returning all lines of the KrakenUniq report, which include counts at higher taxonomic levels?

Thanks!

Midnighter commented 1 year ago

> However, it seems like only the taxID is carried through to the BIOM table, i.e., there is no taxonomy string? And it writes only an HDF5-format BIOM table, which is not compatible with downstream programs like phyloseq... Any way you could add a --fmt flag for the option to convert to human-readable JSON BIOM format, and add a flag or separate argument to parse out the taxonomy? For the way I do microbiome analyses, the taxpasta package is not usable without taxonomy.

The latest version also adds a taxtable to the row data if you provide the --taxonomy option. That is probably not clearly documented. (Suggestions welcome!)

As far as I know, JSON is only supported for BIOM format 1.0, whereas in 2.x HDF5 is the standard. In any case, I am perfectly able to parse the resulting file with phyloseq.

taxpasta merge --wide --profiler kraken2 --output result.biom --ignore-errors --taxonomy ~/.taxonkit tests/data/kraken2/*
> phyloseq::import_biom("result.biom")
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 56 taxa and 5 samples ]
tax_table()   Taxonomy Table:    [ 56 taxa by 17 taxonomic ranks ]
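
If it is unclear which BIOM flavour a given file is, its first bytes can be inspected. The sketch below relies on two facts: an HDF5 file normally begins with a fixed 8-byte signature, and a BIOM 1.0 file is a JSON object and so starts with `{`. The function name `sniff_biom_flavour` is made up for this illustration, and it only looks at offset 0, so an HDF5 file with a user block in front would be reported as unknown.

```python
# The 8-byte signature at the start of an HDF5 file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def sniff_biom_flavour(path):
    """Guess whether a BIOM file is HDF5 (2.x) or JSON (1.0) from its first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(64)
    if head.startswith(HDF5_SIGNATURE):
        return "hdf5"
    # JSON documents may start with whitespace before the opening brace.
    if head.lstrip().startswith(b"{"):
        return "json"
    return "unknown"
```

For example, `sniff_biom_flavour("result.biom")` would be expected to return `"hdf5"` for the file written by the merge command above, given that taxpasta writes HDF5 as discussed in this thread.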

If you also need a tree (I guess you do), then I currently can't offer a simple solution for that. It should not be too hard to convert the tax table into a tree. I'm talking to @antagomir and @TuomasBorman about a solution in mia and the TreeSummarizedExperiment. It basically requires supporting more generalized ranks than the standard seven, but can already convert a tax table to a tree with the default ranks. I suppose that tree could then also be used with phyloseq if you prefer that.

Midnighter commented 1 year ago

> One more comment here... it's not clear if the counts are collapsed down to the lowest taxonomy level. Kraken and KrakenUniq (and I'm assuming Bracken) are really hard to parse to a usable format because they give counts for all taxonomic levels. The kraken-biom function collapses this by the lowest taxonomic class that is assigned. KrakenUniq is super annoying (functional? TBD) because it assigns a gazillion taxonomic levels past the D,P,O,C,F,G,S assignments. Is Taxpasta collapsing down to the subspecies, species, or genus level, or is it returning all lines of the KrakenUniq report, which include counts at higher taxonomic levels?

By default, taxpasta will carry over everything reported in the sources to the final output. That means that all possible ranks may be included. You could use the option --summarise-at, which will report only a selected rank and sum up relative abundances below that rank. It ignores everything reported above the chosen rank.
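
To make the idea of summarising at a rank concrete, here is a toy sketch of the behaviour described above. It is not taxpasta's implementation. The taxonomy, the taxon names, and the helper functions are all invented for illustration, and it assumes that each value is the amount assigned directly to that taxon (not a cumulative clade total, which would be double-counted by this approach).

```python
from collections import defaultdict

# Hypothetical toy taxonomy: taxon -> (parent, rank). Not taxpasta's data model.
TAXONOMY = {
    "root": (None, "no rank"),
    "Bacteria": ("root", "superkingdom"),
    "Escherichia": ("Bacteria", "genus"),
    "E. coli": ("Escherichia", "species"),
    "E. coli K-12": ("E. coli", "strain"),
    "Salmonella": ("Bacteria", "genus"),
    "S. enterica": ("Salmonella", "species"),
}


def ancestor_at_rank(taxon, rank):
    """Walk up the lineage; return the taxon or ancestor that has `rank`, else None."""
    current = taxon
    while current is not None:
        parent, current_rank = TAXONOMY[current]
        if current_rank == rank:
            return current
        current = parent
    return None


def summarise_at(counts, rank):
    """Sum values at or below `rank` into the taxon at that rank; drop everything above."""
    totals = defaultdict(int)
    for taxon, count in counts.items():
        target = ancestor_at_rank(taxon, rank)
        if target is not None:  # taxa above the chosen rank have no such ancestor
            totals[target] += count
    return dict(totals)


counts = {"Bacteria": 100, "Escherichia": 5, "E. coli": 40, "E. coli K-12": 10, "S. enterica": 7}
print(summarise_at(counts, "species"))  # {'E. coli': 50, 'S. enterica': 7}
```

In this toy example the strain-level value is folded into its species, while the values assigned only at genus or superkingdom level are dropped, which mirrors the "ignores everything reported above the chosen rank" behaviour.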

emilyvansyoc commented 1 year ago

Ah, thank you! I can use that to add taxonomy and import into a BIOM table. It's a great resource! It may take me a bit more time to figure out the right flags to get taxonomy headers and summarize properly. The tough part is that I'm generally interested in the lowest taxonomic classification for a particular taxon... which is a huge challenge given the general Kraken output.

Anyways - thanks so much for your quick responses and for creating a great resource!

Midnighter commented 10 months ago

I'm considering providing a small R package as well, that makes it quick to import taxprofiler output and generate a phyloseq or TreeSummarizedExperiment object.

For now, I'm closing this as the main issue looks resolved to me.