taxprofiler / taxpasta

TAXnomic Profile Aggregation and STAndardisation
https://taxpasta.readthedocs.io/
Apache License 2.0

[BUG] tax entries with `no rank` failed `--add-lineage`, `--add-id-lineage` or `--add-rank-lineage` #141

Closed MajoroMask closed 6 months ago

MajoroMask commented 7 months ago

Is there an existing issue for this?

Problem description

Hi guys. I'm running taxpasta with the --add-lineage, --add-id-lineage and --add-rank-lineage options enabled, and the result looks like it may be buggy.

As far as I know, if any of --add-lineage, --add-id-lineage or --add-rank-lineage is given when calling taxpasta, the corresponding columns lineage, id_lineage and rank_lineage should be built from the columns name, taxonomy_id and rank, respectively, as shown below (part of my example data):

| taxonomy_id | name | rank | lineage | id_lineage | rank_lineage |
|---|---|---|---|---|---|
| 0 | | | | | |
| 1 | root | no rank | | | |
| 10239 | Viruses | superkingdom | Viruses | 10239 | superkingdom |
| 2731341 | Duplodnaviria | clade | Viruses;Duplodnaviria | 10239;2731341 | superkingdom;clade |
| 2731360 | Heunggongvirae | kingdom | Viruses;Duplodnaviria;Heunggongvirae | 10239;2731341;2731360 | superkingdom;clade;kingdom |
| 2731618 | Uroviricota | phylum | Viruses;Duplodnaviria;Heunggongvirae;Uroviricota | 10239;2731341;2731360;2731618 | superkingdom;clade;kingdom;phylum |
| 2731619 | Caudoviricetes | class | Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes | 10239;2731341;2731360;2731618;2731619 | superkingdom;clade;kingdom;phylum;class |
| 1978007 | Crassvirales | order | Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Crassvirales | 10239;2731341;2731360;2731618;2731619;1978007 | superkingdom;clade;kingdom;phylum;class;order |

But if an entry's rank is no rank (including root, taxonomy ID 1), that entry is missing from the lineage of its child taxa. See how unclassified Caudoviricetes (2788787) is missing from the entry for Bifidobacterium phage BD811P2 (2968613, the last line) in the following example:

| taxonomy_id | name | rank | lineage | id_lineage | rank_lineage |
|---|---|---|---|---|---|
| 0 | | | | | |
| 1 | root | no rank | | | |
| 10239 | Viruses | superkingdom | Viruses | 10239 | superkingdom |
| 2731341 | Duplodnaviria | clade | Viruses;Duplodnaviria | 10239;2731341 | superkingdom;clade |
| 2731360 | Heunggongvirae | kingdom | Viruses;Duplodnaviria;Heunggongvirae | 10239;2731341;2731360 | superkingdom;clade;kingdom |
| 2731618 | Uroviricota | phylum | Viruses;Duplodnaviria;Heunggongvirae;Uroviricota | 10239;2731341;2731360;2731618 | superkingdom;clade;kingdom;phylum |
| 2731619 | Caudoviricetes | class | Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes | 10239;2731341;2731360;2731618;2731619 | superkingdom;clade;kingdom;phylum;class |
| 2788787 | unclassified Caudoviricetes | no rank | Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes | 10239;2731341;2731360;2731618;2731619 | superkingdom;clade;kingdom;phylum;class |
| 2968613 | Bifidobacterium phage BD811P2 | species | Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Bifidobacterium phage BD811P2 | 10239;2731341;2731360;2731618;2731619;2968613 | superkingdom;clade;kingdom;phylum;class;species |
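The pattern in the table above is consistent with ancestors ranked `no rank` being skipped when the three lineage strings are built. The following is only a minimal sketch of that observed behaviour, not taxpasta's actual code: the `CHAIN` list and the `ranked_lineage` helper are hypothetical names, and the values are copied from the table.

```python
# Hypothetical parent chain (root first), copied from the table above.
# Each entry is (taxonomy_id, name, rank).
CHAIN = [
    (1, "root", "no rank"),
    (10239, "Viruses", "superkingdom"),
    (2731341, "Duplodnaviria", "clade"),
    (2731360, "Heunggongvirae", "kingdom"),
    (2731618, "Uroviricota", "phylum"),
    (2731619, "Caudoviricetes", "class"),
    (2788787, "unclassified Caudoviricetes", "no rank"),
    (2968613, "Bifidobacterium phage BD811P2", "species"),
]


def ranked_lineage(chain):
    """Build the three lineage strings, skipping every 'no rank' node.

    This is an assumption about the behaviour seen in the output,
    not a copy of taxpasta's implementation.
    """
    kept = [node for node in chain if node[2] != "no rank"]
    return {
        "lineage": ";".join(name for _, name, _ in kept),
        "id_lineage": ";".join(str(tax_id) for tax_id, _, _ in kept),
        "rank_lineage": ";".join(rank for _, _, rank in kept),
    }


result = ranked_lineage(CHAIN)
for column, value in result.items():
    print(column, value, sep="\t")
```

With this filter, both root (1) and unclassified Caudoviricetes (2788787) drop out of the lineage of 2968613, which matches the last row of the table.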

If you search NCBI Taxonomy, you can see that unclassified Caudoviricetes is part of Bifidobacterium phage BD811P2's full lineage.

So I'm pretty sure this is a bug.

Code sample

Code run:

taxpasta merge \
    -p kraken2 -o merged.tsv --add-name --add-rank --add-lineage --add-id-lineage --add-rank-lineage \
    --taxonomy taxdump \
    x.kraken2.report.txt y.kraken2.report.txt

Traceback:

[05:53:09] WARNING  The merged profiles contained different taxa. Additional zeroes were introduced for missing taxa.  sample_handling_application.py:142
[05:53:12] INFO     Write result to 'merged.tsv'.

Environment

### Package Information

| Package | Version |
|:---------|--------:|
| taxpasta | 0.6.1 |

### Dependency Information

| Package | Version |
|:-----------------------------|------------:|
| bash-kernel | **missing** |
| biom-format | 2.1.15 |
| depinfo~ | **missing** |
| jupyter | **missing** |
| mkdocs-awesome-pages-plugin~ | **missing** |
| mkdocs-exclude~ | **missing** |
| mkdocs-material~ | **missing** |
| mkdocstrings[python]~ | **missing** |
| numpy~ | **missing** |
| odfpy | **missing** |
| openpyxl | **missing** |
| pandas~ | **missing** |
| pandera~ | **missing** |
| pre-commit | **missing** |
| pyarrow | 13.0.0 |
| rich | 13.6.0 |
| tabulate~ | **missing** |
| taxopy~ | **missing** |
| tox~ | **missing** |
| typer~ | **missing** |

### Build Tools Information

| Package | Version |
|:-----------|--------:|
| pip | 23.2.1 |
| setuptools | 68.2.2 |
| wheel | 0.41.2 |

### Platform Information

| | |
|:--------|-----------------------------:|
| Linux | 3.10.0-957.el7.x86_64-x86_64 |
| CPython | 3.11.6 |

Anything else?

Input files I'm using:

x.kraken2.report.txt y.kraken2.report.txt

Midnighter commented 7 months ago

This is a deliberate design choice. The idea is that the number of ranks and taxa in a lineage should match, so that you can create what phyloseq calls a taxtable, where the rank is the column header and each row corresponds to a given lineage.
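To illustrate why equal-length lineages matter, here is a minimal sketch of turning the `lineage` and `rank_lineage` columns into a rank-by-column table. It uses only the standard library; `ROWS` and `tax_table` are hypothetical names, and the two rows are taken from the example data in the report.

```python
# Two rows from the reported output, reduced to the columns needed here.
ROWS = [
    {
        "taxonomy_id": 2731619,
        "lineage": "Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes",
        "rank_lineage": "superkingdom;clade;kingdom;phylum;class",
    },
    {
        "taxonomy_id": 2968613,
        "lineage": (
            "Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;"
            "Caudoviricetes;Bifidobacterium phage BD811P2"
        ),
        "rank_lineage": "superkingdom;clade;kingdom;phylum;class;species",
    },
]


def tax_table(rows):
    """Map each taxonomy ID to a {rank: name} dict (one table row per taxon)."""
    table = {}
    for row in rows:
        names = row["lineage"].split(";")
        ranks = row["rank_lineage"].split(";")
        # The pairing below only works if both lineages have the same length.
        if len(names) != len(ranks):
            raise ValueError(f"lineage length mismatch for {row['taxonomy_id']}")
        table[row["taxonomy_id"]] = dict(zip(ranks, names))
    return table


table = tax_table(ROWS)
print(table[2968613]["species"])
```

If several `no rank` taxa were kept in one lineage, they would all map to the same `no rank` key in this sketch and overwrite each other, which is one way to see why such entries do not fit a table keyed by rank.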

Do you truly need all taxa in the lineage for a specific purpose or were you simply perplexed by their absence?

Midnighter commented 6 months ago

This issue needs information or other forms of answers from you. Please feel free to re-open the issue when you are able to provide them.