shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

Filter of ranks without order #97

Open alvanuffelen opened 2 months ago

alvanuffelen commented 2 months ago

Describe your issue

In the documentation, it mentions:

  1. Ranks without order should be assigned a prefix symbol "!" for each rank.

This means !no rank and !clade are defined as ranks without order.

The documentation also states:

  5. TaxIDs with no rank are kept by default!!! They can be optionally discarded by -N/--discard-noranks.

The following taxid has the rank clade:

  1783270  cellular organisms;Bacteria;FCB group  FCB group  clade

As expected, the taxid is not filtered out by the following command:

  echo 1783270 | ./taxonkit filter -L species

However, why is it filtered out when using -H?

  echo 1783270 | ./taxonkit filter -H species

Based on point 5 in the documentation, TaxIDs with no rank are kept by default, so I would expect them to be kept with both -L and -H.

shenwei356 commented 2 months ago

Thanks for reporting this. It's fixed.

alvanuffelen commented 2 months ago

Thank you!

Would it also be possible to implement the -n feature in combination with -H?

  echo 2605619 | taxonkit filter -H genus

The line above prints the taxid because it has no rank. I would like to run

  echo 2605619 | taxonkit filter -H genus -n

so that the taxid gets filtered out (not printed), because its closest higher node is 'species', which is still lower than genus.

Additionally, the help page could be clearer:

-n, --save-predictable-norank do not discard some special ranks without order when using -L, where rank of the closest higher node is still lower than rank cutoff

The taxid is kept not only when the rank of the closest higher node is lower than the rank cutoff, but also when it is equal to the cutoff. For example:

  echo 2605619 | taxonkit filter -L species

This taxid gets printed because its closest higher rank is 'species', which is equal to the cutoff.

shenwei356 commented 2 months ago

echo 2605619 | taxonkit filter -H genus -N does filter out the taxid.


You're right. I'll update the doc.

-n, --save-predictable-norank do not discard some special ranks without order when using -L, where rank of the closest higher node is equal to or lower than the rank cutoff
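
Read as a rule, the corrected wording amounts to one comparison. Below is a minimal Python sketch of that reading, not TaxonKit's actual code; the rank list and the function name `is_predictably_lower` are made up for illustration.

```python
# Toy rank order, highest first. This is an assumption for illustration;
# the real tool reads its own rank definitions.
RANK_ORDER = ["superkingdom", "phylum", "class", "order",
              "family", "genus", "species", "subspecies"]
LEVEL = {rank: i for i, rank in enumerate(RANK_ORDER)}  # larger index = lower rank


def is_predictably_lower(closest_higher_rank: str, cutoff: str) -> bool:
    """Return True if a no-rank node must sit below `cutoff`.

    The node lies strictly below its closest ranked ancestor, so it is
    certainly below the cutoff when that ancestor is at or below the cutoff.
    """
    return LEVEL[closest_higher_rank] >= LEVEL[cutoff]


print(is_predictably_lower("species", "species"))     # ancestor equal to the cutoff
print(is_predictably_lower("subspecies", "species"))  # ancestor lower than the cutoff
print(is_predictably_lower("genus", "species"))       # ancestor higher: not predictable
```

The "equal to" case is the one the old help text missed: a no-rank node directly under a species is already below the species cutoff.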

alvanuffelen commented 2 months ago

Indeed, -N will discard all ranks without order. But let's say I have the taxids 93506 (higher rank than genus) and 2605619 (lower rank than genus), both with no rank. There is no way to retain only the taxid with a rank higher than genus.

This removes both:

  echo -e "93506\n2605619" | taxonkit filter -H genus -N

This retains both:

  echo -e "93506\n2605619" | taxonkit filter -H genus

It would be useful to have something like

  echo -e "93506\n2605619" | taxonkit filter -H genus -n

which would remove 2605619 but keep 93506.

shenwei356 commented 2 months ago

Oh, I remember now. I've considered this before but did not implement it because the logic is different for -L and -H.

I understand what you mean. But I think we should add another flag, --discard-predictable-norank, which only discards those no-rank TaxIDs (such as 2605619) that cannot be higher than the threshold.

--discard-predictable-norank should be incompatible with -N and -n.

  -N, --discard-noranks           discard all ranks without order, type "taxonkit filter --help" for details
  -n, --save-predictable-norank   do not discard some special ranks without order when using -L, where
                                  rank of the closest higher node is still lower than rank cutoff

  -Z, --discard-predictable-norank

  echo -e "93506\n2605619" | taxonkit filter -H genus -Z
  93506
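
The intended -H behavior can be sketched on a toy tree. This is only an illustration of the rule "discard no-rank nodes that cannot be higher than the threshold", not the implementation; the node names, the tree, and the `discard_predictable` parameter are all hypothetical.

```python
# Toy rank order, highest first (an assumption for illustration).
RANK_ORDER = ["family", "genus", "species"]
LEVEL = {rank: i for i, rank in enumerate(RANK_ORDER)}  # larger index = lower rank

# Hypothetical tree: node -> (parent, rank). "no rank" is a rank without order.
NODES = {
    "fam1": (None, "family"),
    "nr_hi": ("fam1", "no rank"),  # unranked node directly under a family
    "gen1": ("nr_hi", "genus"),
    "sp1": ("gen1", "species"),
    "nr_lo": ("sp1", "no rank"),   # unranked node under a species
}


def closest_ranked_ancestor(node):
    """Walk up the tree and return the rank of the first ordered ancestor."""
    parent = NODES[node][0]
    while parent is not None and NODES[parent][1] not in LEVEL:
        parent = NODES[parent][0]
    return None if parent is None else NODES[parent][1]


def keep_higher_than(node, cutoff, discard_predictable=False):
    """Mimic '-H cutoff'; discard_predictable stands in for the proposed -Z."""
    rank = NODES[node][1]
    if rank in LEVEL:
        return LEVEL[rank] < LEVEL[cutoff]  # strictly higher than the cutoff
    if not discard_predictable:
        return True  # no-rank nodes are kept by default
    ancestor = closest_ranked_ancestor(node)
    # An ancestor at or below the cutoff means the node cannot be higher than it.
    return ancestor is None or LEVEL[ancestor] < LEVEL[cutoff]


print([n for n in NODES if keep_higher_than(n, "genus")])
print([n for n in NODES if keep_higher_than(n, "genus", discard_predictable=True)])
```

In this toy tree, `nr_lo` plays the role of 2605619 (closest higher node is a species) and is dropped only when the flag is set, while `nr_hi` is kept either way because its position relative to genus cannot be ruled out from its ancestor alone.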