shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

[Feature Request] Fuzzy name searching with name2taxid #88

Open jolespin opened 8 months ago

jolespin commented 8 months ago

Describe your issue

I would like fuzzy name matching in name2taxid, similar to get_fuzzy_name_translation in ete: https://github.com/etetoolkit/ete/blob/1582ea2aa0d28065f4757b8b5af74367f6abe19f/ete4/ncbi_taxonomy/ncbiquery.py#L112C30-L112C30

    def get_fuzzy_name_translation(self, name, sim=0.9):
        """Return taxid, species name and match score from the NCBI database.

        The results are for the best match for name in the NCBI
        database of taxa names, with a word similarity >= `sim`.

        :param name: Species name (does not need to be exact).
        :param 0.9 sim: Min word similarity to report a match (from 0 to 1).
        """

For example, EukZoo has an annotation for source organism ID AddRef0031, labeled as species Paramecium tetraurelia, strain Stock d4-2. A manual search turns up https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=412030, whose name lacks the word "Stock", even though the two are certainly the same organism.

Would it be possible to include this type of searching?
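
To make the request concrete, here is a minimal sketch of the kind of lookup I have in mind. It uses Python's standard difflib (character-level similarity) rather than whatever ete uses internally, and the function name `fuzzy_name2taxid`, the toy `NAMES` table, and the `top_n` parameter are made up for illustration. Only taxid 412030 comes from the example above; the other entries are placeholders.

```python
import difflib

# Toy stand-in for names.dmp (name -> taxid). Only 412030 is taken from the
# issue text; None marks placeholder entries with no real taxid attached.
NAMES = {
    "Paramecium tetraurelia strain d4-2": 412030,
    "Paramecium tetraurelia": None,
    "Paramecium caudatum": None,
}


def fuzzy_name2taxid(query, names=NAMES, sim=0.9, top_n=1):
    """Return up to top_n (taxid, matched_name, score) tuples with score >= sim."""
    q = query.lower()
    scored = []
    for name, taxid in names.items():
        # Character-level similarity in [0, 1]; case is ignored.
        score = difflib.SequenceMatcher(None, name.lower(), q).ratio()
        if score >= sim:
            scored.append((score, name, taxid))
    # Best score first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(taxid, name, score) for score, name, taxid in scored[:top_n]]


if __name__ == "__main__":
    for hit in fuzzy_name2taxid("Paramecium tetraurelia strain Stock d4-2"):
        print(hit)
```

A linear scan like this would be far too slow for millions of names, so a real implementation would presumably need some kind of index.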

fgvieira commented 3 months ago

This would be quite helpful!

shenwei356 commented 3 months ago

Oh, I missed this issue before. There are some existing packages I can use.

shenwei356 commented 3 weeks ago

Implemented with https://github.com/suggest-go/suggest/. The library supports writing the index to a file, but I have not implemented that yet. So right now the index is in memory only and has to be rebuilt on every run, which is slow.
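
For readers unfamiliar with the trade-off, the build-once-then-reload idea can be sketched generically as below. This is Python, not the Go code in taxonkit, and it does not reflect the suggest library's actual index format; `build_trigram_index`, `load_or_build_index`, and the cache path are hypothetical names for illustration. It also never invalidates the cache, so it would go stale when names.dmp is updated.

```python
import os
import pickle
from collections import defaultdict


def build_trigram_index(names):
    """Map each character trigram to the set of positions of names containing it."""
    index = defaultdict(set)
    for i, name in enumerate(names):
        padded = f"  {name.lower()} "  # pad so short names still yield trigrams
        for j in range(len(padded) - 2):
            index[padded[j:j + 3]].add(i)
    return dict(index)


def load_or_build_index(names, cache_path):
    """Load the index from cache_path if it exists, otherwise build and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    index = build_trigram_index(names)
    with open(cache_path, "wb") as fh:
        pickle.dump(index, fh)
    return index
```

The first call pays the full build cost; later calls only pay for reading the file.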

Fuzzy match:

memusg -t -s "echo Paramecium tetraurelia strain Stock d4-2 | taxonkit name2taxid -f  --verbose | taxonkit lineage -L -nr -i 2"
11:52:09.824 [INFO] parsing names file: /home/shenwei/.taxonkit/names.dmp
11:52:13.027 [INFO] 3942782 names parsed
11:52:13.027 [INFO] creating indexing for name searching ...
11:52:47.166 [INFO] indexing finished
Paramecium tetraurelia strain Stock d4-2        412030  Paramecium tetraurelia strain d4-2      strain

elapsed time: 37.530s
peak rss: 3.6 GB

Exact match:

memusg -t -s "echo Paramecium tetraurelia strain Stock d4-2 | ./taxonkit name2taxid  --verbose | taxonkit lineage -L -nr -i 2"
11:51:34.730 [INFO] parsing names file: /home/shenwei/.taxonkit/names.dmp
11:51:37.907 [INFO] 3942782 names parsed
Paramecium tetraurelia strain Stock d4-2

elapsed time: 3.328s
peak rss: 1.73 GB

Try it:

  -f, --fuzzy             allow fuzzy match
  -n, --fuzzy-top-n int   choose top n matches in fuzzy search (default 1)
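
If you want to call this from a script, a thin wrapper might look like the sketch below. It assumes a taxonkit build that includes the two flags above is on PATH; `build_cmd` and `fuzzy_lookup` are hypothetical helper names, and the output rows are returned as raw tab-split fields without assuming a particular column layout.

```python
import subprocess


def build_cmd(top_n=1, verbose=False):
    """Assemble the taxonkit command line using only the flags listed above."""
    cmd = ["taxonkit", "name2taxid", "--fuzzy", "--fuzzy-top-n", str(top_n)]
    if verbose:
        cmd.append("--verbose")
    return cmd


def fuzzy_lookup(names, top_n=1):
    """Feed names (one per line) to taxonkit and return tab-split output rows."""
    proc = subprocess.run(
        build_cmd(top_n),
        input="\n".join(names) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if taxonkit exits with an error
    )
    return [line.split("\t") for line in proc.stdout.splitlines()]


if __name__ == "__main__":
    for row in fuzzy_lookup(["Paramecium tetraurelia strain Stock d4-2"]):
        print(row)
```

Since the index is rebuilt on every run, passing all names in one call is much cheaper than invoking the tool once per name.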