globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

Name matching against NCBI Taxonomy #32

Closed nleguillarme closed 3 years ago

nleguillarme commented 3 years ago

Hi @jhpoelen.

I have a lot of taxon names I'd like to match to the NCBI Taxonomy (because NCBI is actually the only taxonomy with an ontology representation: http://www.obofoundry.org/ontology/ncbitaxon.html).

One way to do that is to use Global Names Resolver. However, it seems that Global Names Resolver is not able to resolve taxon names tagged as synonyms in NCBI.

For instance, Holosticha manca is not resolved as a synonym of Anteholosticha manca: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=385028

So I think it would be interesting to have a matcher that interacts directly with the NCBI Taxonomy for name matching, similar to this Python package: https://pypi.org/project/ncbi-taxonomist/

jhpoelen commented 3 years ago

hey @nleguillarme -

> I have a lot of taxon names I'd like to match to the NCBI Taxonomy (because NCBI is actually the only taxonomy with an ontology representation: http://www.obofoundry.org/ontology/ncbitaxon.html)

Shouldn't be too hard to do similar things with other taxonomies, but I can see that it would be easy to reuse an existing resource.

> However, it seems that Global Names Resolver is not able to resolve taxon names tagged as synonyms in NCBI.

Did you consider contacting the Global Names folks about this? (e.g., @dima)

> For instance, Holosticha manca is not resolved as a synonym of Anteholosticha manca: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=385028

> So I think it would be interesting to have a matcher that directly interacts with the NCBI Taxonomy for name matching, similar to this python package: https://pypi.org/project/ncbi-taxonomist/

I can see how it would be nice to have a fast NCBI name matcher with offline support. A quick glance at the NCBI data tells me that:

.../ncbi-taxa$ cat names.dmp | grep -E "[a-zA-Z]*[ ]+manca"
385028  |   Anteholosticha manca (Kahl, 1932) Berger, 2003  |       |   authority   |
385028  |   Anteholosticha manca    |       |   scientific name |
385028  |   Holosticha manca Kahl, 1932 |       |   authority   |
385028  |   Holosticha manca    |       |   synonym |

Because nomer already supports offline matching of NCBI taxa by id, support for matching by (exact) name / synonym can also be added. Would you use that?
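To make the idea concrete, here is a minimal Python sketch of exact name/synonym lookup over `names.dmp` rows. It is not nomer's implementation: the function names `parse_names` and `match_name` are hypothetical, the sample rows are the four shown in the grep output above, and it assumes the usual `names.dmp` layout in which fields are separated by tab, pipe, tab. The `SAME_AS` / `SYNONYM_OF` labels are used here only for illustration.

```python
from collections import defaultdict

# Sample rows copied from the grep output above. Assumption: fields are
# separated by "\t|\t" and each row ends with "\t|".
SAMPLE = (
    "385028\t|\tAnteholosticha manca (Kahl, 1932) Berger, 2003\t|\t\t|\tauthority\t|\n"
    "385028\t|\tAnteholosticha manca\t|\t\t|\tscientific name\t|\n"
    "385028\t|\tHolosticha manca Kahl, 1932\t|\t\t|\tauthority\t|\n"
    "385028\t|\tHolosticha manca\t|\t\t|\tsynonym\t|\n"
)


def parse_names(lines):
    """Return (name -> [(tax_id, name_class)], tax_id -> scientific name)."""
    by_name = defaultdict(list)
    scientific = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.endswith("\t|"):
            line = line[:-2]  # drop the row terminator
        tax_id, name, _unique_name, name_class = line.split("\t|\t")[:4]
        by_name[name].append((tax_id, name_class))
        if name_class == "scientific name":
            scientific[tax_id] = name
    return by_name, scientific


def match_name(name, by_name, scientific):
    """Exact match on scientific names and synonyms only (hypothetical helper)."""
    relations = {"scientific name": "SAME_AS", "synonym": "SYNONYM_OF"}
    return [
        (relations[name_class], tax_id, scientific.get(tax_id))
        for tax_id, name_class in by_name.get(name, [])
        if name_class in relations
    ]


BY_NAME, SCIENTIFIC = parse_names(SAMPLE.splitlines())
print(match_name("Holosticha manca", BY_NAME, SCIENTIFIC))
```

For a full dump, the same functions could be fed an open file handle on `names.dmp` instead of the sample string; names with no exact match simply return an empty list.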

nleguillarme commented 3 years ago

> Shouldn't be too hard to do similar things with other taxonomies, but I can see that it would be easy to reuse an existing resource.

I agree with you, and this is something I may consider in the future, e.g. exporting the GBIF Backbone taxonomy as an ontology.

> Did you consider contacting the Global Names folks about this? (e.g., @dima)

Well, I checked the GitHub repo of Global Names Resolver: the last commit was 4 years ago, so I was wondering whether the project is still alive...

> Because nomer already supports offline matching of ncbi taxa by id, support for matching by (exact) name / synonyms can also be added. Would you use that?

I would absolutely use that!

jhpoelen commented 3 years ago

@nleguillarme I've implemented a first version of offline-enabled id/name/synonym matching against the NCBI taxonomy.

$ echo -e "\tAriopsis felis\n\tHolosticha manca" | nomer append ncbi-taxon
    Ariopsis felis  SAME_AS NCBI:75286  Ariopsis felis  species     root | cellular organisms | Eukaryota | Opisthokonta | Metazoa | Eumetazoa | Bilateria | Deuterostomia | Chordata | Craniata | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Actinopterygii | Actinopteri | Neopterygii | Teleostei | Osteoglossocephalai | Clupeocephala | Otomorpha | Ostariophysi | Otophysi | Characiphysae | Siluriformes | Siluroidei | Ariidae | Ariopsis | Ariopsis felis    NCBI:1 | NCBI:131567 | NCBI:2759 | NCBI:33154 | NCBI:33208 | NCBI:6072 | NCBI:33213 | NCBI:33511 | NCBI:7711 | NCBI:89593 | NCBI:7742 | NCBI:7776 | NCBI:117570 | NCBI:117571 | NCBI:7898 | NCBI:186623 | NCBI:41665 | NCBI:32443 | NCBI:1489341 | NCBI:186625 | NCBI:186634 | NCBI:32519 | NCBI:186626 | NCBI:186628 | NCBI:7995 | NCBI:1489793 | NCBI:31017 | NCBI:243723 | NCBI:75286    |  | superkingdom | clade | kingdom | clade | clade | clade | phylum | subphylum | clade | clade | clade | clade | superclass | class | subclass | infraclass | clade |  | cohort | subcohort | clade | superorder | order | suborder | family | genus | species    https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=75286   
    Holosticha manca    SYNONYM_OF  NCBI:385028 Anteholosticha manca    species     root | cellular organisms | Eukaryota | Sar | Alveolata | Ciliophora | Intramacronucleata | Spirotrichea | Stichotrichia | Urostylida | Holostichidae | Anteholosticha | Anteholosticha manca   NCBI:1 | NCBI:131567 | NCBI:2759 | NCBI:2698737 | NCBI:33630 | NCBI:5878 | NCBI:431838 | NCBI:33829 | NCBI:194286 | NCBI:486728 | NCBI:578128 | NCBI:584654 | NCBI:385028   |  | superkingdom | clade | clade | phylum | subphylum | class | subclass | order | family | genus | species    https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=385028  
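If you want to post-process rows like the two above, a small Python sketch along these lines may help. It assumes the output is tab-separated and relies only on the first five columns as they appear in this example (provided id, provided name, relation, resolved id, resolved name); the field labels and the `parse_row` helper are hypothetical, and the sample row is shortened to those leading columns plus the rank.

```python
# Labels for the leading columns, as read off the example output above.
# Assumption: columns are tab-separated; later columns are ignored here.
FIELDS = ("provided_id", "provided_name", "relation", "resolved_id", "resolved_name")

# Shortened version of the second output row (leading columns only).
SAMPLE_ROW = "\tHolosticha manca\tSYNONYM_OF\tNCBI:385028\tAnteholosticha manca\tspecies"


def parse_row(line):
    """Map the leading cells of one output row onto FIELDS (hypothetical helper)."""
    cells = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, cells))


row = parse_row(SAMPLE_ROW)
if row["relation"] == "SYNONYM_OF":
    print(f'{row["provided_name"]} is a synonym of {row["resolved_name"]} ({row["resolved_id"]})')
```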

Note that the first run will be slow, because it'll download and index a new copy of the NCBI taxonomy as configured.

Also, if you have an existing NCBI cache, please use nomer clean to clear out the old local index first.

I'll work on publishing a new release with this new matcher in it. Thanks for being patient.

jhpoelen commented 3 years ago

The recently created Nomer release https://github.com/globalbioticinteractions/nomer/releases/tag/0.1.24 contains a first pass at the NCBI name/synonym matching you described.

Curious to hear your comments on the new functionality.

nleguillarme commented 3 years ago

It works perfectly.

Converting GBIF taxa to NCBI taxa is not trivial. I now make a first pass with wikidata-taxon-id-web, then try to match on names using globi-taxon-cache, then ncbi-taxon. The synonym information is useful for matching a few more names.
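For readers who want to reproduce this kind of multi-pass matching, here is a simplified Python sketch of the cascade. It does not call nomer: the matchers are hypothetical stand-ins backed by toy lookup tables (which stand-in resolves which name is made up for illustration; only the two NCBI ids come from this thread), and the real first pass works on identifiers rather than names.

```python
def lookup_in(table):
    """Build a stand-in matcher from a toy name -> id table (illustration only)."""
    return lambda names: {name: table[name] for name in names if name in table}


def cascade(names, matchers):
    """Try each (label, matcher) in order; later matchers only see unresolved names."""
    resolved = {}
    pending = list(names)
    for label, matcher in matchers:
        for name, taxon_id in matcher(pending).items():
            resolved[name] = (label, taxon_id)
        pending = [name for name in pending if name not in resolved]
    return resolved, pending


# Toy tables: in practice each stand-in would wrap a run of the named matcher.
MATCHERS = [
    ("wikidata-taxon-id-web", lookup_in({})),
    ("globi-taxon-cache", lookup_in({"Ariopsis felis": "NCBI:75286"})),
    ("ncbi-taxon", lookup_in({"Holosticha manca": "NCBI:385028"})),
]

resolved, unresolved = cascade(
    ["Ariopsis felis", "Holosticha manca", "not a taxon name"], MATCHERS
)
print(resolved)
print(unresolved)
```

Recording which pass produced each match makes it easy to see how many extra names the final synonym-aware pass contributes.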

Thank you for your help and your responsiveness, as always.