globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

NCBI taxonomy reports equivalence between [Candidatus Endoriftia persephone] and [Endoriftia persephone] but Nomer's NCBI matcher does not #180

Closed jhpoelen closed 4 months ago

jhpoelen commented 4 months ago

The NCBI Taxonomy Browser entry https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=393765&lvl=3&lin=f&keep=1&srchmode=1&unlock reports equivalence between [Candidatus Endoriftia persephone] and [Endoriftia persephone], but Nomer's NCBI matcher does not. Matching the Candidatus name via

echo -e "\tCandidatus Endoriftia persephone"\
 | nomer append --include-header ncbi\
 | mlr --itsvlite --oxtab cat

yields

providedExternalId      
providedName            Candidatus Endoriftia persephone
relationName            SAME_AS
resolvedExternalId      NCBI:393765
resolvedName            Candidatus Endoriftia persephone
resolvedAuthorship      
resolvedRank            species
resolvedCommonNames     
resolvedPath            root | cellular organisms | Bacteria | Proteobacteria | Gammaproteobacteria | Gammaproteobacteria incertae sedis | sulfur-oxidizing symbionts | Candidatus Endoriftia | Candidatus Endoriftia persephone
resolvedPathIds         NCBI:1 | NCBI:131567 | NCBI:2 | NCBI:1224 | NCBI:1236 | NCBI:118884 | NCBI:32036 | NCBI:393764 | NCBI:393765
resolvedPathNames       |  | superkingdom | phylum | class |  | clade | genus | species
resolvedPathAuthorships |  |  | [class] Stackebrandt et al. 1988 | Garrity et al. 2005 emend. Williams and Kelly 2013 |  |  |  |
resolvedExternalUrl     https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=393765

but ...

echo -e "\tEndoriftia persephone"\
 | nomer append ncbi

unexpectedly reports no match.

    Endoriftia persephone   NONE        Endoriftia persephone   
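
To list every name that fails to match in a larger run, the NONE rows can be filtered on the relation column. This is a minimal sketch, assuming headerless tab-separated output with the columns in the order shown above (providedExternalId, providedName, relationName, ...). The function name unmatched_names is made up for illustration and is not part of nomer, and the sample rows are abbreviated to four columns.

```shell
# Print providedName (column 2) for rows whose relationName (column 3) is NONE.
# Reads nomer-style TSV from stdin; hypothetical helper, not a nomer command.
unmatched_names() {
  awk -F'\t' '$3 == "NONE" { print $2 }'
}

# Sample rows standing in for real nomer output (abbreviated to four columns).
printf '\tEndoriftia persephone\tNONE\t\n\tCandidatus Endoriftia persephone\tSAME_AS\tNCBI:393765\n' \
  | unmatched_names
```

In practice the printf would be replaced by the output of a nomer append run without --include-header.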

related to https://github.com/globalbioticinteractions/globalbioticinteractions/issues/968 @kbseah

jhpoelen commented 4 months ago

The Nomer Corpus of Taxonomic Resources associated with Nomer v0.5.10 (the current version) is:

Poelen, J. H. (ed.). (2024). Nomer Corpus of Taxonomic Resources hash://sha256/3361f03229301a339b86779df0d74ed9ab564b1ef98dda4556ed0a0cafc28700 hash://md5/970d771ac2ff45e42a30b5cf88bf6a1b (0.25) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.12117955

and the most recent copy of NCBI taxonomy was captured on 2022-09-09T20:06:13.047Z with signature hash://sha256/30364d6dd82332e7da3aae6ce5c36a56de5e7d62f28c4490623f0c4cdd7875f6 via https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz because

preston ls --anchor hash://sha256/3361f03229301a339b86779df0d74ed9ab564b1ef98dda4556ed0a0cafc28700 --remote https://linker.bio,https://zenodo.org/records/12117955/files,https://zenodo.org/records/11105453/files/,https://zenodo.org/records/10045382/files/,https://zenodo.org/records/10037817/files/,https://zenodo.org/records/8327611/files/,https://zenodo.org/records/10044989/files/ | grep --before 10  "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz" | grep -P "202[0-9]-[0-9]{2}-[0-9]{2}" | head -1

produced

<urn:uuid:6f2405cb-b26d-4043-8c9a-29bdccaee705> <http://www.w3.org/ns/prov#generatedAtTime> "2022-09-09T20:06:13.047Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:6f2405cb-b26d-4043-8c9a-29bdccaee705> .

with

<https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz> <http://purl.org/pav/hasVersion> <hash://sha256/30364d6dd82332e7da3aae6ce5c36a56de5e7d62f28c4490623f0c4cdd7875f6> <urn:uuid:6d92c3d3-5a7f-4597-b639-cee8995c1cea> .

jhpoelen commented 4 months ago

Inspecting the version of the NCBI taxonomic resource using

preston cat --anchor hash://sha256/3361f03229301a339b86779df0d74ed9ab564b1ef98dda4556ed0a0cafc28700 --remote https://linker.bio,https://zenodo.org/records/12117955/files,https://zenodo.org/records/11105453/files/,https://zenodo.org/records/10045382/files/,https://zenodo.org/records/10037817/files/,https://zenodo.org/records/8327611/files/,https://zenodo.org/records/10044989/files/ 'tar:gz:hash://sha256/30364d6dd82332e7da3aae6ce5c36a56de5e7d62f28c4490623f0c4cdd7875f6!/ncbi.ncbi!/names.dmp' | grep "Endoriftia persephone"

produced

393765  |   "Candidatus Endoriftia persephone" Robidart et al. 2008 |       |   authority   |
393765  |   Candidatus Endoriftia persephone    |       |   scientific name |
393765  |   Endoriftia persephone   |       |   equivalent name |
394104  |   Candidatus Endoriftia persephone str. Hot96_1+Hot96_2   |       |   scientific name |
394104  |   Endoriftia persephone 'Hot96_1+Hot96_2' |       |   equivalent name |
910259  |   Candidatus Endoriftia persephone str. Guaymas   |       |   scientific name |
910259  |   Endoriftia persephone 'Guaymas' |       |   synonym |
910259  |   Endoriftia persephone str. Guaymas  |       |   synonym |

which indicates that the 2022 copy of the NCBI taxonomy already contained the "equivalent name" relation between [Endoriftia persephone] and NCBI:393765.
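
The same check can be made per name rather than by grep. This is a rough sketch, assuming the raw names.dmp layout from the NCBI taxdump, where fields are separated by tab, pipe, tab and each row ends with tab, pipe (the listing above shows that layout with whitespace collapsed). The helper name ncbi_name_lookup is hypothetical.

```shell
# Print "<tax_id><TAB><name_class>" for every names.dmp row whose name
# exactly equals the requested name. Reads names.dmp rows from stdin.
# Hypothetical helper for illustration only.
ncbi_name_lookup() {
  awk -v name="$1" '
    BEGIN { FS = "\t\\|\t" }     # taxdump field separator: tab, pipe, tab
    {
      sub(/\t\|$/, "", $0)       # drop the row terminator (tab, pipe); fields are re-split
      if ($2 == name) print $1 "\t" $4
    }'
}

# Two rows copied from the listing above, written in the raw taxdump layout.
printf '393765\t|\tCandidatus Endoriftia persephone\t|\t\t|\tscientific name\t|\n393765\t|\tEndoriftia persephone\t|\t\t|\tequivalent name\t|\n' \
  | ncbi_name_lookup "Endoriftia persephone"
```

Unlike the grep above, an exact comparison on the name field skips the strain-level rows (394104, 910259) that merely contain the string.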

jhpoelen commented 4 months ago

After adding support for NCBI "equivalent name" relations, the following result was obtained using

echo -e "\tEndoriftia persephone"\
 | nomer append --include-header ncbi\
 | mlr --itsvlite --oxtab cat

yielding:

providedExternalId      
providedName            Endoriftia persephone
relationName            SYNONYM_OF
resolvedExternalId      NCBI:393765
resolvedName            Candidatus Endoriftia persephone
resolvedAuthorship      
resolvedRank            species
resolvedCommonNames     
resolvedPath            root | cellular organisms | Bacteria | Proteobacteria | Gammaproteobacteria | Gammaproteobacteria incertae sedis | sulfur-oxidizing symbionts | Candidatus Endoriftia | Candidatus Endoriftia persephone
resolvedPathIds         NCBI:1 | NCBI:131567 | NCBI:2 | NCBI:1224 | NCBI:1236 | NCBI:118884 | NCBI:32036 | NCBI:393764 | NCBI:393765
resolvedPathNames       |  | superkingdom | phylum | class |  | clade | genus | species
resolvedPathAuthorships |  |  | [class] Stackebrandt et al. 1988 | Garrity et al. 2005 emend. Williams and Kelly 2013 |  |  |  |
resolvedExternalUrl     https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=393765

For now, the NCBI "equivalent name" relation is translated into SYNONYM_OF until someone proposes a more suitable relation.
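
The translation can be pictured as a small lookup from NCBI name class to relation name. The sketch below is not Nomer's implementation; it only restates the two cases visible in this thread ("scientific name" resolving as SAME_AS and the equivalent name resolving as SYNONYM_OF), and the function name ncbi_class_to_relation is hypothetical.

```shell
# Map an NCBI names.dmp name class to a Nomer-style relation name.
# Hypothetical helper for illustration; not Nomer's actual code.
ncbi_class_to_relation() {
  case "$1" in
    "scientific name") echo "SAME_AS" ;;     # as in the Candidatus match earlier in this thread
    "equivalent name") echo "SYNONYM_OF" ;;  # interim translation described in this comment
    *) return 1 ;;                           # other name classes are not covered by this sketch
  esac
}

ncbi_class_to_relation "equivalent name"
```

If a more suitable relation is proposed later, only the "equivalent name" branch of such a mapping would need to change.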