globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

support offline-enabled wikidata taxon matcher #181

Closed jhpoelen closed 3 months ago

jhpoelen commented 3 months ago

as related to #146

jhpoelen commented 3 months ago

Currently, the Wikidata dump is about 83.5G, which is too large to fit into Zenodo.

Suggest including only items that reference Taxon (https://www.wikidata.org/wiki/Q16521).


jhpoelen commented 3 months ago

sketch of workflow -

#!/bin/bash
#
# streams Wikidata taxon items (or items containing https://www.wikidata.org/wiki/Q16521)
# from latest data dump in line json (one json object per line)
#

curl --silent "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2"\
| bunzip2\
| grep -E "Q16521[^0-9]"\
| sed 's/,$//g'\
| bzip2
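
To check the filter stage without downloading the full dump, the grep/sed pair can be tried on a few sample lines. This is a minimal sketch: `filter_taxon_lines` is a hypothetical helper name, and the sample records are hand-written stand-ins rather than real Wikidata items.

```shell
#!/bin/bash
# Hypothetical helper wrapping the filter stage of the workflow sketch:
# keep lines mentioning Q16521 (but not longer ids such as Q165210),
# then drop the trailing comma that separates items in the dump.
filter_taxon_lines() {
  grep -E "Q16521[^0-9]" | sed 's/,$//'
}

# Hand-written stand-ins for dump lines (not real Wikidata records).
printf '%s\n' \
  '[' \
  '{"id":"Q1","claims":{"P31":[{"id":"Q16521"}]}},' \
  '{"id":"Q2","claims":{"P31":[{"id":"Q165210"}]}},' \
  '{"id":"Q3","claims":{"P31":[{"id":"Q5"}]}}' \
  ']' \
| filter_taxon_lines
# prints only the Q1 line, without its trailing comma
```

The `[^0-9]` guard is what keeps `Q165210` out. Note that the filter matches any line mentioning Q16521, so it may also keep items that refer to Taxon without being taxa themselves.
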
jhpoelen commented 3 months ago

hey @daniel-mietchen

Would you happen to know how to translate a Wikimedia Commons URL like

https://commons.wikimedia.org/wiki/File:002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg

into a link that renders the JPG directly?

PS: I've dropped indexing of the Wikidata taxon images until we develop a method to point to an image (or image-rendering link) directly.
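
One possible approach, not confirmed in this thread, is MediaWiki's Special:FilePath page, which redirects to the file itself when given a file name. The sketch below only rewrites the URL string and makes no network request; `commons_file_url_to_image_url` is a hypothetical helper name, and it assumes the input has the `/wiki/File:<name>` shape shown above.

```shell
#!/bin/bash
# Hypothetical helper: rewrite a Commons "File:" page URL into a
# Special:FilePath URL, which is expected to redirect to the image itself.
commons_file_url_to_image_url() {
  # Keep everything after the "/wiki/File:" prefix, i.e. the file name.
  file_name="${1##*/wiki/File:}"
  printf 'https://commons.wikimedia.org/wiki/Special:FilePath/%s\n' "$file_name"
}

commons_file_url_to_image_url \
  "https://commons.wikimedia.org/wiki/File:002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"
```

Whether the redirect target is suitable for indexing would still need to be checked against the Commons documentation.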

jhpoelen commented 3 months ago

A first pass at implementing an offline-enabled wikidata taxon matcher -

echo -e "\tElymus repens"\
 | nomer append\
 --include-header wikidata\
 | mlr --itsvlite --oxtab cat

produced -

providedExternalId      
providedName            Elymus repens
relationName            HAS_ACCEPTED_NAME
resolvedExternalId      WD:Q276262
resolvedName            Elymus repens
resolvedAuthorship      
resolvedRank            WD:Q7432
resolvedCommonNames     Gewöhnliche Quecke @de | quackgrass @en | niittyjuola @fi | 偃麦草 @zh
resolvedPath            Spermatophytes | Magnoliophyta | Liliopsida | Commelinidae | Cyperales | Poaceae | Pooideae | Triticeae | Elymus | Elymus repens
resolvedPathIds         WD:Q25814 | WD:Q14562931 | WD:Q1147601 | WD:Q1115272 | WD:Q1860104 | WD:Q43238 | WD:Q4662262 | WD:Q148694 | WD:Q1072892 | WD:Q276262
resolvedPathNames       WD:Q3491997 | WD:Q38348 | WD:Q37517 | WD:Q5867051 | WD:Q36602 | WD:Q35409 | WD:Q164280 | WD:Q227936 | WD:Q34740 | WD:Q7432
resolvedPathAuthorships |  |  |  |  |  |  |  |  |
resolvedExternalUrl     https://www.wikidata.org/wiki/Q276262
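
In the output above, resolvedPath and resolvedPathIds appear to be aligned by position, so each name can be paired with its identifier. A minimal sketch, assuming that alignment holds and that entries are separated by " | "; `zip_path` is a hypothetical helper and not part of nomer.

```shell
#!/bin/bash
# Hypothetical helper: pair each name in a " | "-delimited path with the
# identifier at the same position, printing one "id<TAB>name" line per level.
zip_path() {
  awk -v names="$1" -v ids="$2" 'BEGIN {
    n = split(names, name, / \| /)
    m = split(ids, id, / \| /)
    # Stop at the shorter list in case the two fields differ in length.
    for (i = 1; i <= n && i <= m; i++) print id[i] "\t" name[i]
  }'
}

# Last two levels of the path shown above.
zip_path "Elymus | Elymus repens" "WD:Q1072892 | WD:Q276262"
```
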
jhpoelen commented 3 months ago

Note that non-Wikidata identifiers are also supported, if known to Wikidata. For example -

echo -e "ITIS:512839"\
 | nomer append --include-header wikidata\
 | mlr --itsvlite --oxtab cat

produced -

providedExternalId      ITIS:512839
relationName            SYNONYM_OF
resolvedExternalId      WD:Q276262
resolvedName            Elymus repens
resolvedAuthorship      
resolvedRank            WD:Q7432
resolvedCommonNames     Gewöhnliche Quecke @de | quackgrass @en | niittyjuola @fi | 偃麦草 @zh
resolvedPath            Spermatophytes | Magnoliophyta | Liliopsida | Commelinidae | Cyperales | Poaceae | Pooideae | Triticeae | Elymus | Elymus repens
resolvedPathIds         WD:Q25814 | WD:Q14562931 | WD:Q1147601 | WD:Q1115272 | WD:Q1860104 | WD:Q43238 | WD:Q4662262 | WD:Q148694 | WD:Q1072892 | WD:Q276262
resolvedPathNames       WD:Q3491997 | WD:Q38348 | WD:Q37517 | WD:Q5867051 | WD:Q36602 | WD:Q35409 | WD:Q164280 | WD:Q227936 | WD:Q34740 | WD:Q7432
resolvedPathAuthorships |  |  |  |  |  |  |  |  |
resolvedExternalUrl     https://www.wikidata.org/wiki/Q276262
jhpoelen commented 3 months ago

While working towards addressing a misaligned taxon reported by @kbseah in https://github.com/globalbioticinteractions/globalbioticinteractions/issues/968, a first version of an offline-enabled Wikidata taxon name alignment matcher was introduced in Nomer v0.5.11.