Open cboettig opened 6 years ago
Nice!
Rank normalization is definitely important. I'd approach this rank normalization as any kind of term mapping: retain the original and provide a reproducible mapping. Similar to taxon terms, I imagine that we select or create a rank ontology/vocabulary (can be more than one), and then provide exhaustive mappings from provided terms to these controlled terms. Perhaps we can re-use the wikidata rank ids.
map:
| providedId | providedName | resolvedId | resolvedName |
|---|---|---|---|
| | classe | some:id0 | class |
| | soort | some:id1 | genus |
and cache:
id | name | rank | commonNames | path | pathIds | pathNames | externalUrl | thumbnailUrl |
---|---|---|---|---|---|---|---|---|
some:id0 | class | klasse @nl | class @en | ||||||
some:id1 | genus | soort @nl | genus @en |
Once this is present, we can easily produce various taxonCache formats, pipe-delimited (aka dwca taxon hierarchy style), column-based as you proposed or perhaps a hybrid.
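The "retain the original and provide a reproducible mapping" idea can be sketched in a few lines of Python. This is only an illustration, not nomer's implementation: `TERM_MAP`, `resolve_rank` and the `some:id…` identifiers are hypothetical placeholders taken from the example tables above.

```python
# Hypothetical term map: provided rank name -> (resolved id, resolved name).
# The ids are the placeholder ids from the example tables, not real identifiers.
TERM_MAP = {
    "classe": ("some:id0", "class"),
    "klasse": ("some:id0", "class"),
    "class": ("some:id0", "class"),
    "genus": ("some:id1", "genus"),
}


def resolve_rank(provided_name, term_map=TERM_MAP):
    """Return a termMap-style row that keeps the provided term untouched."""
    key = provided_name.strip().lower()
    # Unknown terms resolve to empty strings rather than raising an error.
    resolved_id, resolved_name = term_map.get(key, ("", ""))
    return {
        "providedId": "",
        "providedName": provided_name,
        "resolvedId": resolved_id,
        "resolvedName": resolved_name,
    }


row = resolve_rank("Klasse")
print(row["providedName"], "->", row["resolvedId"], row["resolvedName"])
```

Because the provided term is kept next to the resolved one, the mapping can be re-run or audited later without going back to the original source.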
This said, I think that taxonCache / taxonMap should probably be renamed to termCache and termMap, because they describe a mapping from one ranked hierarchical term to another in a relatively straightforward way.
Curious to hear your thoughts.
Very nice, I think this sounds like exactly the way to go. And using wikidata ids for resolved ID rank labels sounds like a good choice.
Cc @pmidford do you still maintain your rank ontology? Has this been superseded by wikidata?
I haven't maintained the rank ontology for a while, I'd check with @balhoff (and I'm ok if you change this in the foundry entry).
I've created a simple (perhaps too simple) mapping from occurring rank names to wikidata ids here: https://github.com/globalbioticinteractions/nomer/blob/45543c100c4b4c3a892250768ffe0bb7ef6569fe/nomer/src/main/resources/org/globalbioticinteractions/nomer/util/taxon_rank_links.tsv and here: https://github.com/globalbioticinteractions/nomer/blob/45543c100c4b4c3a892250768ffe0bb7ef6569fe/nomer/src/main/resources/org/globalbioticinteractions/nomer/util/taxon_ranks.tsv
With that, you can now do things like:
echo -e "\tgenus" | java -jar nomer.jar append globi-taxon-rank
which results in:
genus SAME_AS WD:Q34740 genus genus https://www.wikidata.org/wiki/Q34740
Still have to figure out a clever way to apply the same to lists like family | genus
. . .
Now pre-populating rank data from live wikidata, adding common names from many languages and associated mapping.
So now, echo -e "\t種" | java -jar nomer/target/nomer-0.0.1-SNAPSHOT-jar-with-dependencies.jar append globi-taxon-rank
produces:
種 SAME_AS WD:Q7432 species Speiceas @ga | especie @gl | Juehegua @gn | Art @gsw | જાતિ @gu | जाति @hi | Species @hif | Vrsta @hr | Družina @hsb | Espès @ht | faj @hu | տեսակ @hy | specie @ia | Spesies @id | sebbangan @ilo | Tegund @is | specie @it | jutsi @jbo | Spesies @jv | სახეობა @ka | Түр @kk | ಜಾತಿ @kn | Биология тюрлю @krc | Cure @ku | species @la | Aart @lb | Spéce @lmo | ແອສະແປດ @lo | Rūšis @lt | suga @lv | вид @mk | ഉപവർഗ്ഗം @ml | Spesies @ms | speċi @mt | မျိုးစိတ် @my | تی @mzn | Specia @nap | art @nb | Oort (Biologie) @nds | Soort @nds-nl | प्रजाति @new | art @nn | espècia @oc | ପ୍ରଜାତି @or | ਪ੍ਰਜਾਤੀ @pa | Espesye @pam | Spece @pms | سپیشیز @pnb | توکمونه @ps | espécie @pt | espécie @pt-br | Rikch'aq @qu | specie @ro | вид @ru | Вид @rue | Көрүҥ @sah | specia @scn | species @sco | Vrsta @sh | Druh @sk | Nuucyada dhirta @so | Lloji @sq | врста @sr | Spésiés @su | Spishi @sw | намуд @tg | намуд @tg-cyrl | namud @tg-latn | สปีชีส์ @th | Biologik görnüş @tk | Espesye @tl | Tür @tr | төр @tt | төр @tt-cyrl | tör @tt-latn | вид @uk | نوع @ur | Tur @uz | Spece @vec | loài @vi | Sôorte @vls | Indje @wa | Espesye @war | זגאל @yi | 物種 @yue | 种 @zh | 种 @zh-cn | 物种 @zh-hans | 種 @zh-hk | විශේෂය @si | soort @nl | specio @eo | 종 @ko | gatunek @pl | species @en-gb | Vrsta @bs | species @en | art @sv | especie @es | espèce @fr | نوع @ar | Art @de | 種 @ja | espècie @ca | இனம் @ta | 物種 @zh-hant | מין @he | Vrsta @sl | گونه @fa | Зүйл @mn | species @en-ca | జాతి @te | Chéng @nan | druh @cs | Spesie @af | Especie @an | প্ৰজাতি @as | Especie @ast | Bioloji növ @az | Төр @ba | Oart @bar | біялагічны від @be | від @be-tarask | вид @bg | প্রজাতি @bn | Spesad @br | جۆرە @ckb | spezia @co | Тĕс @cv | Rhywogaeth @cy | art @da | Art @de-ch | είδος @el | Liik @et | espezie @eu | laji @fi | Slach @frr | soarte @fy species WD:Q7432 https://www.wikidata.org/wiki/Q7432
Just added a feature to nomer called "replace", which allows for:
echo -e "\tsoort | orde | familie" | java -jar nomer.jar replace globi-taxon-rank
to produce
WD:Q7432 | WD:Q36602 | WD:Q35409\tspecies | order | family
Rather than appending the matched results to the end of the line, replace puts the matches inside the existing columns / table. Note that soort, orde and familie are Dutch common names for the ranks species, order and family respectively. The preceding ids are the corresponding Wikidata taxon rank items, like https://www.wikidata.org/wiki/Q7432 (species). Note that replace also supports pipe-separated lists.
This replace command should make it easier to normalize existing tabular files without having to cut/paste columns all over the place.
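For readers who want to see the idea without running nomer, here is a minimal Python sketch of the same kind of in-place replacement on an `id<TAB>name` line. It is not nomer's code: `RANKS` and `replace_ranks` are hypothetical names, the lookup table only holds the four ranks mentioned in this thread, and it assumes every input line contains exactly one tab separating the id column from the name column.

```python
# Hypothetical lookup: rank name -> (Wikidata id, English rank label).
# Ids are the ones quoted in this thread.
RANKS = {
    "soort": ("WD:Q7432", "species"),
    "orde": ("WD:Q36602", "order"),
    "familie": ("WD:Q35409", "family"),
    "genus": ("WD:Q34740", "genus"),
}


def replace_ranks(line, ranks=RANKS):
    """Fill the id column and normalize the name column of one line."""
    _ids, names = line.rstrip("\n").split("\t", 1)
    new_ids, new_names = [], []
    for name in names.split("|"):
        term = name.strip()
        # Unmatched terms keep their original name and get an empty id.
        rank_id, label = ranks.get(term.lower(), ("", term))
        new_ids.append(rank_id)
        new_names.append(label)
    return " | ".join(new_ids) + "\t" + " | ".join(new_names)


print(replace_ranks("\tsoort | orde | familie"))
```

Keeping the pipe positions aligned between the id column and the name column means the two lists can still be zipped together afterwards.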
@jhpoelen I've applied the rank map to the taxonCache to standardize the rank names as discussed above. I've noticed that for some IDs, this results in conflicting names for a given rank, e.g.
id species path path_rank
<chr> <chr> <chr> <chr>
1 INAT_TAXON:121955 Blechnum discolor Polypodiopsida class
2 INAT_TAXON:121955 Blechnum discolor Filicopsida class
3 INAT_TAXON:122082 Phegopteris connectilis Polypodiopsida class
4 INAT_TAXON:122082 Phegopteris connectilis Filicopsida class
5 INAT_TAXON:122909 Asplenium bulbiferum Polypodiopsida class
6 INAT_TAXON:122909 Asplenium bulbiferum Filicopsida class
7 INAT_TAXON:127079 Polypodium vulgare Polypodiopsida class
8 INAT_TAXON:127079 Polypodium vulgare Filicopsida class
9 INAT_TAXON:129910 Chthamalus dalli "" class
10 INAT_TAXON:129910 Chthamalus dalli Hexanauplia class
...
This could easily be me just messing something up. I haven't had a chance to dig into why I'm getting two different class names mapping to the same taxon id in these cases, but thought I'd file this as a placeholder at least.
@cboettig you've uncovered an inconsistency between Global Names' perspective of the iNaturalist taxonomy and the iNaturalist taxonomy available through their rate-limited API.
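A check like the one that produced the listing above can be sketched as follows. This is an illustrative Python version, not the code that was actually run; `find_conflicts` is a hypothetical helper, and the sample rows are copied from the listing (blank names are counted as a distinct value, as they appear there).

```python
from collections import defaultdict


def find_conflicts(rows):
    """Map (taxon id, rank) -> sorted names, for pairs with more than one name.

    rows is an iterable of (taxon_id, rank, name) tuples.
    """
    seen = defaultdict(set)
    for taxon_id, rank, name in rows:
        seen[(taxon_id, rank)].add(name)
    return {key: sorted(names) for key, names in seen.items() if len(names) > 1}


rows = [
    ("INAT_TAXON:121955", "class", "Polypodiopsida"),
    ("INAT_TAXON:121955", "class", "Filicopsida"),
    ("INAT_TAXON:129910", "class", ""),
    ("INAT_TAXON:129910", "class", "Hexanauplia"),
]

for (taxon_id, rank), names in find_conflicts(rows).items():
    print(taxon_id, rank, names)
```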
Note for instance, https://www.inaturalist.org/taxa/121955 (retrieved 2018-06-09, see screenshot), has class Polypodiopsida, whereas a search in http://resolver.globalnames.org results in the same taxon related to a different class Filicopsida (see attached json retrieved 2018-06-09). Are there any non-iNaturalist taxa in this list?
Yeah, looks like it includes these prefixes
1 GBIF
2 INAT_TAXON
3 IRMNG
4 NBN
5 NCBI
6 OTT
7 WORMS
~ from about 132,015 ids
Thanks for sharing. I poked around a bit and the root causes of the multiple entries per id range from a changing interpretation of the id (e.g., iNaturalist vs global names' inaturalist), to a difference in interpretation (e.g., open tree of life is a synthetic taxonomy with underlying source taxonomies which sometimes overlap - I used the underlying taxonomy in path entries, whereas global names uses the ott synthetic ids in path), to silly export bugs (that pre-date nomer) like having an external url be inferred from an equivalent taxon.
I created the taxon cache to overcome the reproducibility and performance hurdles introduced by data exposed via web apis. Now, it appears that the taxonCache is exposing the different interpretations of specific taxon ids by different systems (e.g., GBIF ids can be resolved directly in three ways, via eol.org, the gbif api or globalnames, and indirectly via wikidata). Also, given my experience that APIs produce time-dependent results, I can see how taxonCache is almost turning into some funky time series on the interpretation of a specific id.
All the more reason to urge those that publish identifiers and their interpretation to publish easy to use, versioned archives so that compiling consistent perspectives on a name graph can be done without breaking the bank.
However, knowing that a consistent, versioned (how was this id/name interpreted in 2014 by xyz?), global names id mapping registry might take a while to materialize, this leaves some pragmatic questions - should GloBI be responsible for producing a consistent, non-conflicting interpretation of term/taxon ids? Or should GloBI make a best effort to link names and ids and include inconsistent interpretations?
Well said. It seems, perhaps unsurprisingly, that not all identifiers are created equal here, and these issues may be more acute with some than with others? I.e., ITIS doesn't seem to exhibit this tendency to resolve differently from different sources? Perhaps there is a subset of ID namespace authorities GloBI could deem reliable?
Re versioning, my understanding was that a proper URI ID ought to be permanent in how it resolves and what it resolves to. Changes should get new ids. Information associated with the ID ought to include a date published, allowing one to decide which of two ids for the "same" species is most current, right?
@cboettig let's say that an identifier is a coded outcome of a shared understanding of a classification of some group of organisms by a group of humans at a particular time. Now, assuming that the shared understanding and group dynamics change over time, it is intuitive to see an identifier as time dependent. Studying the availability, use and interpretation of a specific identifier would be a proxy for studying the dynamics of the group of humans that maintains the identifier. With a linked name graph like the one that GloBI's taxon graph provides, you can not only study the shift/drift in interpretation of taxon ids, but also compare the drift to equivalent taxonomic ids from other name sources. I believe such studies have been done with scientific names, but I haven't yet seen a similar exercise with identifiers. I am not quite sure whether to say that "shifting" identifiers are necessarily less reliable. They might actually be an indication of active curation of a taxonomic naming scheme.
Not quite sure about the statement "changes should get new ids". Practically, I'd say that identifiers should be resolvable / retrievable and be marked as deprecated, replaced or updated by the id curators, and that older copies should be kept around with some sort of versioning scheme for as long as the project is maintained. Depending on the use case, the consumer of the ids can then select the appropriate version of the id. Whichever way the versioning is implemented (e.g., datetime stamp, git-like ledger/content hash), the adoption of this hinges on practical use: given that most content publishers have trouble enough keeping their content relevant and accessible, I would assume that versioning of the published artifacts is probably best left to specialized services similar to the Internet Archive's Wayback Machine.
All very complicated, and for practical purposes, I'd say that keeping versioned copies around for all data retrieved from a name source by individual projects would be a start. This is what I settled on by creating GloBI's taxon graph. As far as I can tell, publishing such an easy to access/parse name graph makes for interesting observations of the underlying services used.
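To make "keeping versioned copies around" a bit more concrete, here is a small Python sketch that stamps each retrieved record with a retrieval time and a content hash. It is only one possible scheme and not how GloBI's taxon graph is actually stored; the `snapshot` function, its field names and the `sha256:` prefix are all hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def snapshot(source, identifier, payload, retrieved_at=None):
    """Wrap a retrieved record with a timestamp and a content hash."""
    # Serialize deterministically so identical content gives an identical hash.
    body = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    when = retrieved_at or datetime.now(timezone.utc)
    return {
        "source": source,
        "id": identifier,
        "retrievedAt": when.isoformat(),
        "contentHash": "sha256:" + hashlib.sha256(body.encode("utf-8")).hexdigest(),
        "content": body,
    }


# Two interpretations of the same id, as seen in the conflict listing earlier.
a = snapshot("inaturalist", "INAT_TAXON:121955", {"class": "Polypodiopsida"})
b = snapshot("globalnames", "INAT_TAXON:121955", {"class": "Filicopsida"})
print(a["contentHash"] != b["contentHash"])
```

Comparing hashes across sources or retrieval dates then shows whether the interpretation of an id has drifted, without having to diff the full records.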
btw - Nico Franz was kind enough to point me to Franz, N.M. & Peet, R.K., 2009. Perspectives: Towards a language for mapping relationships among taxonomic concepts. Systematics and Biodiversity, 7(1), pp.5–20. Available at: http://dx.doi.org/10.1017/s147720000800282x.
Still thinking about the pipe string manipulation and alternatives.
Building on your idea of providing some alternate tabular representations to meet different use cases, I think it would be particularly useful to be able to have each rank as a separate column.
By my tally, there are about 125 different terms used in `pathName` pipe strings, but some of these are really duplicates (abbreviated names, German names, missing values) that could be collapsed down. (I think it might be preferable for some of that mapping to be in a clean version of the taxonCache representation, but maybe it's ill-advised to alter the original data strings in that way.)
Having just the handful of ranks defined in Darwin Core as columns would be a particularly minimal but convenient starting point, though it would seem pretty manageable to include a few `sub`, `super`, `infra` divisions as additional columns. What do you think?
This leaves open a question of mapping the `pathIds` into this schema; but perhaps that could be a separate table or just a duplicate row?
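As a rough illustration of the column-per-rank idea, the Python sketch below spreads one pair of pipe strings into fixed rank columns. It makes several assumptions: that `path` holds the taxon names and `pathNames` the matching rank labels, that rank labels have already been normalized to English, and that `DWC_RANKS` (a hypothetical constant) lists the handful of ranks wanted as columns. The example path is illustrative, not copied from taxonCache.

```python
# Hypothetical choice of rank columns; extend with sub/super/infra ranks as needed.
DWC_RANKS = ["kingdom", "phylum", "class", "order", "family", "genus"]


def path_to_columns(path, path_names, ranks=DWC_RANKS):
    """Spread 'name | name | ...' over rank columns using 'rank | rank | ...'."""
    names = [part.strip() for part in path.split("|")]
    labels = [part.strip().lower() for part in path_names.split("|")]
    row = {rank: "" for rank in ranks}
    for label, name in zip(labels, names):
        # Ranks outside the chosen columns are dropped; the first value wins.
        if label in row and not row[label]:
            row[label] = name
    return row


row = path_to_columns(
    "Plantae | Polypodiopsida | Blechnum",
    "kingdom | class | genus",
)
print(row)
```

The same function could be applied to `pathIds` instead of `path` to fill a parallel table of ids, which is one way to answer the separate-table question.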