globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

Alternative table layout / views from taxonCache #7

Open cboettig opened 6 years ago

cboettig commented 6 years ago

Still thinking about the pipe string manipulation and alternatives.

Building on your idea of providing some alternate tabular representations to meet different use cases, I think it would be particularly useful to be able to have each rank as a separate column.

By my tally, there are about 125 different terms used in pathNames pipe strings:

  [1] "kingdom"            "phylum"             "class"              "order"              "family"             "genus"             
  [7] "species"            "subclass"           "suborder"           "superfamily"        "subfamily"          "subphylum"         
 [13] "subgenus"           "unknown"            "superorder"         "infraorder"         "section"            "subsection"        
 [19] "subkingdom"         "division"           "subdivision"        "superkingdom"       "regn."              "phyl."             
 [25] "cl."                "ord."               "fam."               "gen."               "sp."                "infrakingdom"      
 [31] "superdivision"      "infradivision"      "infraspecies"       "rijk"               "stam"               "klasse"            
 [37] "orde"               "familie"            "geslacht"           "soort"              "superclass"         "infraclass"        
 [43] "tribe"              "subspecies"         "superphylum"        "unranked clade"     "infraphylum"        "parvorder"         
 [49] "subtribe"           "informal"           "species group"      "var."               "onderklasse"        "superfamilie"      
 [55] "onderfamilie"       "varietas"           "variety"            "f."                 "subsp."             "species subgroup"  
 [61] "f.sp."              "forma"              "sub phylum"         "ß"                  "subgen."            "super cohort"      
 [67] "super order"        "onderorde"          "infraorde"          "tak"                "infraspecificname"  "form"              
 [73] "kaharian"           "paylum"             "klase"              "orden"              "superpamilya"       "pamilya"           
 [79] "sari"               "espesye"            "a"                  "?"                  "c"                  "b"                 
 [85] "d"                  "sect."              "trib."              "subsect."           "ser."               "subser."           
 [91] "tax.vag."           "e"                  "hybrid formula"     "g"                  "unranked"           "[unranked]"        
 [97] "sect"               "**"                 "v."                 "subdiv."            "subtrib."           "***"               
[103] "*****"              "epifamily"          "clade"              "hybrid"             "genushybrid"        "zoosection"        
[109] "species aggregate"  "species sensu lato" "cultivar"           "nothovariety"       "species hybrid"     "species pro parte" 
[115] "microspecies"       "generic hybrid"     "forma specialis"    "cohort"             "no rank"            "domain"            
[121] "no rank - terminal" "subvariety"         "superdomain"        "subterclass"        "supertribe"    

but some of these are really duplicates (abbreviated names, non-English names, missing values) that could be collapsed down. (I think it might be preferable for some of that mapping to live in a clean version of the taxonCache representation, but maybe it's ill-advised to alter the original data strings in that way.)

Having just the handful of ranks defined in Darwin Core as columns would be a particularly minimal but convenient starting point, though it would seem pretty manageable to include a few sub, super, infra divisions as additional columns. What do you think?

This leaves open a question of mapping the pathIds into this schema; but perhaps that could be a separate table or just a duplicate row?
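The column-per-rank layout proposed above could be sketched roughly as follows. This is only an illustration, not part of nomer: the function names, the choice of seven rank columns, and the input record are all assumptions made for the example, and it presumes that `path` and `pathNames` are pipe-delimited strings of equal length.

```python
# Rank columns to emit; this particular set is an assumption for illustration.
RANK_COLUMNS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def split_pipes(value):
    """Split a pipe-delimited string into trimmed parts, keeping empty slots."""
    return [part.strip() for part in value.split("|")]


def path_to_columns(path, path_names, ranks=RANK_COLUMNS):
    """Spread a (path, pathNames) pair over one column per selected rank.

    Ranks that are not in `ranks` are dropped; if a rank occurs twice,
    the first non-empty name wins.
    """
    names = split_pipes(path)
    rank_labels = split_pipes(path_names)
    if len(names) != len(rank_labels):
        raise ValueError("path and pathNames have different lengths")
    row = {rank: "" for rank in ranks}
    for name, rank in zip(names, rank_labels):
        if rank in row and not row[rank]:
            row[rank] = name
    return row


# Made-up input record, not taken from taxonCache.
row = path_to_columns(
    "Animalia | Chordata | Mammalia | Carnivora | Feliformia | Felidae | Felis | Felis catus",
    "kingdom | phylum | class | order | suborder | family | genus | species",
)
print(row["family"], "/", row["species"])
```

Calling the same function with `pathIds` in place of `path` would give a parallel row of ids, which is one way to keep the ids in a separate table with the same columns.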

jhpoelen commented 6 years ago

Nice!

Rank normalization is definitely important. I'd approach this rank normalization like any other kind of term mapping: retain the original and provide a reproducible mapping. Similar to taxon terms, I imagine that we select / create a rank ontology/vocabulary (can be more than one), and then provide exhaustive mappings from provided terms to these controlled terms. Perhaps we can re-use the Wikidata rank ids.

map:

providedId  providedName  resolvedId  resolvedName
            klasse        some:id0    class
            soort         some:id1    species

and cache:

id        name     rank  commonNames                path  pathIds  pathNames  externalUrl  thumbnailUrl
some:id0  class          klasse @nl | class @en
some:id1  species        soort @nl | species @en

Once this is present, we can easily produce various taxonCache formats, pipe-delimited (aka dwca taxon hierarchy style), column-based as you proposed or perhaps a hybrid.

This said, I think that taxonCache / taxonMap should probably be renamed to termCache and termMap, because they describe a mapping from one ranked hierarchical term to another in a relatively straightforward way.

Curious to hear your thoughts.
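The map/cache split described above can be sketched as two lookups, where the provided term is always kept next to whatever it resolves to. This is a minimal sketch under stated assumptions: the dictionaries, the function name, and the placeholder ids (`some:id0`, `some:id2`) are hypothetical and do not reflect nomer's actual data structures.

```python
# Hypothetical stand-in for the rank map: provided name -> resolved id.
RANK_MAP = {
    "klasse": "some:id0",
    "cl.": "some:id0",
    "class": "some:id0",
    "orde": "some:id2",
    "ord.": "some:id2",
    "order": "some:id2",
}

# Hypothetical stand-in for the rank cache: resolved id -> controlled name.
RANK_CACHE = {
    "some:id0": "class",
    "some:id2": "order",
}


def resolve_rank(provided_name):
    """Return (providedName, resolvedId, resolvedName).

    The provided term is returned unchanged, so the mapping stays
    reproducible; unmatched terms get empty resolved fields.
    """
    key = provided_name.strip().lower()
    resolved_id = RANK_MAP.get(key, "")
    resolved_name = RANK_CACHE.get(resolved_id, "")
    return (provided_name, resolved_id, resolved_name)


for term in ["klasse", "Ord.", "tak"]:
    print("\t".join(resolve_rank(term)))
```

Keeping the two tables separate means a new spelling only needs a new map entry, while labels and common names stay in one place in the cache.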

cboettig commented 6 years ago

Very nice, I think this sounds like exactly the way to go. And using wikidata ids for resolved ID rank labels sounds like a good choice.

cmungall commented 6 years ago

Cc @pmidford do you still maintain your rank ontology? Has this been superseded by wikidata?

pmidford commented 6 years ago

I haven't maintained the rank ontology for a while, I'd check with @balhoff (and I'm ok if you change this in the foundry entry).

jhpoelen commented 6 years ago

I've created a simple (perhaps too simple) mapping from occurring rank names to Wikidata ids here: https://github.com/globalbioticinteractions/nomer/blob/45543c100c4b4c3a892250768ffe0bb7ef6569fe/nomer/src/main/resources/org/globalbioticinteractions/nomer/util/taxon_rank_links.tsv and here: https://github.com/globalbioticinteractions/nomer/blob/45543c100c4b4c3a892250768ffe0bb7ef6569fe/nomer/src/main/resources/org/globalbioticinteractions/nomer/util/taxon_ranks.tsv

With that, you can now do things like: echo -e "\tgenus" | java -jar nomer.jar append globi-taxon-rank which results in:

    genus   SAME_AS WD:Q34740   genus           genus       https://www.wikidata.org/wiki/Q34740    

Still have to figure out a clever way to apply the same to lists like family | genus ...

jhpoelen commented 6 years ago

Now pre-populating rank data from live wikidata, adding common names from many languages and associated mapping.

So now, echo -e "\t種" | java -jar nomer/target/nomer-0.0.1-SNAPSHOT-jar-with-dependencies.jar append globi-taxon-rank

produces:

    種   SAME_AS WD:Q7432    species     Speiceas @ga | especie @gl | Juehegua @gn | Art @gsw | જાતિ @gu | जाति @hi | Species @hif | Vrsta @hr | Družina @hsb | Espès @ht | faj @hu | տեսակ @hy | specie @ia | Spesies @id | sebbangan @ilo | Tegund @is | specie @it | jutsi @jbo | Spesies @jv | სახეობა @ka | Түр @kk | ಜಾತಿ @kn | Биология тюрлю @krc | Cure @ku | species @la | Aart @lb | Spéce @lmo | ແອສະແປດ @lo | Rūšis @lt | suga @lv | вид @mk | ഉപവർഗ്ഗം @ml | Spesies @ms | speċi @mt | မျိုးစိတ် @my | تی @mzn | Specia @nap | art @nb | Oort (Biologie) @nds | Soort @nds-nl | प्रजाति @new | art @nn | espècia @oc | ପ୍ରଜାତି @or | ਪ੍ਰਜਾਤੀ @pa | Espesye @pam | Spece @pms | سپیشیز @pnb | توکمونه @ps | espécie @pt | espécie @pt-br | Rikch'aq @qu | specie @ro | вид @ru | Вид @rue | Көрүҥ @sah | specia @scn | species @sco | Vrsta @sh | Druh @sk | Nuucyada dhirta @so | Lloji @sq | врста @sr | Spésiés @su | Spishi @sw | намуд @tg | намуд @tg-cyrl | namud @tg-latn | สปีชีส์ @th | Biologik görnüş @tk | Espesye @tl | Tür @tr | төр @tt | төр @tt-cyrl | tör @tt-latn | вид @uk | نوع @ur | Tur @uz | Spece @vec | loài @vi | Sôorte @vls | Indje @wa | Espesye @war | זגאל @yi | 物種 @yue | 种 @zh | 种 @zh-cn | 物种 @zh-hans | 種 @zh-hk | විශේෂය @si | soort @nl | specio @eo | 종 @ko | gatunek @pl | species @en-gb | Vrsta @bs | species @en | art @sv | especie @es | espèce @fr | نوع @ar | Art @de | 種 @ja | espècie @ca | இனம் @ta | 物種 @zh-hant | מין @he | Vrsta @sl | گونه @fa | Зүйл @mn | species @en-ca | జాతి @te | Chéng @nan | druh @cs | Spesie @af | Especie @an | প্ৰজাতি @as | Especie @ast | Bioloji növ @az | Төр @ba | Oart @bar | біялагічны від @be | від @be-tarask | вид @bg | প্রজাতি @bn | Spesad @br | جۆرە @ckb | spezia @co | Тĕс @cv | Rhywogaeth @cy | art @da | Art @de-ch | είδος @el | Liik @et | espezie @eu | laji @fi | Slach @frr | soarte @fy    species WD:Q7432    https://www.wikidata.org/wiki/Q7432 

jhpoelen commented 6 years ago

Just added a feature to nomer called "replace", which allows for:

echo -e "\tsoort | orde | familie" | java -jar nomer.jar replace globi-taxon-rank

to produce:

WD:Q7432 | WD:Q36602 | WD:Q35409\tspecies | order | family

Rather than appending the matched results to the end of the line, replace puts the matches inside the existing columns of the table. Note that soort, orde and familie are Dutch common names for the ranks species, order and family respectively. The preceding ids are the corresponding Wikidata taxon rank items, like https://www.wikidata.org/wiki/Q7432 (species). Note that replace also supports pipe-separated lists.

This replace command should make it easier to normalize existing tabular files without having to cut/paste columns all over the place.
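To make the difference from append concrete, here is a rough Python imitation of what replace does to a two-column, tab-separated line. It is not nomer's implementation: the lookup table only contains the four rank ids quoted in this thread, the function name is made up, and leaving unmatched terms in place with an empty id is an assumption.

```python
# Rank ids quoted earlier in this thread; the table itself is a hypothetical stand-in.
RANKS = {
    "soort": ("WD:Q7432", "species"),
    "orde": ("WD:Q36602", "order"),
    "familie": ("WD:Q35409", "family"),
    "genus": ("WD:Q34740", "genus"),
}


def replace_ranks(line):
    """Overwrite the id and name columns of a tab-separated line in place.

    The second column may hold a pipe-separated list; each term is looked
    up on its own. Unmatched terms keep their text and get an empty id
    (an assumption, not necessarily what nomer does).
    """
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 2:
        raise ValueError("expected at least an id column and a name column")
    ids, names = [], []
    for term in columns[1].split("|"):
        term = term.strip()
        rank_id, rank_name = RANKS.get(term.lower(), ("", term))
        ids.append(rank_id)
        names.append(rank_name)
    columns[0] = " | ".join(ids)
    columns[1] = " | ".join(names)
    return "\t".join(columns)


print(replace_ranks("\tsoort | orde | familie"))
```

Any further columns on the line are passed through untouched, which is what makes this style convenient for normalizing an existing table.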

cboettig commented 6 years ago

@jhpoelen I've applied the rank map to the taxonCache to standardize the rank names as discussed above. I've noticed that for some IDs, this results in conflicting names for a given rank, e.g.

  id                species                 path           path_rank
   <chr>             <chr>                   <chr>          <chr>    
 1 INAT_TAXON:121955 Blechnum discolor       Polypodiopsida class    
 2 INAT_TAXON:121955 Blechnum discolor       Filicopsida    class    
 3 INAT_TAXON:122082 Phegopteris connectilis Polypodiopsida class    
 4 INAT_TAXON:122082 Phegopteris connectilis Filicopsida    class    
 5 INAT_TAXON:122909 Asplenium bulbiferum    Polypodiopsida class    
 6 INAT_TAXON:122909 Asplenium bulbiferum    Filicopsida    class    
 7 INAT_TAXON:127079 Polypodium vulgare      Polypodiopsida class    
 8 INAT_TAXON:127079 Polypodium vulgare      Filicopsida    class    
 9 INAT_TAXON:129910 Chthamalus dalli        ""             class    
10 INAT_TAXON:129910 Chthamalus dalli        Hexanauplia    class 
...

This could easily be me just messing something up; I haven't had a chance to dig into why two different class names map to the same taxon id in these cases, but thought I'd file this as a placeholder at least.
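A check for this kind of conflict can be sketched as a group-by over (id, rank). The rows below are copied from the table above; the function name and the `ignore_empty` switch are assumptions added for illustration.

```python
from collections import defaultdict


def find_rank_conflicts(rows, ignore_empty=False):
    """Return {(id, rank): sorted names} where one id has several names at a rank.

    `rows` is an iterable of (id, name, rank) tuples. With ignore_empty=True,
    empty names are treated as missing values rather than as competing names.
    """
    seen = defaultdict(set)
    for taxon_id, name, rank in rows:
        if ignore_empty and not name:
            continue
        seen[(taxon_id, rank)].add(name)
    return {key: sorted(names) for key, names in seen.items() if len(names) > 1}


rows = [
    ("INAT_TAXON:121955", "Polypodiopsida", "class"),
    ("INAT_TAXON:121955", "Filicopsida", "class"),
    ("INAT_TAXON:129910", "", "class"),
    ("INAT_TAXON:129910", "Hexanauplia", "class"),
]

for (taxon_id, rank), names in find_rank_conflicts(rows).items():
    print(taxon_id, rank, names)
```

Whether an empty string next to a real name counts as a conflict is a judgment call; the switch makes that choice explicit instead of hiding it in the data.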

jhpoelen commented 6 years ago

@cboettig you've uncovered an inconsistency between Global Names' perspective on the iNaturalist taxonomy and the iNaturalist taxonomy available through their rate-limited API.

Note for instance, https://www.inaturalist.org/taxa/121955 (retrieved 2018-06-09, see screenshot), has class Polypodiopsida, whereas a search in http://resolver.globalnames.org results in the same taxon related to a different class Filicopsida (see attached json retrieved 2018-06-09). Are there any non-iNaturalist taxa in this list?

screenshot from 2018-06-09 11-11-08 globalNamesiNat121955.json.txt

cboettig commented 6 years ago

Yeah, looks like it includes these prefixes

1 GBIF      
2 INAT_TAXON
3 IRMNG     
4 NBN       
5 NCBI      
6 OTT       
7 WORMS 

from about 132,015 ids
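A prefix tally like the one above can be produced in a few lines. This is a sketch that assumes ids have the form PREFIX:local-part; the function name is made up, and the GBIF and NCBI ids below are placeholders, not real records.

```python
from collections import Counter


def prefix_counts(ids):
    """Count ids per namespace prefix (the text before the first colon)."""
    return Counter(taxon_id.split(":", 1)[0] for taxon_id in ids)


ids = [
    "INAT_TAXON:121955",
    "INAT_TAXON:129910",
    "GBIF:placeholder-1",  # placeholder id, for illustration only
    "NCBI:placeholder-2",  # placeholder id, for illustration only
]

for prefix, count in sorted(prefix_counts(ids).items()):
    print(prefix, count)
```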

jhpoelen commented 6 years ago

Thanks for sharing. I poked around a bit, and the root causes of the multiple entries per id range from a changing interpretation of the id (e.g., iNaturalist vs. Global Names' iNaturalist), to a difference in interpretation (e.g., Open Tree of Life is a synthetic taxonomy with underlying source taxonomies which sometimes overlap; I used the underlying taxonomy in path entries, whereas Global Names uses the OTT synthetic ids in path), to silly export bugs (that pre-date nomer) like having an external url be inferred from an equivalent taxon.

I created the taxon cache to overcome the reproducibility and performance hurdles introduced by data exposed via web APIs. Now, it appears that the taxonCache is exposing the different interpretations of specific taxon ids by different systems (e.g., GBIF ids can be resolved directly in three ways, via eol.org, via the GBIF API, or via Global Names, and indirectly via Wikidata). Also, given my experience that APIs produce time-dependent results, I can see how taxonCache is almost turning into some funky time series on the interpretation of a specific id.

All the more reason to urge those that publish identifiers and their interpretation to publish easy to use, versioned archives so that compiling consistent perspectives on a name graph can be done without breaking the bank.

However, knowing that a consistent, versioned (how was this id/name interpreted in 2014 by xyz?), global names id mapping registry might take a while to materialize, this leaves some pragmatic questions: should GloBI be responsible for producing a consistent, non-conflicting interpretation of term/taxon ids? Or should GloBI make a best effort to link names and ids and include inconsistent interpretations?

cboettig commented 6 years ago

Well said. It seems, perhaps unsurprisingly, that not all identifiers are created equal here, and these issues may be more acute with some than with others? E.g., ITIS doesn’t seem to exhibit this tendency to resolve differently from different sources? Perhaps there is a subset of ID namespace authorities GloBI could deem reliable?

Re versioning, my understanding was that a proper URI ID ought to be permanent in how it resolves and what it resolves to. Changes should get new ids. Information associated with the ID ought to include a date published, allowing one to decide which of two ids for the “same” species is most current, right?

jhpoelen commented 6 years ago

@cboettig let's say that an identifier is a coded outcome of a shared understanding of a classification of some group of organisms by a group of humans at a particular time. Now, assuming that the shared understanding and group dynamics change over time, it is intuitive to see an identifier as time-dependent. Studying the availability, use and interpretation of a specific identifier would be a proxy for studying the dynamics of the group of humans that maintains the identifier. With a linked name graph like the one that GloBI's taxon graph provides, you can not only study the shift/drift in interpretation of taxon ids, you can also compare the drift to equivalent taxonomic ids from other name sources. I believe such studies have been done with scientific names, but I haven't yet seen a similar exercise with identifiers. I am not quite sure whether to say that "shifting" identifiers are necessarily less reliable. They might actually be an indication of active curation of a taxonomic naming scheme.

Not quite sure about the statement "changes should get new ids". Practically, I'd say that identifiers should be resolvable / retrievable, be marked as deprecated, replaced or updated by the id curators, and keep older copies around with some sort of versioning scheme for as long as the project is maintained. Depending on the use case, the consumer of the ids can then select the appropriate version of the id. Whichever way the versioning is implemented (e.g., datetime stamp, git-like ledger/content hash), the adoption of this hinges on the practical use: given that most content publishers have trouble enough keeping their content relevant and accessible, I would assume that versioning of the published artifacts is probably best left to specialized services similar to the Internet Archive's Wayback Machine.

All very complicated, and for practical purposes, I'd say that keeping versioned copies around for all data retrieved from a name source by individual projects would be a start. This is what I settled on by creating GloBI's taxon graph. As far as I can tell, publishing such an easy to access/parse name graph makes for interesting observations of the underlying services used.
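One way to picture "keeping versioned copies around" is to store every retrieved record together with its source, retrieval date, and a content hash, so that differing interpretations of the same id stay distinguishable. The sketch below assumes records are small JSON-like dictionaries; the function names and record layout are hypothetical and do not describe how GloBI's taxon graph is actually stored. The two example records mirror the iNaturalist vs. Global Names case discussed earlier in this thread.

```python
import hashlib
import json
from datetime import date


def snapshot(source, taxon_id, record, retrieved):
    """Wrap a retrieved record with its source, retrieval date and content hash."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {
        "source": source,
        "id": taxon_id,
        "retrieved": retrieved.isoformat(),
        "sha256": digest,
        "record": record,
    }


def distinct_interpretations(snapshots, taxon_id):
    """Group the snapshots of one id by content hash.

    More than one key means the id was interpreted differently by
    different sources or at different times.
    """
    by_hash = {}
    for snap in snapshots:
        if snap["id"] == taxon_id:
            by_hash.setdefault(snap["sha256"], []).append(
                (snap["source"], snap["retrieved"])
            )
    return by_hash


day = date(2018, 6, 9)
snapshots = [
    snapshot("inaturalist", "INAT_TAXON:121955", {"class": "Polypodiopsida"}, day),
    snapshot("globalnames", "INAT_TAXON:121955", {"class": "Filicopsida"}, day),
]

found = distinct_interpretations(snapshots, "INAT_TAXON:121955")
print(len(found), "distinct interpretations")
```

Hashing a canonical serialization rather than the raw response means that two retrievals with the same content but different key order collapse into one version.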

jhpoelen commented 6 years ago

btw - Nico Franz was kind enough to point me to Franz, N.M. & Peet, R.K., 2009. Perspectives: Towards a language for mapping relationships among taxonomic concepts. Systematics and Biodiversity, 7(1), pp.5–20. Available at: http://dx.doi.org/10.1017/s147720000800282x.

jhpoelen commented 4 years ago

see https://github.com/globalbioticinteractions/nomer/issues/5#issuecomment-597956495