Open cboettig opened 6 years ago
Interesting! Could you please provide a more specific example?
Good question, I'm still wrapping my own head around just what that would be. One possible configuration might be something like:
{"@context": {"@vocab": "http://example.com/", "parent": {"@type": "@id"}},
"@graph": [
{"name": "Animalia", "rank": "kingdom", "@id": "GBIF:1"},
{"name": "Chordata", "rank": "phylum", "@id": "GBIF:44", "parent": "GBIF:1"},
{"name": "Mammalia", "rank": "class", "@id": "GBIF:359", "parent": "GBIF:44"},
{"name": "Carnivora", "rank": "order", "@id": "GBIF:732", "parent": "GBIF:359"},
{"name": "Canidae", "rank": "family", "@id": "GBIF:9701", "parent": "GBIF:732"},
{"name": "Canis", "rank": "genus", "@id": "GBIF:5219142", "parent": "GBIF:9701" },
{"name": "Canis lupus", "rank": "species", "@id": "GBIF:5219173", "parent": "GBIF:5219142"}
]
}
This can equivalently be represented as a set of triples, but is quite nice in JSON. JSON is easier to parse than pipe strings, and the JSON-LD algorithms can be nice for manipulating this, e.g. to convert it into a nested structure by parent or by child (e.g. by defining the @reverse property in the @context; see the above example in the JSON-LD Playground).
This can also obviously be rendered to the equivalent RDF. Clearly one would want a more intelligent choice for @context than the "@vocab": "http://example.com/", e.g. the rank terms should probably be defined in Darwin Core, etc.; that's probably in the source data anyway.
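As a rough illustration of the nesting by parent mentioned above, here is a minimal sketch in plain Python. It does not use the JSON-LD framing algorithm; `nest_by_parent` is a hypothetical helper, and it assumes the `parent` links are acyclic.

```python
# A few nodes copied from the @graph example above.
graph = [
    {"name": "Animalia", "rank": "kingdom", "@id": "GBIF:1"},
    {"name": "Chordata", "rank": "phylum", "@id": "GBIF:44", "parent": "GBIF:1"},
    {"name": "Mammalia", "rank": "class", "@id": "GBIF:359", "parent": "GBIF:44"},
]


def nest_by_parent(nodes):
    """Replace each node's parent @id with the parent node itself.

    Hypothetical helper: assumes parent links never form a cycle.
    """
    by_id = {node["@id"]: node for node in nodes}

    def expand(node):
        out = {key: value for key, value in node.items() if key != "parent"}
        parent_id = node.get("parent")
        if parent_id in by_id:
            out["parent"] = expand(by_id[parent_id])
        elif parent_id is not None:
            # Unresolved reference: keep the bare identifier.
            out["parent"] = parent_id
        return out

    return [expand(node) for node in nodes]


nested = nest_by_parent(graph)
print(nested[-1])
```

A JSON-LD processor with a frame would do this more generally; the sketch only shows what the nested result looks like.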
I haven't yet tried mixing and matching entries from different authorities in this example, but clearly that would be important. Obviously if you had all the different examples in the above format you could already query the JSON to say, give me all @ids that have "name": "Canis lupus" (and maybe also have "rank": "species"), so it would be a good start already, though it might be interesting to think about adding triples/statements explicitly defining some relation between them. (Doing something like GBIF:id owl:sameAs ITIS:id is probably asking for trouble(??), even though that's probably how most researchers want to think about these ids...)
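The query described above ("give me all @ids that have this name, and maybe this rank") could be sketched in Python roughly as follows. The function name `find_ids` is made up for illustration, and the document is a shortened copy of the example, still using the placeholder `@vocab`.

```python
import json

# Shortened copy of the example document; "@vocab" is still a placeholder.
doc = json.loads("""
{"@context": {"@vocab": "http://example.com/", "parent": {"@type": "@id"}},
 "@graph": [
  {"name": "Canis", "rank": "genus", "@id": "GBIF:5219142", "parent": "GBIF:9701"},
  {"name": "Canis lupus", "rank": "species", "@id": "GBIF:5219173", "parent": "GBIF:5219142"}
 ]}
""")


def find_ids(graph, name, rank=None):
    """Return the @id of every node with the given name (and rank, if given)."""
    return [
        node["@id"]
        for node in graph
        if node.get("name") == name and (rank is None or node.get("rank") == rank)
    ]


matches = find_ids(doc["@graph"], "Canis lupus", rank="species")
print(matches)
```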
Nice! I like the idea of bold statements like NCBI:9606 sameAs GBIF:2436436 because it invites discussion and helps to codify assumptions that are often implicitly made already. Isn't a Homo sapiens a Homo sapiens? I'd even argue that Homo sapiens sameAs Homo sapiens (often used) is less accurate than NCBI:9606 sameAs GBIF:2436436, because the ids define (by reference) an explicit taxonomic context, whereas the strings leave the machine (and most humans) guessing: is this just a sequence of characters or a taxonomic name?
The JSON-LD example looks pretty good. I'd go for including the sameAs things and making the rank a little more explicit, like:
{"@context": {"@vocab": "http://example.com/", "parent": {"@type": "@id"}},
"@graph": [
{"name": "kingdom", "@id": "GBIF_RANK:1"},
{"name": "phylum", "@id": "GBIF_RANK:2", "parent": "GBIF_RANK:1"},
{"name": "Animalia", "rank": "GBIF_RANK:1", "@id": "GBIF:1"},
{"name": "Chordata", "rank": "GBIF_RANK:2", "@id": "GBIF:44", "parent": "GBIF:1"},
...
]
}
Note that with existing features, using tab-separated lines with pipes for internal arrays, you can do searches like "give me all the Anura, but exclude the plants" (note that, in addition to being an amphibian order, Anura is also a plant genus):
cat [some names] | java -jar nomer.jar append | grep "Anura" | grep -v "Plantae"
Granted that it is not as semantically explicit, it is quite fast and can be quite accurate, perhaps after some column selection using awk. That said, I do agree that JSON-LD / jq provide much more powerful graph-like queries on the command line.
How would you imagine using this future json-ld feature? Do you imagine that the format would be both input and output? Are you aware of any other projects that use this kind of format to exchange taxonomic (equivalence) data?
Nice, you make a convincing case that we should just go for it with sameAs. It certainly seems justified to assert that an NCBI identifier is the sameAs the GBIF identifier when the strings match and all, though I imagine there are some edge cases, like where one authority has split a species and the other has not.
Yeah, I do like the simplicity of the pipe lists (and it looks like they have a precedent in Darwin Core's higherClassification?), but they can be tricky to use when you want to keep track of which rank Anura is supposed to refer to, and maybe according to which authority. I tell my students that grep should be thought of as a last resort, when the data provider hasn't given you a more structured way of doing something, since it is easier for it to have unanticipated side effects, like getting a plant genus instead of a vertebrate order! So in general, I think a well-thought-out JSON structure would be ideal.
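To make the grep concern concrete, here is a small Python sketch of a rank-aware match on pipe-delimited paths. The two-string record layout, the helper name `has_taxon`, and the abbreviated paths are all assumptions for illustration; they are not nomer's actual column layout.

```python
def has_taxon(path_names, path_ranks, name, rank):
    """True if `name` occurs in the pipe-delimited path at the given rank.

    Hypothetical helper: assumes names and ranks are parallel, pipe-separated lists.
    """
    names = [part.strip() for part in path_names.split("|")]
    ranks = [part.strip() for part in path_ranks.split("|")]
    return any(n == name and r == rank for n, r in zip(names, ranks))


# Abbreviated, illustrative paths (not real nomer output).
frog_names = "Animalia | Chordata | Amphibia | Anura"
frog_ranks = "kingdom | phylum | class | order"
plant_names = "Plantae | Anura"
plant_ranks = "kingdom | genus"

# A plain substring search for "Anura" would hit both records;
# matching on name and rank together keeps only the amphibian order.
print(has_taxon(frog_names, frog_ranks, "Anura", "order"))
print(has_taxon(plant_names, plant_ranks, "Anura", "order"))
```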
Good question about the JSON-LD framing. My thinking was that the nice thing about JSON-LD frames is that they give the user (or at least the app developer) some control over what their preferred JSON-LD structure might be -- in particular, I was thinking that it might be nicer for the user to have a nested JSON file than to have to resolve the parent links manually, and JSON-LD handles that rather nicely (even letting you reverse the nesting). However, with a little more thought, I'm not really sure that a nested structure is particularly useful. I think what most researchers would find most intuitive would be for rank names to act as keys rather than values in the JSON, something like:
{
"species": {"name": "Canis lupus", "@id": "GBIF:5219173", "sameAs": "ITIS:180596", ...},
"genus": {"name": "Canis", "@id": "GBIF:5219142"},
"family": {"name": "Canidae", "@id": "GBIF:9701"},
"order": {"name": "Carnivora", "@id": "GBIF:732"},
"class": {"name": "Mammalia", "@id": "GBIF:359"},
"phylum": {"name": "Chordata", "@id": "GBIF:44"},
"subkingdom": {"name": "Bilateria", "@id": "ITIS:914154"},
"kingdom": {"name": "Animalia", "@id": "GBIF:1"}
}
Do you think something like that is possible? It is clearly a little more dicey semantically -- it requires divorcing the rank levels (like phylum, subphylum, etc.) from a particular authority, and it also doesn't leave room for different authorities to have different name strings for the same rank (though I think that is implicit in using sameAs). Maybe those issues (and maybe others) make it unworkable, but I think it corresponds to how most ecologists want to think about and use taxonomic names.
E.g., at least for something like this wolf, ITIS might provide additional ranks, but agrees about the names of all the ranks that match GBIF ranks (i.e. both agree the "Class" is "Mammalia"). I'm not sure if that would hold in general.
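One way to get from the flat @graph form to the rank-keyed layout proposed above is sketched below in Python. `key_by_rank` is a hypothetical helper, and it assumes each rank occurs at most once per lineage, which is exactly the semantic shortcut being discussed.

```python
# Nodes copied from the first @graph example in this thread.
graph = [
    {"name": "Animalia", "rank": "kingdom", "@id": "GBIF:1"},
    {"name": "Chordata", "rank": "phylum", "@id": "GBIF:44", "parent": "GBIF:1"},
    {"name": "Mammalia", "rank": "class", "@id": "GBIF:359", "parent": "GBIF:44"},
    {"name": "Carnivora", "rank": "order", "@id": "GBIF:732", "parent": "GBIF:359"},
]


def key_by_rank(nodes):
    """Turn a single lineage into a {rank: {"name", "@id"}} mapping.

    Hypothetical helper: raises if a rank appears twice, since the
    rank-keyed layout cannot represent that.
    """
    keyed = {}
    for node in nodes:
        rank = node.get("rank")
        if rank is None:
            continue  # unranked nodes have no key in this layout
        if rank in keyed:
            raise ValueError(f"rank {rank!r} appears more than once")
        keyed[rank] = {"name": node["name"], "@id": node["@id"]}
    return keyed


lineage = key_by_rank(graph)
print(lineage["class"])
```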
Side comment: provided we don't butcher the semantics, using JSON-LD instead of plain JSON means the data automatically has a sensible RDF serialization as well, which might appeal to the hardcore biodiversity informatics folks. Or put another way, one could think of this as starting with a dump of RDF triple statements from all of the authorities, and we are just defining a JSON-LD frame that parses said triples into a more developer-friendly JSON structure...
Did some experimentation today with your idea, and came up with something like the output below: one json object per line. Note that I left out all taxonomic ranks. These can be added once we settle on a json-ld format.
Regarding the semantics of ranks: rather than interpreting it as a property, I see it as a relationship: so, species: { X } would be interpreted as X is a taxon of rank species, where species can be linked to a (coded) relationship from some taxonomy ontology. "same_as" would be handled similarly.
@cboettig curious to hear your thoughts on this.
echo -e "ITIS:180596\tCanis lupus" | java -jar nomer/target/nomer-0.0.1-SNAPSHOT-jar-with-dependencies.jar appendJson globi-globalnames | jq .
using matcher [org.eol.globi.taxon.GlobalNamesService]
{
"species": {
"@id": "NCBI:9612",
"name": "Canis lupus",
"same_as": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
}
}
{
"species": {
"@id": "OTT:247341",
"name": "Canis lupus",
"same_as": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
}
}
{
"species": {
"@id": "INAT_TAXON:42048",
"name": "Canis lupus",
"same_as": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
}
}
{
"species": {
"@id": "ITIS:180596",
"name": "Canis lupus",
"same_as": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
}
}
{
"species": {
"@id": "IRMNG:11407661",
"name": "Canis lupus",
"same_as": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
}
}
{
"species": {
"@id": "GBIF:5219173",
"name": "Canis lupus",
"same_as": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
}
}
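For anyone consuming this kind of output outside jq, a stream of concatenated JSON objects can be read in Python with `json.JSONDecoder.raw_decode`. The sketch below is only an illustration: `iter_json_objects` is a made-up helper, the sample is abbreviated from the output above, and it assumes the stream contains nothing but JSON and whitespace.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Two records abbreviated from the output above.
stream = """
{"species": {"@id": "NCBI:9612", "name": "Canis lupus",
             "same_as": {"@id": "ITIS:180596", "name": "Canis lupus"}}}
{"species": {"@id": "GBIF:5219173", "name": "Canis lupus",
             "same_as": {"@id": "ITIS:180596", "name": "Canis lupus"}}}
"""

# Group the matched ids under the id they were asserted to be the same as.
equivalents = {}
for record in iter_json_objects(stream):
    taxon = record["species"]
    equivalents.setdefault(taxon["same_as"]["@id"], []).append(taxon["@id"])

print(equivalents)
```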
@jhpoelen This is definitely interesting. Note that in your example, species here is still acting as a "property" (a predicate in RDF speak), but a predicate that takes a node / reference / object (we have too many terms for the same concept), instead of taking a literal.
But I like this! You're probably right that it's wise to explicitly have a JSON object for each identifier. Note that your example could be "compacted" in JSON-LD as follows:
{ "@context": {
"@vocab": "https://nomer.org/",
"same_as": {"@type": "@id"}
},
"@graph": [
{
"species": {
"@id": "NCBI:9612",
"name": "Canis lupus",
"same_as": "ITIS:180596"
}
},
{
"species": {
"@id": "OTT:247341",
"name": "Canis lupus",
"same_as": "ITIS:180596"
}
},
{
"species": {
"@id": "INAT_TAXON:42048",
"name": "Canis lupus",
"same_as": "ITIS:180596"
}
},
{
"species": {
"@id": "ITIS:180596",
"name": "Canis lupus",
"same_as": "ITIS:180596"
}
},
{
"species": {
"@id": "IRMNG:11407661",
"name": "Canis lupus",
"same_as": "ITIS:180596"
}
},
{
"species": {
"@id": "GBIF:5219173",
"name": "Canis lupus",
"same_as": "ITIS:180596"
}
}]
}
i.e. see that in action here: http://tinyurl.com/y7rh648u
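The hand-written compaction above can be mimicked with a few lines of plain Python, shown here as a sketch. This is not the JSON-LD compaction algorithm; `compact_same_as` is a hypothetical helper, and like the example above it drops the nested name of the `same_as` target and keeps only its `@id`.

```python
def compact_same_as(records, vocab="https://nomer.org/"):
    """Wrap records in a @graph and collapse each same_as object to its @id."""
    graph = []
    for record in records:
        compacted = {}
        for rank, taxon in record.items():
            taxon = dict(taxon)  # shallow copy so the input is left untouched
            target = taxon.get("same_as")
            if isinstance(target, dict):
                taxon["same_as"] = target["@id"]
            compacted[rank] = taxon
        graph.append(compacted)
    return {
        "@context": {"@vocab": vocab, "same_as": {"@type": "@id"}},
        "@graph": graph,
    }


# One record copied from the expanded output earlier in the thread.
records = [
    {"species": {"@id": "NCBI:9612", "name": "Canis lupus",
                 "same_as": {"@id": "ITIS:180596", "name": "Canis lupus"}}},
]

compacted_doc = compact_same_as(records)
print(compacted_doc["@graph"][0])
```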
Although the prefix is irrelevant at the RDF level, I would encourage NCBITaxon as the prefix since that's standard in OBO.
If you want to use the obolibrary purls for NCBITaxon, then the correct OWL construct is owl:equivalentClass and not owl:sameAs; otherwise you induce punning (but conversely this induces the ITIS IRI to be an owl:Class, which may not be their intent...). Or you can just ignore OWL semantics and use sameAs, YMMV...
@cmungall Thanks much for weighing in here. Yeah, I suspected owl:sameAs could have some unintended consequences -- what does 'induce punning' mean?
See: https://www.w3.org/TR/owl2-new-features/#F12:_Punning
The arguments for sameAs must be individuals; the arguments for equivalentClasses must be classes. Individuals and classes are disjoint in OWL-DL. However, if something is inferred to be both, it doesn't create a problem, as they are assumed to be different entities with the same name.
It has no effect on RDF-level interpretations, only on the OWL interpretation of the graph.
Good stuff! Leaving out the plumbing and compaction:
echo -e "ITIS:180596\tCanis lupus" | java -jar nomer/target/nomer-0.0.1-SNAPSHOT-jar-with-dependencies.jar appendJson globi-globalnames | jq .
now produces:
{
"species": {
"@id": "NCBITaxon:9612",
"name": "Canis lupus",
"equivalent_to": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
}
}
{
"species": {
"@id": "OTT:247341",
"name": "Canis lupus",
"equivalent_to": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
}
}
{
"species": {
"@id": "INAT_TAXON:42048",
"name": "Canis lupus",
"equivalent_to": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
}
}
{
"species": {
"@id": "ITIS:180596",
"name": "Canis lupus",
"equivalent_to": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
}
}
{
"species": {
"@id": "IRMNG:11407661",
"name": "Canis lupus",
"equivalent_to": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
}
}
{
"species": {
"@id": "GBIF:5219173",
"name": "Canis lupus",
"equivalent_to": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
}
}
This is assuming that non-NCBITaxon ids have some kind of class hierarchy. Perhaps a way to motivate others / ourselves to repeat http://obofoundry.org/ontology/ncbitaxon.html with other taxonomies...
@cboettig @cmungall is this what you had in mind?
Come to think of it, Nomer can now perhaps take on the role of term class hierarchy builder. Imagine ...
👏 I like where this is going.
So how feasible would it be to create a JSON dump like this for every ID that nomer knows? Is it stupid to create that kind of static record? My intuition is that having such a JSON blob would be easier to develop other tooling against than introducing a dependency on a particular piece of software or web API to do this stuff. I'm guessing the file would be large but probably manageable?
Feasible, with performance varying by the matcher. For instance, the globi-globalnames and globi-enrich matchers talk to web APIs, so you'd have to feed in the world to match with them, and that would take a while. However, matchers like globi-cache use a taxon graph a la https://doi.org/10.5281/zenodo.755513. And... these graphs can be expressed in JSON.
I don't think static records are stupid. In fact, I think dynamic records are stupid if they don't leave a trail of static records. And I can point to static records, archive them and give them to friends. So, for lack of better terms, I think that static records are pretty smart.
@jhpoelen Very cool. Also somehow I hadn't seen https://doi.org/10.5281/zenodo.755513 before, that's very handy. yay for convenient static records.
I've extended the suggested json output to include ranks.
Now, echo -e "ITIS:180596\tCanis lupus" | java -jar nomer/target/nomer-0.0.1-SNAPSHOT-jar-with-dependencies.jar append-json globi-globalnames | jq .
produces the result included below.
Please note that a path element is added to include all taxonomic ranks, even those that have no rank name or have ids but no names. The addition of the path element makes the format a little more lenient toward the (notorious) use of non-standard ranks, or ranks in Latin (e.g., regnum vs. kingdom).
{
"species": {
"@id": "NCBITaxon:9612",
"name": "Canis lupus",
"equivalent_to": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
},
"norank": {
"@id": "NCBITaxon:131567",
"name": ""
},
"superkingdom": {
"@id": "NCBITaxon:2759",
"name": "Eukaryota"
},
"kingdom": {
"@id": "NCBITaxon:33208",
"name": "Metazoa"
},
"phylum": {
"@id": "NCBITaxon:7711",
"name": "Chordata"
},
"subphylum": {
"@id": "NCBITaxon:89593",
"name": "Craniata"
},
"class": {
"@id": "NCBITaxon:40674",
"name": "Mammalia"
},
"superorder": {
"@id": "NCBITaxon:314145",
"name": "Laurasiatheria"
},
"order": {
"@id": "NCBITaxon:33554",
"name": "Carnivora"
},
"suborder": {
"@id": "NCBITaxon:379584",
"name": "Caniformia"
},
"family": {
"@id": "NCBITaxon:9608",
"name": "Canidae"
},
"genus": {
"@id": "NCBITaxon:9611",
"name": "Canis"
},
"path": {
"names": [
"",
"Eukaryota",
"Opisthokonta",
"Metazoa",
"Eumetazoa",
"Bilateria",
"Deuterostomia",
"Chordata",
"Craniata",
"Vertebrata",
"Gnathostomata",
"Teleostomi",
"Euteleostomi",
"Sarcopterygii",
"Dipnotetrapodomorpha",
"Tetrapoda",
"Amniota",
"Mammalia",
"Theria",
"Eutheria",
"Boreoeutheria",
"Laurasiatheria",
"Carnivora",
"Caniformia",
"Canidae",
"Canis",
"Canis lupus"
],
"ids": [
"NCBI:131567",
"NCBI:2759",
"NCBI:33154",
"NCBI:33208",
"NCBI:6072",
"NCBI:33213",
"NCBI:33511",
"NCBI:7711",
"NCBI:89593",
"NCBI:7742",
"NCBI:7776",
"NCBI:117570",
"NCBI:117571",
"NCBI:8287",
"NCBI:1338369",
"NCBI:32523",
"NCBI:32524",
"NCBI:40674",
"NCBI:32525",
"NCBI:9347",
"NCBI:1437010",
"NCBI:314145",
"NCBI:33554",
"NCBI:379584",
"NCBI:9608",
"NCBI:9611",
"NCBI:9612"
],
"ranks": [
"",
"superkingdom",
"",
"kingdom",
"",
"",
"",
"phylum",
"subphylum",
"",
"",
"",
"",
"",
"",
"",
"",
"class",
"",
"",
"",
"superorder",
"order",
"suborder",
"family",
"genus",
"species"
]
}
}
{
"species": {
"@id": "OTT:247341",
"name": "Canis lupus",
"equivalent_to": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
},
"no rank": {
"@id": "OTT:805080",
"name": ""
},
"domain": {
"@id": "OTT:304358",
"name": "Eukaryota"
},
"kingdom": {
"@id": "OTT:691846",
"name": "Metazoa"
},
"phylum": {
"@id": "OTT:125642",
"name": "Chordata"
},
"subphylum": {
"@id": "OTT:947318",
"name": "Craniata"
},
"superclass": {
"@id": "OTT:278114",
"name": "Gnathostomata"
},
"class": {
"@id": "OTT:458402",
"name": "Sarcopterygii"
},
"subclass": {
"@id": "OTT:229558",
"name": "Theria"
},
"superorder": {
"@id": "OTT:392223",
"name": "Laurasiatheria"
},
"order": {
"@id": "OTT:44565",
"name": "Carnivora"
},
"suborder": {
"@id": "OTT:827263",
"name": "Caniformia"
},
"family": {
"@id": "OTT:770319",
"name": "Canidae"
},
"genus": {
"@id": "OTT:372706",
"name": "Canis"
},
"path": {
"names": [
"",
"",
"Eukaryota",
"Opisthokonta",
"Holozoa",
"Metazoa",
"Eumetazoa",
"Bilateria",
"Deuterostomia",
"Chordata",
"Craniata",
"Vertebrata",
"Gnathostomata",
"Teleostomi",
"Euteleostomi",
"Sarcopterygii",
"Dipnotetrapodomorpha",
"Tetrapoda",
"Amniota",
"Mammalia",
"Theria",
"Eutheria",
"Boreoeutheria",
"Laurasiatheria",
"Carnivora",
"Caniformia",
"Canidae",
"Canis",
"Canis lupus"
],
"ids": [
"OTT:805080",
"OTT:93302",
"OTT:304358",
"OTT:332573",
"OTT:5246131",
"OTT:691846",
"OTT:641038",
"OTT:117569",
"OTT:147604",
"OTT:125642",
"OTT:947318",
"OTT:801601",
"OTT:278114",
"OTT:114656",
"OTT:114654",
"OTT:458402",
"OTT:4940726",
"OTT:229562",
"OTT:229560",
"OTT:244265",
"OTT:229558",
"OTT:683263",
"OTT:5334778",
"OTT:392223",
"OTT:44565",
"OTT:827263",
"OTT:770319",
"OTT:372706",
"OTT:247341"
],
"ranks": [
"no rank",
"no rank",
"domain",
"no rank",
"no rank",
"kingdom",
"no rank",
"no rank",
"no rank",
"phylum",
"subphylum",
"subphylum",
"superclass",
"no rank",
"no rank",
"class",
"no rank",
"superclass",
"no rank",
"class",
"subclass",
"no rank",
"no rank",
"superorder",
"order",
"suborder",
"family",
"genus",
"species"
]
}
}
{
"species": {
"@id": "INAT_TAXON:42048",
"name": "Canis lupus",
"equivalent_to": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
}
}
{
"species": {
"@id": "ITIS:180596",
"name": "Canis lupus",
"equivalent_to": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
},
"kingdom": {
"@id": "ITIS:202423",
"name": "Animalia"
},
"subkingdom": {
"@id": "ITIS:914154",
"name": "Bilateria"
},
"infrakingdom": {
"@id": "ITIS:914156",
"name": "Deuterostomia"
},
"phylum": {
"@id": "ITIS:158852",
"name": "Chordata"
},
"subphylum": {
"@id": "ITIS:331030",
"name": "Vertebrata"
},
"infraphylum": {
"@id": "ITIS:914179",
"name": "Gnathostomata"
},
"superclass": {
"@id": "ITIS:914181",
"name": "Tetrapoda"
},
"class": {
"@id": "ITIS:179913",
"name": "Mammalia"
},
"subclass": {
"@id": "ITIS:179916",
"name": "Theria"
},
"infraclass": {
"@id": "ITIS:179925",
"name": "Eutheria"
},
"order": {
"@id": "ITIS:180539",
"name": "Carnivora"
},
"suborder": {
"@id": "ITIS:552303",
"name": "Caniformia"
},
"family": {
"@id": "ITIS:180594",
"name": "Canidae"
},
"genus": {
"@id": "ITIS:180595",
"name": "Canis"
},
"path": {
"names": [
"Animalia",
"Bilateria",
"Deuterostomia",
"Chordata",
"Vertebrata",
"Gnathostomata",
"Tetrapoda",
"Mammalia",
"Theria",
"Eutheria",
"Carnivora",
"Caniformia",
"Canidae",
"Canis",
"Canis lupus"
],
"ids": [
"ITIS:202423",
"ITIS:914154",
"ITIS:914156",
"ITIS:158852",
"ITIS:331030",
"ITIS:914179",
"ITIS:914181",
"ITIS:179913",
"ITIS:179916",
"ITIS:179925",
"ITIS:180539",
"ITIS:552303",
"ITIS:180594",
"ITIS:180595",
"ITIS:180596"
],
"ranks": [
"Kingdom",
"Subkingdom",
"Infrakingdom",
"Phylum",
"Subphylum",
"Infraphylum",
"Superclass",
"Class",
"Subclass",
"Infraclass",
"Order",
"Suborder",
"Family",
"Genus",
"Species"
]
}
}
{
"species": {
"@id": "IRMNG:11407661",
"name": "Canis lupus",
"equivalent_to": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
},
"kingdom": {
"@id": "IRMNG:11",
"name": "Animalia"
},
"phylum": {
"@id": "IRMNG:148",
"name": "Chordata"
},
"class": {
"@id": "IRMNG:1310",
"name": "Mammalia"
},
"order": {
"@id": "IRMNG:12116",
"name": "Carnivora"
},
"family": {
"@id": "IRMNG:104585",
"name": "Canidae"
},
"genus": {
"@id": "IRMNG:1282727",
"name": "Canis"
},
"path": {
"names": [
"Animalia",
"Chordata",
"Mammalia",
"Carnivora",
"Canidae",
"Canis",
"Canis lupus"
],
"ids": [
"IRMNG:11",
"IRMNG:148",
"IRMNG:1310",
"IRMNG:12116",
"IRMNG:104585",
"IRMNG:1282727",
"IRMNG:11407661"
],
"ranks": [
"kingdom",
"phylum",
"class",
"order",
"family",
"genus",
"species"
]
}
}
{
"species": {
"@id": "GBIF:5219173",
"name": "Canis lupus",
"equivalent_to": {
"@id": "ITIS:180596",
"name": "Canis lupus"
}
},
"kingdom": {
"@id": "GBIF:1",
"name": "Animalia"
},
"phylum": {
"@id": "GBIF:44",
"name": "Chordata"
},
"class": {
"@id": "GBIF:359",
"name": "Mammalia"
},
"order": {
"@id": "GBIF:732",
"name": "Carnivora"
},
"family": {
"@id": "GBIF:9701",
"name": "Canidae"
},
"genus": {
"@id": "GBIF:5219142",
"name": "Canis"
},
"path": {
"names": [
"Animalia",
"Chordata",
"Mammalia",
"Carnivora",
"Canidae",
"Canis",
"Canis lupus"
],
"ids": [
"GBIF:1",
"GBIF:44",
"GBIF:359",
"GBIF:732",
"GBIF:9701",
"GBIF:5219142",
"GBIF:5219173"
],
"ranks": [
"kingdom",
"phylum",
"class",
"order",
"family",
"genus",
"species"
]
}
}
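Because the `path` element holds three parallel arrays, a consumer will probably want to zip them back into one entry per taxon and smooth over the rank spelling differences visible above ("Kingdom" vs "kingdom", "" vs "no rank"). A minimal Python sketch follows; `path_entries` and the `UNRANKED` set are assumptions for illustration, and the sample paths are abbreviated from the output.

```python
# Spellings treated as "no rank" here; this set is an assumption.
UNRANKED = {"", "no rank", "norank"}


def path_entries(record):
    """Zip the parallel path arrays into one dict per taxon, lowercasing ranks."""
    path = record["path"]
    ids, names, ranks = path["ids"], path["names"], path["ranks"]
    if not (len(ids) == len(names) == len(ranks)):
        raise ValueError("path arrays differ in length")
    entries = []
    for taxon_id, name, rank in zip(ids, names, ranks):
        normalized = rank.strip().lower()
        entries.append({
            "@id": taxon_id,
            "name": name,
            "rank": None if normalized in UNRANKED else normalized,
        })
    return entries


# Abbreviated paths taken from the ITIS and NCBI records above.
itis_record = {"path": {
    "names": ["Animalia", "Bilateria", "Canis lupus"],
    "ids": ["ITIS:202423", "ITIS:914154", "ITIS:180596"],
    "ranks": ["Kingdom", "Subkingdom", "Species"],
}}
ncbi_record = {"path": {
    "names": ["Eukaryota", "Opisthokonta", "Metazoa"],
    "ids": ["NCBI:2759", "NCBI:33154", "NCBI:33208"],
    "ranks": ["superkingdom", "", "kingdom"],
}}

for entry in path_entries(itis_record) + path_entries(ncbi_record):
    print(entry)
```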
@cboettig I'd like to better understand the future use of this feature. Ideally, this would help to get people to use Nomer, so that feedback loops can be established.
@jhpoelen Fair question; I'm still experimenting here myself so I don't know the answer entirely. My thoughts / premise so far:
tsv files while also remaining an explicitly valid linked-data format that's directly poised for applying semantic tooling like SPARQL, and also web-friendly.
So those are pretty vague thoughts at this point, but this data seems to be about the right complexity (simple but not trivial) to dive into this exploration.
Did that make any sense?
@jhpoelen In the spirit of minimal / mobile tooling, I was wondering if it would be possible to expose all of the nomer data as a single RDF dump (or, more developer-friendly, as a JSON-LD-formatted json-stream object?). Or maybe you already do something like this?