ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML

Utility to convert between schema.org <--> EML #218

Open cboettig opened 7 years ago

cboettig commented 7 years ago

Add function to generate schema.org markup from EML. Ideally this would be done in a language-agnostic rather than R-specific way, perhaps an XSLT stylesheet maintained by the EML team.

Google has announced its interest in improving data discovery, and lists Dataset as one of a rather small handful of schema.org types that make a tangible difference to how it presents results (along with the markup that shows information such as business hours when you search for a business); see https://developers.google.com/search/docs/data-types/datasets. So while it's clearly early days and this might never take off, I think it's notable that this type is highlighted in this way (in contrast to, say, SoftwareApplication or SoftwareSourceCode). Obviously the vocabulary is far more limited, but it is not completely blockheaded (e.g. it includes relatively rich notions of spatial and temporal coverage and variableMeasured, and it encourages using DOIs for identifiers).

Could potentially improve data discovery of data documented in emldown pages too.

amoeba commented 6 years ago

This was a cool idea when I 👍 'd it a while back and I wouldn't mind revisiting it and getting it done.

> Ideally this would be done in a language-agnostic rather than R-specific way, perhaps an XSLT stylesheet maintained by the EML team.

What did you have in mind here? I'm scratching my head thinking of a way to make XSLT produce JSON-LD. I agree that it would be handy to have something lang-agnostic but at least having a reference implementation would be helpful.

mbjones commented 6 years ago

I did an experiment with our MetacatUI display of EML and other standards about 18 months ago where I embedded schema.org Dataset tags in our dataset landing pages (the #view service). It was fairly straightforward, except that MetacatUI is a single page app that loads the landing page only via javascript, and so was not visible to most clients. We need a way to generate this via a URI call to the dataset landing page that pre-loads the info before the javascript is executed. Let's discuss, this may be a great GeoLink project @amoeba, possibly generating new landing pages and using a new canonical dataset URI-space for DataONE.
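
A rough illustration of the pre-rendering idea described above: the server builds the schema.org JSON-LD and writes it into the static HTML before any JavaScript runs, so clients that do not execute scripts still see it. This is a minimal Python sketch, not MetacatUI code; the `render_landing_page` function, the `title`, `abstract`, and `doi` keys, and the sample values are all hypothetical names chosen for the example.

```python
import json
from html import escape


def render_landing_page(dataset: dict) -> str:
    """Return static HTML with schema.org JSON-LD embedded in <head>.

    `dataset` is a hypothetical dict with 'title', 'abstract' and 'doi' keys.
    """
    jsonld = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": dataset["title"],
        "description": dataset["abstract"],
        "identifier": dataset["doi"],
    }
    # Escape "</" so the payload can never close the script element early.
    payload = json.dumps(jsonld, indent=2).replace("</", "<\\/")
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        f"<title>{escape(dataset['title'])}</title>\n"
        f'<script type="application/ld+json">\n{payload}\n</script>\n'
        "</head>\n<body></body>\n</html>\n"
    )


page = render_landing_page(
    {
        "title": "Example dataset",          # placeholder values only
        "abstract": "A short description.",
        "doi": "doi:10.xxxx/example",
    }
)
```

Because the markup is part of the initial response, it does not depend on the single-page app having loaded.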

cboettig commented 6 years ago

@amoeba @mbjones Thanks for the feedback.

Re implementation, I agree a pure XSLT strategy probably isn't a good idea; haven't really thought this through. What I'd like though is some way that captures the mapping between the two vocabularies in a language agnostic way, that could then be implemented in R or another language.

If all the mappings were 1:1, this would be easy; e.g. as a crosswalk table / csv file, or what json-ld does with context files. Unfortunately, it's not obvious to me how to do this when the mapping is more complex. For example, converting this structure to EML:

  "spatialCoverage": {
       "@type": "Place",
       "geo": {
         "@type": "GeoCoordinates",
         "latitude": 39.3280,
         "longitude": 120.1633
       }
     }

cannot simply be captured in a crosswalk. It seems like there should still be a good abstraction of how this maps into EML that can be expressed in something not entirely language-specific (maybe a query language, like a set of jq commands or XPaths), but I dunno. (Surely CS has some theory about mappings between graphs / trees in this context, if only to say "it can't work, dummy".) Would love any suggestions.
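
To make the contrast concrete, here is a small Python sketch in which the 1:1 properties go through a flat crosswalk table while the nested `spatialCoverage` needs its own function. The EML element names (`geographicCoverage`, `boundingCoordinates`, and the four bounding-coordinate elements) come from the EML schema; the function names, the dict-based representation of EML, and the "Point location" placeholder description are assumptions made for illustration only.

```python
# Flat 1:1 crosswalk: schema.org property -> EML element name.
CROSSWALK = {"name": "title", "description": "abstract"}


def point_to_geographic_coverage(spatial_coverage: dict) -> dict:
    """Map a schema.org Place with GeoCoordinates to an EML-style coverage dict.

    A single point becomes a degenerate bounding box (west == east,
    north == south), which is a structural change no flat table can express.
    """
    geo = spatial_coverage["geo"]
    lat, lon = str(geo["latitude"]), str(geo["longitude"])
    return {
        "geographicCoverage": {
            # Placeholder text: EML expects a description, schema.org may not have one.
            "geographicDescription": spatial_coverage.get("name", "Point location"),
            "boundingCoordinates": {
                "westBoundingCoordinate": lon,
                "eastBoundingCoordinate": lon,
                "northBoundingCoordinate": lat,
                "southBoundingCoordinate": lat,
            },
        }
    }


def schema_org_to_eml(doc: dict) -> dict:
    """Apply the flat crosswalk, then the hand-written nested rules."""
    eml = {CROSSWALK[key]: value for key, value in doc.items() if key in CROSSWALK}
    if "spatialCoverage" in doc:
        eml["coverage"] = point_to_geographic_coverage(doc["spatialCoverage"])
    return eml


example = {
    "name": "Example dataset",
    "spatialCoverage": {
        "@type": "Place",
        "geo": {"@type": "GeoCoordinates", "latitude": 39.3280, "longitude": 120.1633},
    },
}
eml = schema_org_to_eml(example)
```

The table handles renames; anything that changes shape ends up as code, which is the part that is hard to keep language-agnostic.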

mbjones commented 6 years ago

This is a classic schema alignment problem, and one that has been explored in depth in CS. I particularly like the Sheth and Larson 1990 overview of federated database systems, but there's also a high-level overview on Wikipedia on Schema Matching, with links to related topics like data mapping and ontology alignment.

The field has now shifted focus to ontology alignment, which is a larger and more complicated topic, as it includes additional facets that must be considered to determine logical equivalence.

There are a ton of tools in this space.

cboettig commented 6 years ago

Thanks Matt! This is just what I was looking for.

Would really love to hear your take on this space at some stage; e.g. I gather my earlier tongue-in-cheek suggestion that we just turn it all to RDF and use ontology tools is hopelessly impractical, but that there should be plenty that is useful & practical in this space (e.g. I'm assuming this particular mapping between EML->schema.org, while not trivial/flat, also probably doesn't involve some of the worst / undecidable pathologies that are possible in the abstract schema mapping problem). How have you approached these issues with previous mappings (e.g. the existing XSLT transforms available in EML to other XML-based schemas)? Or does it always come down to brute force in the end?

mbjones commented 6 years ago

We've brute-forced it with explicit mappings in XSLT. Given how limited the schema.org Dataset vocabulary is (or at least needs to be), I'm sure a manual mapping would be by far the fastest route, compared to trying to come up with some sort of general approach.
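
For a sense of what such a manual mapping can look like outside XSLT, here is a minimal Python sketch using only the standard library. It is not the XSLT used by the EML team and not any project's actual code; the `eml_to_schema_org` function, the choice of fields, and the sample document are assumptions for illustration, and the sample is not a complete, valid EML record.

```python
import xml.etree.ElementTree as ET

# Tiny illustrative EML fragment (placeholder values, not schema-valid).
SAMPLE_EML = """
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
         packageId="doi:10.xxxx/example" system="example">
  <dataset>
    <title>Example river discharge data</title>
    <creator>
      <individualName><givenName>Jane</givenName><surName>Doe</surName></individualName>
    </creator>
    <keywordSet><keyword>rivers</keyword><keyword>hydrology</keyword></keywordSet>
    <coverage>
      <temporalCoverage>
        <rangeOfDates>
          <beginDate><calendarDate>2001-01-01</calendarDate></beginDate>
          <endDate><calendarDate>2017-12-31</calendarDate></endDate>
        </rangeOfDates>
      </temporalCoverage>
    </coverage>
  </dataset>
</eml:eml>
"""


def eml_to_schema_org(eml_xml: str) -> dict:
    """Hand-written, field-by-field mapping from EML XML to a schema.org Dataset."""
    root = ET.fromstring(eml_xml)
    dataset = root.find("dataset")
    out = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "identifier": root.get("packageId"),
        "name": dataset.findtext("title"),
    }
    creators = []
    for creator in dataset.findall("creator"):
        given = creator.findtext("individualName/givenName", default="")
        sur = creator.findtext("individualName/surName", default="")
        creators.append({"@type": "Person", "name": f"{given} {sur}".strip()})
    out["creator"] = creators
    out["keywords"] = [kw.text for kw in dataset.iter("keyword")]
    dates = "coverage/temporalCoverage/rangeOfDates"
    begin = dataset.findtext(f"{dates}/beginDate/calendarDate")
    end = dataset.findtext(f"{dates}/endDate/calendarDate")
    if begin and end:
        # schema.org temporalCoverage accepts an ISO 8601 interval.
        out["temporalCoverage"] = f"{begin}/{end}"
    return out


result = eml_to_schema_org(SAMPLE_EML)
```

Each schema.org property gets one explicit rule, which keeps the mapping easy to review even though it does not generalize.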

amoeba commented 6 years ago

That's how I feel too. I can send over a PR this weekend. Should give us something concrete to look over.

amoeba commented 6 years ago

This was an interesting foray. Some initial observations:

Here's a gist with my script: https://gist.github.com/amoeba/0b6f5cb497b3cf41113fdcaf049b7679

I can PR if it looks like a good start/direction and then I can make it better/safer/whatever later and add some tests.

cboettig commented 6 years ago

@amoeba Very cool, still need to find some time to take a closer look at this.

Re contact, I believe the term you want is contactPoint as a property of a creator, e.g. see the full example from Google's page on Dataset:

"creator":{
     "@type":"Organization",
     "url": "https://www.ncei.noaa.gov/",
     "name":"OC/NOAA/NESDIS/NCEI > National Centers for Environmental Information, NESDIS, NOAA, U.S. Department of Commerce",
     "contactPoint":{
        "@type":"ContactPoint",
        "contactType": "customer service",
        "telephone":"+1-828-271-4800",
        "email":"ncei.orders@noaa.gov"
     }
  }

It's at the link called Markup, which I cannot seem to copy and paste as a real link... Of course, Google's own example there does not pass Google's own validator, which for some reason believes creator must have a URL. Worse, it seems like Google has a controlled vocabulary for the values that contactType can take (e.g. "customer service" is not what we usually have in mind), but I cannot find the actual schema file Google is using in the Structured Data Testing Tool (SDTT), nor can I find comprehensive documentation anywhere.

Yeah, good question about whether one should bother defining S4 classes for the schema.org object types; it's certainly more expedient not to. A little voice at the back of my head always seems to tell me that the S4 class structure, as rich as it is, really doesn't have a perfect one-to-one correspondence with XML, and so I'm not really sure it's the best thing to represent schema classes anyhow (e.g. obvious hacks with repeated elements, ordered vs unordered elements, the way class inheritance works, etc). Thankfully, JSON doesn't have the same complexity, so a 1:1 correspondence to S4 (or even S3) would probably be pretty easy to do. Overall, having a stand-alone native representation could be good, though maybe that is overkill and all one really needs is a list object.

cboettig commented 6 years ago

Re abstract, mapping to description makes sense to me. I think you're asking what to do about the extra markup of EML TextType? Any of the solutions seem fine to me: pandoc-ing out the Docbook / TextType markup, just stripping the html elements, or even just leaving them in.
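
As an illustration of the simplest of those options (stripping the markup and keeping only the text), here is a short Python sketch using the standard library. It is a stand-in for, not an equivalent of, the Pandoc route; the `abstract_to_description` function and the sample abstract are hypothetical, and paragraphs nested inside other paragraphs would be repeated by this naive version.

```python
import xml.etree.ElementTree as ET


def abstract_to_description(abstract_xml: str) -> str:
    """Flatten an EML TextType <abstract> into plain text for 'description'."""
    abstract = ET.fromstring(abstract_xml)
    # Use <para> elements when present; otherwise treat the abstract as one block.
    paragraphs = abstract.findall(".//para") or [abstract]
    cleaned = []
    for para in paragraphs:
        # itertext() yields the text of the element and all inline children,
        # so tags such as <emphasis> are dropped while their content is kept.
        text = " ".join("".join(para.itertext()).split())
        if text:
            cleaned.append(text)
    return "\n\n".join(cleaned)


sample = (
    "<abstract>"
    "<para>Discharge records for <emphasis>several</emphasis> rivers.</para>"
    "<para>Collected   2001 to 2017.</para>"
    "</abstract>"
)
description = abstract_to_description(sample)
```

Whitespace is collapsed and paragraphs are separated by a blank line, which is usually enough for a schema.org `description` string.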

amoeba commented 6 years ago

Ah yes, sorry for the lack of clarity. Yes, that's what I'm thinking. I'll give that a whirl and see if I/we are satisfied. Initial tests of Pandoc's plaintext output format are positive.

amoeba commented 6 years ago

I wrapped up a minimum viable first version and sent a PR #228. I think it's good enough to kick the tires on but not good enough to release into the wild (yet!). Thoughts/comments welcome.

cboettig commented 5 years ago

Still interested in this, still not sure of the best way to go about it. I had thought we might pull this off just by writing a JQ mapping, but I now think JQ isn't rich enough for that. An R-based implementation like your previous PR, but based on the new S3/list objects instead, may be the most expedient way to go. @mbjones @amoeba any updates as to how you're handling this on your end?

mbjones commented 5 years ago

Bryce has now completed and deployed a schema.org transformation for all DataONE data sets, regardless of metadata standard used. It is pretty high-level, but a good start. Any landing page on DataONE will contain it in a script element. For example, here's the JSON-LD for an example EML dataset (https://search.dataone.org/view/doi:10.18739/A2P55DG5N):

{
    "@context": {
        "@vocab": "http://schema.org"
    },
    "@type": "Dataset",
    "@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2P55DG5N",
    "datePublished": "2018-01-01T00:00:00Z",
    "publisher": "Arctic Data Center",
    "identifier": "doi:10.18739/A2P55DG5N",
    "url": "https://dataone.org/datasets/doi%3A10.18739%2FA2P55DG5N",
    "schemaVersion": "eml://ecoinformatics.org/eml-2.1.1",
    "name": "River discharge data, National Petroleum Reserve, Alaska, 2001-2017",
    "creator": ["Richard Kemnitz", "Christopher Arp", "Matthew Whitman", "Dragos Vas"],
    "citation": "Richard Kemnitz, Christopher Arp, Matthew Whitman, and Dragos Vas. 2018. River discharge data, National Petroleum Reserve, Alaska, 2001-2017. Arctic Data Center. doi:10.18739/A2P55DG5N.",
    "spatialCoverage": {
        "@type": "Place",
        "additionalProperty": [{
            "@type": "PropertyValue",
            "additionalType": "http://dbpedia.org/resource/Coordinate_reference_system",
            "name": "Coordinate Reference System",
            "value": "http://www.opengis.net/def/crs/OGC/1.3/CRS84"
        }],
        "geo": {
            "@type": "GeoCoordinates",
            "latitude": 70.27,
            "longitude": -151.87
        },
        "subjectOf": {
            "@type": "CreativeWork",
            "fileFormat": "application/vnd.geo+json",
            "text": "{\"type\":\"Point\",\"coordinates\":[-151.87,70.27]}"
        }
    },
    "temporalCoverage": "2001-01-01T00:00:00Z/2017-12-31T00:00:00Z",
    "variableMeasured": ["Date", "river", "discharge", "latitude", "longitude"],
    "description": "Bureau of Land Management hydrologist Richard Kemnitz collected discharge records from 2001 to 2017 for several rivers in the National Petroleum Reserve in Alaska (NPR-A) in cooperation with the U.S. Geological Survey, University of Alaska Fairbanks, the Arctic Landscape Conservation Cooperative, and the National Science Foundation. These valuable records for a remote roadless region of the Arctic represent the hydrological response of watersheds in a region undergoing expanded development by the petroleum industry and over a time period of notable climate change in Arctic Alaska. Hydrologic datasets such as these are also being utilized for a new NSF funded project titled \"Causes and Consequences of Catastrophic Lake Drainage in an Evolving Arctic System (OPP-1806287)\" to assess changes in arctic hydrology due to flood events generated from drained thermokarst lake basins.",
    "keywords": "arctic, rivers, watersheds, Alaska, hydrology"
}
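
One detail in the record above is that the point appears twice: as `geo` (latitude, longitude) and as a GeoJSON string in `subjectOf`, where the coordinate order is longitude first. A minimal Python sketch of building that block follows; it is not DataONE's implementation, and the `point_spatial_coverage` function name is an assumption for illustration.

```python
import json


def point_spatial_coverage(latitude: float, longitude: float) -> dict:
    """Build a schema.org Place for a point, with a GeoJSON copy in subjectOf."""
    # GeoJSON orders coordinates as [longitude, latitude].
    geojson = {"type": "Point", "coordinates": [longitude, latitude]}
    return {
        "@type": "Place",
        "geo": {
            "@type": "GeoCoordinates",
            "latitude": latitude,
            "longitude": longitude,
        },
        "subjectOf": {
            "@type": "CreativeWork",
            "fileFormat": "application/vnd.geo+json",
            # Compact separators give a single-line string without spaces.
            "text": json.dumps(geojson, separators=(",", ":")),
        },
    }


coverage = point_spatial_coverage(70.27, -151.87)
```

Serializing the GeoJSON to a string keeps it opaque to JSON-LD processors, so it is carried along as text rather than interpreted as schema.org terms.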

amoeba commented 5 years ago

@mbjones is right!

I think this would be a good addition to this package. I think it wouldn't be a ton of work to rewrite https://github.com/ropensci/EML/pull/228. Do you/we have an upcoming release of this package to target this issue for?

cboettig commented 5 years ago

@amoeba that would be awesome.

I don't have a strict timeline for the upcoming release, though I've had the vague notion of trying to coincide with whenever the EML schema 2.2.0 becomes officially released (or soon after). I need emld on CRAN before I can release EML 2.0 to CRAN; you may have seen I've just put it into rOpenSci onboarding now.

With the semester wrapping up I'm hoping to make a push to get some packages over the hump before we start up again mid-January, ideally with a short methods paper about the EML package as well on which I'm hoping to have you and @mbjones as co-authors.

amoeba commented 5 years ago

That sounds great. I'm busy until about mid-December but I might squeeze it in somewhere before then.

yvanlebras commented 4 years ago

Really interested in this work, together with @earnaud.

jmlord commented 1 year ago

Hi, we'd be interested as well. Any news on this topic (or on other external tools that have proved to do it correctly)?

mbjones commented 1 year ago

@jmlord There will be a discussion about converting EML to schema.org during the next science-on-schema.org ESIP cluster call, this Thursday, April 27, at 2:30pm EDT. @clnsmth has been working on this over in https://github.com/ESIPFed/science-on-schema.org/issues/238. Here's the agenda with connection info:

https://docs.google.com/document/d/1tIlDVnKeocO1E_SSbNaldv0avORfGFdmYDNk_3ub6ik

jmlord commented 1 year ago

@mbjones What timing! I cannot be there myself, but I will try to find someone else in the project to join. Thanks for letting me know.