
The PELAGIOS Cookbook

Advice: Open Context data for Pelagios #3


ekansa commented 8 years ago

Hi,

I want some advice on how to provide Open Context data for Pelagios. Open Context mainly publishes "fine-grained" descriptions of objects and archaeological contexts. A few objects published by Open Context link to places in gazetteers. For example, this coin has a mint that's in Pleiades:

http://opencontext.org/subjects/766ED8AA-3147-4CA2-6ECA-BD96BE0433BE

But that's a rare case. Most items (like animal bones, potsherds, etc.) do not directly link to places in gazetteers; instead, they relate indirectly. For instance, this link describes a site in Open Context: http://opencontext.org/subjects/871B9EF8-BC68-4190-5F8A-00882C0040A4

The site links to a place in Pleiades. However, that site in Open Context has thousands of excavation contexts, artifacts, bones etc. It also has thousands of images and documents associated with it, see:

http://opencontext.org/search/Italy/Poggio+Civitate#15/43.1526/11.4093/18/tile/Google-Satellite

So what would be the best way to expose this variety of materials to Pelagios? I'm sure Pelagios would not want to have a URI for each animal bone found at a gazetteer-linked site (or do you???). However, it may be useful to Pelagios users to note that we have 40,000 pictures, 15,000 data records, and 3000 documents that describe materials found at a gazetteer-linked site.

I can link to the query results for these big aggregates of data records, media items, and documents that are from this site. But these links are URLs, not really proper URIs. Would that matter for Pelagios? Should I make an annotation for a URL like:

http://opencontext.org/search/Italy/Poggio+Civitate?type=documents

The link above can have an annotation that relates it to the Pleiades place that relates to the archaeological site Poggio Civitate in Open Context. However, is this an improper use of Pelagios?
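To make the question concrete, here's a minimal sketch in Python (with rdflib) of what such an annotation might look like, assuming the Open Annotation pattern Pelagios consumes — the target being the Open Context query URL and the body being the gazetteer place URI. The Pleiades ID below is a placeholder, not the real one for Poggio Civitate:

```python
from rdflib import BNode, Graph, Namespace, URIRef
from rdflib.namespace import RDF

OA = Namespace("http://www.w3.org/ns/oa#")

# Target: the Open Context query URL; body: the Pleiades place URI.
target = URIRef("http://opencontext.org/search/Italy/Poggio+Civitate?type=documents")
body = URIRef("https://pleiades.stoa.org/places/000000")  # placeholder place ID

g = Graph()
g.bind("oa", OA)
ann = BNode()
g.add((ann, RDF.type, OA.Annotation))
g.add((ann, OA.hasTarget, target))
g.add((ann, OA.hasBody, body))

print(g.serialize(format="turtle"))
```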

Or should I output annotations for a few hundred thousand URIs that describe bones, pots, etc. from the 50 or so sites I've got linked up to gazetteers?

Thanks in advance for the help!

rsimon commented 8 years ago

Hm - it's a really good question. I'll forward this on to the rest of the team via e-mail as well. They're not too GitHub-native ;-)

Simple things first: in principle, the Pelagios data model does distinguish between Places and Things (related to places). So technically, nothing speaks against publishing every single item record as Pelagios RDF. (Another side issue, by the way: since you have your places represented explicitly in OpenContext, it would make sense to expose those as a gazetteer (linked to Pleiades) first, and then link the objects to your own OpenContext place IDs. That might give us a slightly cleaner model, mirroring more directly how things are organized in OpenContext - and potentially make things slightly easier for you?)

As for the granularity of the object data... That's a much harder question to answer. (Elton & Leif - do chip in!)

My gut feeling is we may want to look at different categories of data, and then decide which level of granularity would be most suitable (followed by an assessment of how feasible it is for you to export this as Pelagios RDF then...)

At the end of the day, Pelagios aims to be about discoverability though. So it's probably better to stay at a slightly more "overview" level. It's probably of limited use to represent each animal bone at a site explicitly, as you say - those might be better published to Pelagios as a single "item" (i.e. an ensemble of X number of objects, with a number of images, and a link to the corresponding page on OpenContext). Other things, such as Trench Books, on the other hand might be interesting to have represented as individual objects (with, say, one representative image URL published to Pelagios - but not all pages.)

Hm. Not sure. But anyways: to make things practical, we probably need to aim at a uniform approach that works across all your types of data anyway, I guess. We can always look into tweaking it later.

Is there perhaps an object type vocabulary in OpenContext we could use to divide things up? And then expose the ensemble of all objects in each category, at one place as one Pelagios object? ("Pottery at...", "Trench Books at...", etc.). Plus add a simple textual description and one or a handful of images?
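Just to make the idea concrete, a rough sketch in Python of that grouping step — the record fields and URIs here are hypothetical, purely for illustration:

```python
# Group item records by (gazetteer place, object type) and emit one
# aggregate "object" per group, e.g. "Pottery at <place>".
from collections import defaultdict

records = [
    {"place": "pleiades:000000", "type": "Pottery", "uri": "oc:item/1"},
    {"place": "pleiades:000000", "type": "Pottery", "uri": "oc:item/2"},
    {"place": "pleiades:000000", "type": "Trench Book", "uri": "oc:item/3"},
]

ensembles = defaultdict(list)
for rec in records:
    ensembles[(rec["place"], rec["type"])].append(rec["uri"])

for (place, obj_type), items in ensembles.items():
    # One Pelagios object per category at each place.
    print(f"{obj_type} at {place}: {len(items)} items")
```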

Conal-Tuohy commented 8 years ago

To reiterate my comment on Twitter: I'd recommend an approach where you dump all your JSON-LD graphs into a SPARQL graph store, and then write some SPARQL CONSTRUCT queries to mediate between your RDF and Pelagios's RDF both in terms of RDF vocab and also in terms of being able to dynamically choose an appropriate granularity for a given site and object type (whatever rules you and Pelagios agree on). I recommend the approach because it loosely couples the two systems, using technology that is designed precisely for that purpose.
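For instance, a minimal sketch in Python of posting such a CONSTRUCT query to a store — the endpoint URL and the predicates in the WHERE clause are placeholders, not Open Context's actual terms:

```python
# Reshape triples from the store's (placeholder) vocabulary into
# Pelagios-style Open Annotation triples via SPARQL CONSTRUCT.
import requests

ENDPOINT = "http://example.com/fuseki/sparql"  # hypothetical endpoint

query = """
PREFIX oa: <http://www.w3.org/ns/oa#>
PREFIX ex: <http://example.org/oc-vocab#>

CONSTRUCT {
  [] a oa:Annotation ;
     oa:hasTarget ?item ;
     oa:hasBody ?place .
}
WHERE {
  ?item ex:foundAt ?site .
  ?site ex:closeMatch ?place .
}
"""

resp = requests.post(
    ENDPOINT,
    data={"query": query},
    headers={"Accept": "text/turtle"},
)
print(resp.text)
```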

ekansa commented 8 years ago

@rsimon (1) On the side issues, yes, we have LOTS of entities that essentially make up a gazetteer in themselves. I do want to expose those to Pelagios also.

(2) OK! Yes, we do have a controlled vocabulary for item types, so I can easily generate items that reflect an ensemble of different types that relate to each gazetteer-referenced entity in Open Context. I'll wait for advice from everyone else about exactly what should be offered to Pelagios on an item-by-item basis and what should be offered in ensemble groupings.

@Conal-Tuohy: I need to seriously consider this approach. I'm not maintaining a triple store for Open Context at the moment, since for a variety of reasons (easier tracking of provenance, versioning, certainty, internal admin maintenance, etc.) we don't use triples as our metamodel internally. Currently we use Postgres and Apache Solr (for indexing and queries). The main emphasis on RDF for us centers on interfacing with the outside world.

Setting up a triple store would be pretty involved, since it would essentially replicate all the data we have in Postgres all over again. At the same time, offering a SPARQL endpoint to Open Context may be useful to a few people (I have no idea how many; we're only slowly getting more use of our GeoJSON API). And we could use the same thing to expose data for Pelagios, as you say.

I'm quite sure I can generate the Pelagios annotations via our JSON API (run by Solr) quite easily. The OAI-PMH service is also based on the JSON API. So, I'm hesitant to set up more infrastructure to maintain a triple store and keep it in sync with everything else unless I'm sure there will be clear demand for its use beyond Pelagios. Thoughts?

Conal-Tuohy commented 8 years ago

A SPARQL graph store is an extra component, and obviously that is automatically a cost that counts against it (though e.g. deploying fuseki.war is not terribly involved). To me the crucial advantage is to move the "translation" issues (including fiddling with granularity) into a different conceptual space (a space of RDF graphs) where an appropriate technique (SPARQL queries) can be used to express a solution. The comparative advantages of the SPARQL language for that task (simplicity, clarity, brevity, and modifiability) accrue from SPARQL's appropriateness to that problem domain (graph processing), and my thinking was that that advantage would more than compensate for the additional costs of maintaining one more copy of your data in yet another kind of data store.

The value of a SPARQL-based solution depends on how complex the OC→Pelagios translation would be, but given what @rsimon says above it sounds like there are a number of issues to grapple with and a feeling that some amount of experimentation might be needed; these both strengthen the case for it, I'd suggest.

rsimon commented 8 years ago

@Conal-Tuohy That could certainly be an option. But I agree that the cost/benefit ratio depends entirely on how complex the translation would be.

The Pelagios model is really simple: hardly more than a flat list of object metadata records with place references attached to them. If the OpenContext data is already in a form where the conversion step is, perhaps, just something like a simple aggregation/faceting operation in SOLR, then the benefit of a small (XSL or scripted) transform will almost certainly outweigh the added flexibility of a full extract-transform-load setup & SPARQL. (In fact I think there are several partners now that produce Pelagios RDF directly from SOLR AFAIK.)

A different question, of course, is whether you (@ekansa) see other uses for SPARQL in OC beyond Pelagios... as far as that is concerned, I fully agree that running a triple store nowadays isn't such a big deal anymore. (You'll still need to set up the pipeline that transforms the OC data to RDF in the first place, obviously. But that might work along similar lines as the Pelagios export.)

FWIW:

> The main emphasis on RDF for us centers on interfacing with the outside world.

In fact it works exactly the same way with all the stuff I've been building for Pelagios. Some combination of SOLR/Lucene/Postgres/ElasticSearch underneath the hood, and "RDF at the edges".

Conal-Tuohy commented 8 years ago

@rsimon the OC data is already expressed in RDF using JSON-LD; that's in fact what prompted my suggestion of SPARQL as the transformation language. But I do agree it's a judgement call based on weighing up a bunch of factors, one of which is to what extent a SPARQL store would be useful and usable for other purposes. I would be surprised if there weren't some other interesting questions you could ask of the data via a SPARQL query endpoint.

rsimon commented 8 years ago

Ah, I missed the point of OC already being available as JSON-LD. Indeed - that makes the OC->triple store transfer straightforward. I agree there'll certainly be interesting uses for SPARQL'ing into OC data!

ekansa commented 8 years ago

@Conal-Tuohy @rsimon

All good points. I'm looking into using one of our servers to run Jena (or just deploying a new server for that). As an extension of capabilities it may be really useful. So, I'm reading documentation now.

In the short term, the main issue will be loading some 1 million+ JSON-LD documents into Jena. That will take some time (as I know from doing the same thing with Solr, which involves crawling all of our JSON-LD docs and mapping each to our Solr schema for indexing). At least Jena would just read the JSON-LD, with less complexity than Solr indexing.

If playing with Jena gives me some grief, I may start by hacking together something in Python using our current search API to spit out Pelagios annotations.

Conal-Tuohy commented 8 years ago

@ekansa are you looking at Fuseki? If not, it's worth a look. Fuseki is the Jena component that implements the (RESTful) SPARQL Graph Store Protocol. Fuseki version 2 comes packaged as a Java WAR file. With the GSP, you can just use HTTP PUT to store each graph. I think you should be able to just take the URI of each JSON-LD graph, construct a new URI by URL-encoding it and appending it to the URI of the graph store (something like "http://example.com/fuseki/data?graph="), and then do an HTTP PUT of the JSON-LD to that URI. I know Fuseki is supposed to support JSON-LD along with other RDF formats, so it SHOULD be as simple as that.
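In Python it could be as little as the following (an untested sketch; the store URL is the placeholder from above):

```python
# PUT each JSON-LD document into the graph store as a named graph,
# using the document's own URI as the graph name (SPARQL Graph Store Protocol).
import requests

STORE = "http://example.com/fuseki/data"  # graph store URL (placeholder, as above)

def put_graph(graph_uri: str, jsonld_text: str) -> None:
    """PUT one JSON-LD document into the store, named by its own URI."""
    resp = requests.put(
        STORE,
        params={"graph": graph_uri},  # requests URL-encodes the graph name
        data=jsonld_text.encode("utf-8"),
        headers={"Content-Type": "application/ld+json"},
    )
    resp.raise_for_status()

# e.g. put_graph("http://opencontext.org/subjects/871B9EF8-BC68-4190-5F8A-00882C0040A4", doc)
```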

You may well be onto it already, but I thought I'd put in a plug for Fuseki (in contrast to Jena's Java APIs) because it is quite convenient and will be very like how you're using Solr already.

ekansa commented 8 years ago

@Conal-Tuohy oh yes! I saw that in your original comment on Twitter. The approach you mentioned is the best. I still need a crawler to grab all the JSON-LD documents, but I do already have one for Solr indexing.

Still thinking about how to do all of this. Got side-tracked with meetings however...

ekansa commented 8 years ago

Hi all, OK. I've got some work done on getting Open Context expressing data for Pelagios consumption. Here is a directory where I've put some draft examples and explanations: https://github.com/ekansa/open-context-py/tree/master/pelagios-examples