hbz / lobid-gnd

UI and API to the Integrated Authority File (Gemeinsame Normdatei, GND)
http://lobid.org/gnd
Eclipse Public License 2.0

Test serialization of GND RDF-XML to compact JSON-LD #1

Closed fsteeg closed 7 years ago

fsteeg commented 7 years ago

Both dumps and updates (via OAI) are available as RDF-XML, so that would be a suitable source format:

http://datendienst.dnb.de/cgi-bin/mabit.pl?userID=opendata&pass=opendata&cmd=login
http://www.dnb.de/DE/Service/DigitaleDienste/OAI/oai_node.html (see "Formate")

We should test serializing that RDF-XML as compact JSON-LD using the entityfacts context:

http://hub.culturegraph.org/entityfacts/context/v1/entityfacts.jsonld
http://hub.culturegraph.org/entityfacts/118540238

If the result looks good, this might be the format to index in Elasticsearch. We might have to do some preprocessing to make sure the values always have the same type (see footnote 1 in http://blog.lobid.org/2017/06/08/lobid-api-why-how.html about compact JSON-LD serialization in Elasticsearch).
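One way values can end up with differing types is that JSON-LD compaction may turn a single-element array into a bare value, so the same field is a string in one record and an array in another. A minimal sketch of such a preprocessing step, assuming we simply wrap every non-keyword value in a list before indexing (the function name `ensure_arrays` and the shortened sample record are made up for illustration, not taken from any lobid code):

```python
import json


def ensure_arrays(node):
    """Recursively wrap every property value in a list.

    JSON-LD keywords (keys starting with "@") are left as they are.
    """
    if isinstance(node, list):
        return [ensure_arrays(item) for item in node]
    if not isinstance(node, dict):
        return node
    result = {}
    for key, value in node.items():
        value = ensure_arrays(value)
        if key.startswith("@") or isinstance(value, list):
            result[key] = value
        else:
            result[key] = [value]
    return result


# Hypothetical compacted record in which single values were collapsed
doc = {
    "@id": "http://d-nb.info/gnd/2047974-8",
    "gndIdentifier": "2047974-8",
    "homepage": {"@id": "https://www.hbz-nrw.de/"},
}
print(json.dumps(ensure_arrays(doc), indent=2))
```

Whether always-arrays or always-single-values is the better convention would depend on how the fields are queried later; the sketch only shows that the normalization itself is cheap.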

acka47 commented 7 years ago

For testing the quality of the JSON-LD output you should take a look at entities with geo coordinates (which are added via a bnode). For example http://d-nb.info/gnd/4074335-4 (ttl). See the issue at https://github.com/lobid/lodmill/issues/503.

fsteeg commented 7 years ago

First results, for http://d-nb.info/gnd/2047974-8/about/lds:

{
  "@graph" : [ {
    "@id" : "http://d-nb.info/gnd/2047974-8",
    "@type" : "organisation",
    "http://d-nb.info/standards/elementset/dnb#deprecatedUri" : [ "http://d-nb.info/gnd/4194078-7" ],
    "http://d-nb.info/standards/elementset/gnd#broaderTermInstantial" : [ {
      "@id" : "http://d-nb.info/gnd/4630294-3"
    } ],
    "http://d-nb.info/standards/elementset/gnd#geographicAreaCode" : [ {
      "@id" : "http://d-nb.info/standards/vocab/gnd/geographic-area-code#XA-DE-NW"
    } ],
    "http://d-nb.info/standards/elementset/gnd#gndIdentifier" : [ "2047974-8" ],
    "http://d-nb.info/standards/elementset/gnd#gndSubjectCategory" : [ {
      "@id" : "http://d-nb.info/standards/vocab/gnd/gnd-sc#6.7"
    }, {
      "@id" : "http://d-nb.info/standards/vocab/gnd/gnd-sc#2.2"
    } ],
    "homepage" : [ {
      "@id" : "https://www.hbz-nrw.de/"
    } ],
    "http://d-nb.info/standards/elementset/gnd#oldAuthorityNumber" : [ "(DE-588)4194078-7", "(DE-588b)2047974-8", "(DE-588c)4194078-7" ],
    "placeOfBusiness" : [ {
      "@id" : "http://d-nb.info/gnd/4031483-2"
    } ],
    "preferredName:ForTheCorporateBody" : [ "Hochschulbibliothekszentrum des Landes Nordrhein-Westfalen" ],
    "http://d-nb.info/standards/elementset/gnd#spatialAreaOfActivity" : [ {
      "@id" : "http://d-nb.info/gnd/4042570-8"
    } ],
    "topic" : [ {
      "@id" : "http://d-nb.info/gnd/4132773-1"
    } ],
    "variantName:ForTheCorporateBody" : [ "Hochschulbibliothekszentrum NRW", "Hochschulbibliothekszentrum des Landes NRW", "Hochschulbibliothekszentrum", "hbz", "hbz Köln" ],
    "http://www.w3.org/2002/07/owl#sameAs" : [ {
      "@id" : "http://d-nb.info/gnd/4194078-7"
    } ],
    "url" : [ {
      "@id" : "http://de.wikipedia.org/wiki/Hochschulbibliothekszentrum_des_Landes_Nordrhein-Westfalen"
    } ]
  } ]
}

For http://d-nb.info/gnd/4074335-4/about/lds:

{
  "@graph" : [ {
    "@id" : "_:t1",
    "@type" : "http://www.opengis.net/ont/sf#Point",
    "http://www.opengis.net/ont/geosparql#asWKT" : [ {
      "@type" : "http://www.opengis.net/ont/geosparql#wktLiteral",
      "@value" : "Point ( -000.125740 +051.508530 )"
    } ]
  }, {
    "@id" : "http://d-nb.info/gnd/4074335-4",
    "@type" : "http://d-nb.info/standards/elementset/gnd#TerritorialCorporateBodyOrAdministrativeUnit",
    "http://d-nb.info/standards/elementset/dnb#deprecatedUri" : [ "http://d-nb.info/gnd/1005809-6" ],
    "http://d-nb.info/standards/elementset/gnd#definition" : [ {
      "@language" : "de",
      "@value" : "Hauptstadt des Vereinigten Königreichs von Großbritannien und Nordirland, in Mittelsteinzeit besiedelt, 43 n.Chr. von Römern gegründet; das County of London war 1889-1965 Verwaltungsgrafschaft u. zeremonielle Grafschaft"
    } ],
    "http://d-nb.info/standards/elementset/gnd#geographicAreaCode" : [ {
      "@id" : "http://d-nb.info/standards/vocab/gnd/geographic-area-code#XA-GB"
    } ],
    "http://d-nb.info/standards/elementset/gnd#gndIdentifier" : [ "4074335-4" ],
    "homepage" : [ {
      "@id" : "http://www.london.gov.uk"
    } ],
    "http://d-nb.info/standards/elementset/gnd#oldAuthorityNumber" : [ "(DE-588)1005809-6", "(DE-588b)1005809-6", "(DE-588c)4074335-4" ],
    "preferredName:ForThePlaceOrGeographicName" : [ "London" ],
    "http://d-nb.info/standards/elementset/gnd#relatedDdcWithDegreeOfDeterminacy4" : [ {
      "@id" : "http://dewey.info/class/2--421/"
    } ],
    "variantName:ForThePlaceOrGeographicName" : [ "Londinum", "Londra", "Lundonia", "Augusta Trinobantum", "Westminster", "Lundun", "Landan", "Londyn", "Londres", "Londen", "London (Great Britain)", "Londinium" ],
    "http://www.opengis.net/ont/geosparql#hasGeometry" : [ {
      "@id" : "_:t1"
    } ],
    "http://www.w3.org/2002/07/owl#sameAs" : [ {
      "@id" : "http://d-nb.info/gnd/1005809-6"
    }, {
      "@id" : "http://sws.geonames.org/2643743"
    } ]
  } ]
}

acka47 commented 7 years ago

So the geo stuff is in there. However, we will need some pre- and post-processing to get the expected results.

Pre-processing / Reasoning

In 1.0, we added some inferencing to get more general properties. I suggest doing similar things here:

  1. We don't want specific name properties like preferredNameForThePlaceOrGeographicName and variantNameForThePlaceOrGeographicName. For all entities, we should just use preferredName and variantName.
  2. We probably need to add all superclasses to the data. In this case, this would be PlaceOrGeographicName and AuthorityResource.

Having done 1.) and 2.), the result would look like this:

{
  "@graph" : [ {
    "@id" : "_:t1",
    "@type" : "http://www.opengis.net/ont/sf#Point",
    "http://www.opengis.net/ont/geosparql#asWKT" : [ {
      "@type" : "http://www.opengis.net/ont/geosparql#wktLiteral",
      "@value" : "Point ( -000.125740 +051.508530 )"
    } ]
  }, {
    "@id" : "http://d-nb.info/gnd/4074335-4",
    "@type" : [ "http://d-nb.info/standards/elementset/gnd#TerritorialCorporateBodyOrAdministrativeUnit",  "http://d-nb.info/standards/elementset/gnd#PlaceOrGeographicName", "http://d-nb.info/standards/elementset/gnd#AuthorityResource" ],
    "http://d-nb.info/standards/elementset/dnb#deprecatedUri" : [ "http://d-nb.info/gnd/1005809-6" ],
    "http://d-nb.info/standards/elementset/gnd#definition" : [ {
      "@language" : "de",
      "@value" : "Hauptstadt des Vereinigten Königreichs von Großbritannien und Nordirland, in Mittelsteinzeit besiedelt, 43 n.Chr. von Römern gegründet; das County of London war 1889-1965 Verwaltungsgrafschaft u. zeremonielle Grafschaft"
    } ],
    "http://d-nb.info/standards/elementset/gnd#geographicAreaCode" : [ {
      "@id" : "http://d-nb.info/standards/vocab/gnd/geographic-area-code#XA-GB"
    } ],
    "http://d-nb.info/standards/elementset/gnd#gndIdentifier" : [ "4074335-4" ],
    "homepage" : [ {
      "@id" : "http://www.london.gov.uk"
    } ],
    "http://d-nb.info/standards/elementset/gnd#oldAuthorityNumber" : [ "(DE-588)1005809-6", "(DE-588b)1005809-6", "(DE-588c)4074335-4" ],
    "http://d-nb.info/standards/elementset/gnd#preferredName" : [ "London" ],
    "http://d-nb.info/standards/elementset/gnd#relatedDdcWithDegreeOfDeterminacy4" : [ {
      "@id" : "http://dewey.info/class/2--421/"
    } ],
    "http://d-nb.info/standards/elementset/gnd#variantName" : [ "Londinum", "Londra", "Lundonia", "Augusta Trinobantum", "Westminster", "Lundun", "Landan", "Londyn", "Londres", "Londen", "London (Great Britain)", "Londinium" ],
    "http://www.opengis.net/ont/geosparql#hasGeometry" : [ {
      "@id" : "_:t1"
    } ],
    "http://www.w3.org/2002/07/owl#sameAs" : [ {
      "@id" : "http://d-nb.info/gnd/1005809-6"
    }, {
      "@id" : "http://sws.geonames.org/2643743"
    } ]
  } ]
}
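The two pre-processing steps above could be sketched roughly as follows. This is an illustration under assumptions, not the 1.0 inferencing code: the superclass table is a hard-coded excerpt covering only this example (a real implementation would presumably read the hierarchy from the GND ontology), the function names are hypothetical, and name values are assumed to be lists as in the output above.

```python
import re

GND = "http://d-nb.info/standards/elementset/gnd#"

# Excerpt of the class hierarchy, hard-coded for illustration only
SUPERCLASS = {
    GND + "TerritorialCorporateBodyOrAdministrativeUnit": GND + "PlaceOrGeographicName",
    GND + "PlaceOrGeographicName": GND + "AuthorityResource",
}

# Matches e.g. gnd:preferredNameForThePlaceOrGeographicName
NAME_PROPERTY = re.compile(re.escape(GND) + r"(preferredName|variantName)For\w+$")


def generalize_names(node):
    """Step 1: rename specific name properties to preferredName / variantName."""
    result = {}
    for key, value in node.items():
        match = NAME_PROPERTY.match(key)
        if match:
            # Assumes name values are lists; merge if several keys collapse
            result.setdefault(GND + match.group(1), []).extend(value)
        else:
            result[key] = value
    return result


def add_superclasses(node):
    """Step 2: add all superclasses of the node's types to "@type"."""
    if "@type" not in node:
        return node
    types = node["@type"]
    types = [types] if isinstance(types, str) else list(types)
    for current in types:  # the list grows while iterating, so parents are followed too
        parent = SUPERCLASS.get(current)
        if parent and parent not in types:
            types.append(parent)
    return {**node, "@type": types}


london = {
    "@id": "http://d-nb.info/gnd/4074335-4",
    "@type": GND + "TerritorialCorporateBodyOrAdministrativeUnit",
    GND + "preferredNameForThePlaceOrGeographicName": ["London"],
    GND + "variantNameForThePlaceOrGeographicName": ["Londinium", "Londres"],
}
processed = add_superclasses(generalize_names(london))
print(processed["@type"])
```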

Context & Framing

The result of framing the above output (based on the to-be-added AuthorityResource type) and adding the EntityFacts context can be viewed at http://tinyurl.com/y7n93utq. This is not satisfying. For one, the EntityFacts context doesn't suffice and would have to be extended, as it doesn't cover the whole GND ontology. (EntityFacts is a simplification for use of GND by web developers.) However, using our current context from 1.0 already looks much better, see http://tinyurl.com/ychm4t92. Thus, I suggest we just update that one.

Furthermore, the @graph is still in there after framing and has to be removed by us. (It currently isn't possible to just leave it out, but this will be possible with the next JSON-LD version; see this thread on the linked-json mailing list and the issue resulting from the thread.)
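Removing the @graph ourselves is a small post-processing step. A minimal sketch, assuming the framed result contains exactly one top-level node (the function name `unwrap_graph` and the example.org context URL are placeholders):

```python
def unwrap_graph(framed):
    """Lift the single node out of "@graph", keeping "@context" if present."""
    graph = framed.get("@graph")
    if not isinstance(graph, list) or len(graph) != 1:
        return framed  # nothing to unwrap, or ambiguous: leave unchanged
    node = dict(graph[0])
    if "@context" in framed:
        node = {"@context": framed["@context"], **node}
    return node


framed = {
    "@context": "http://example.org/context.jsonld",  # placeholder URL
    "@graph": [{"@id": "http://d-nb.info/gnd/4074335-4", "preferredName": "London"}],
}
print(unwrap_graph(framed))
```

Results with more than one top-level node are returned unchanged here, since it would not be clear which node is the record.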

acka47 commented 7 years ago

I just found out that I already created a context for the 2.0 GND API, see https://github.com/hbz/lobid-gnd/issues/1. (We should probably delete this repo as soon as we have moved the issue over here.) This context is also missing some things (e.g. the geo properties), see http://tinyurl.com/y8z3f3rl.

fsteeg commented 7 years ago

Another option would be direct transformation from MARC-XML to JSON, like in lobid-organisations.

We could adapt the existing mappings for the RDF conversion: https://github.com/culturegraph/metafacture-examples/tree/master/Linked-Data-Service-Gnd

acka47 commented 7 years ago

Re. the framing output from http://tinyurl.com/ychm4t92, I just noticed that blank nodes get an id:

      "hasGeometry": {
        "@id": "_:b0",
        "@type": "http://www.opengis.net/ont/sf#Point",
        "asWKT": "Point ( -000.125740 +051.508530 )"
      }

We should get rid of them. This has already been addressed in the JSON-LD Framing spec 1.1 ("pruneBlankNodeIdentifiers") but is currently only implemented in the Ruby library, see https://github.com/json-ld/json-ld.org/issues/293.
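Until the library we use supports that option, the IDs could be stripped in a post-processing step. The following is a simplified approximation, not the algorithm from the Framing spec: it drops an "@id" only when it is a blank node identifier that occurs exactly once in the document, so nodes that are referenced from elsewhere keep theirs. The function names are hypothetical.

```python
from collections import Counter


def _blank_ids(node):
    """Yield every blank node identifier used as an "@id" value."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "@id" and isinstance(value, str) and value.startswith("_:"):
                yield value
            else:
                yield from _blank_ids(value)
    elif isinstance(node, list):
        for item in node:
            yield from _blank_ids(item)


def prune_blank_node_ids(doc):
    """Remove blank node "@id" entries that occur only once in the document."""
    counts = Counter(_blank_ids(doc))

    def is_prunable(key, value):
        return (key == "@id" and isinstance(value, str)
                and value.startswith("_:") and counts[value] == 1)

    def prune(node):
        if isinstance(node, list):
            return [prune(item) for item in node]
        if isinstance(node, dict):
            return {k: prune(v) for k, v in node.items() if not is_prunable(k, v)}
        return node

    return prune(doc)


framed = {
    "hasGeometry": {
        "@id": "_:b0",
        "@type": "http://www.opengis.net/ont/sf#Point",
        "asWKT": "Point ( -000.125740 +051.508530 )",
    }
}
print(prune_blank_node_ids(framed))
```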

fsteeg commented 7 years ago

Input: http://d-nb.info/gnd/4074335-4/about/lds

Context: https://gist.githubusercontent.com/acka47/98035a3f215c783bdc00/raw/5699ab4e89b5e7ab896ac69442c84fcf7f50ad66/gnd-context_20160126.jsonld

Frame: https://gist.githubusercontent.com/fsteeg/729e623e7f3c5f0003bc6f28a525d2ea/raw/4e0632608116acd043727ec45588236a98cc6eef/gnd-frame_20160126.jsonld

Output:

{
  "@id" : "http://d-nb.info/gnd/4074335-4",
  "@type" : "TerritorialCorporateBodyOrAdministrativeUnit",
  "http://d-nb.info/standards/elementset/dnb#deprecatedUri" : [ "http://d-nb.info/gnd/1005809-6" ],
  "definition" : [ {
    "@language" : "de",
    "@value" : "Hauptstadt des Vereinigten Königreichs von Großbritannien und Nordirland, in Mittelsteinzeit besiedelt, 43 n.Chr. von Römern gegründet; das County of London war 1889-1965 Verwaltungsgrafschaft u. zeremonielle Grafschaft"
  } ],
  "geographicAreaCode" : [ "http://d-nb.info/standards/vocab/gnd/geographic-area-code#XA-GB" ],
  "gndIdentifier" : [ "4074335-4" ],
  "homepage" : [ "http://www.london.gov.uk" ],
  "oldAuthorityNumber" : [ "(DE-588)1005809-6", "(DE-588b)1005809-6", "(DE-588c)4074335-4" ],
  "preferredNameForThePlaceOrGeographicName" : [ "London" ],
  "relatedDdcWithDegreeOfDeterminacy4" : [ "http://dewey.info/class/2--421/" ],
  "variantNameForThePlaceOrGeographicName" : [ "Londinum", "Londra", "Lundonia", "Augusta Trinobantum", "Westminster", "Lundun", "Landan", "Londyn", "Londres", "Londen", "London (Great Britain)", "Londinium" ],
  "http://www.opengis.net/ont/geosparql#hasGeometry" : [ {
    "@id" : "_:b0",
    "@type" : "http://www.opengis.net/ont/sf#Point",
    "http://www.opengis.net/ont/geosparql#asWKT" : [ {
      "@type" : "http://www.opengis.net/ont/geosparql#wktLiteral",
      "@value" : "Point ( -000.125740 +051.508530 )"
    } ]
  } ],
  "sameAs" : [ "http://d-nb.info/gnd/1005809-6", "http://sws.geonames.org/2643743" ]
}

@acka47 Except for the points you already mentioned (missing keys in context, blank node IDs) this looks OK. Did I understand correctly: the idea is to add the http://d-nb.info/standards/elementset/gnd#AuthorityResource type to all authorities?

acka47 commented 7 years ago

Yes, this already looks quite good. And yes, as in 1.0, we should add the type AuthorityResource to all entities.

Furthermore, we should have a type from the second level of the GND ontology attached to each resource. We will need this for faceting. The GND ontology has three levels in its type hierarchy (except for Person, where a fourth one is added). See the overview of the GND class hierarchy at https://wiki1.hbz-nrw.de/x/CIeW. In the concrete example, PlaceOrGeographicName should be in the data.

Regarding the name properties, we should only use preferredName and variantName for all entities. This will allow us to query the whole data in a uniform way. (The type is made clear by other means so that we don't need the specific properties.)

fsteeg commented 7 years ago

Deployed current state to: http://test.lobid.org/authorities

Our London example: http://test.lobid.org/authorities/4074335-4.json

@acka47 The context is used directly from GitHub, so you can edit on GitHub to test context tweaks: https://github.com/hbz/lobid-authorities/blob/master/conf/context.jsonld

(Context content is from https://gist.githubusercontent.com/acka47/98035a3f215c783bdc00/raw/5699ab4e89b5e7ab896ac69442c84fcf7f50ad66/gnd-context_20160126.jsonld)

fsteeg commented 7 years ago

Before working on the details (2nd level superclasses, rename fields, remove blank node IDs), I suggest we continue with testing the actual indexing of this format in Elasticsearch. I'd suggest we resolve this issue, and open new issues for the things I mentioned above. Assigning @acka47 for functional review.

acka47 commented 7 years ago

I just noticed that the language isn't indicated the way we do it in other lobid services:

"definition":[
   {
      "@language":"de",
      "@value":"Hauptstadt des Vereinigten Königreichs von Großbritannien und Nordirland, in Mittelsteinzeit besiedelt, 43 n.Chr. von Römern gegründet; das County of London war 1889-1965 Verwaltungsgrafschaft u. zeremonielle Grafschaft"
   }
]

We would rather have "@container": "@language" in the context and the following in the data:

"definition":[
   {
      "de":"Hauptstadt des Vereinigten Königreichs von Großbritannien und Nordirland, in Mittelsteinzeit besiedelt, 43 n.Chr. von Römern gegründet; das County of London war 1889-1965 Verwaltungsgrafschaft u. zeremonielle Grafschaft"
   }
]

I updated the context accordingly, but we will also have to take this into account during transformation.
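If the transformation does not already produce this shape, the conversion could look roughly like the sketch below. It assumes every value of the property carries both "@language" and "@value", keeps the array wrapper shown in the data above, and uses a hypothetical function name; the sample text is shortened.

```python
def to_language_map(values):
    """Turn [{"@language": "de", "@value": "..."}] into [{"de": "..."}]."""
    by_language = {}
    for item in values:
        by_language.setdefault(item["@language"], []).append(item["@value"])
    # A language with one value is stored bare, repeated languages keep a list
    return [{
        lang: texts[0] if len(texts) == 1 else texts
        for lang, texts in by_language.items()
    }]


definition = [{"@language": "de", "@value": "Hauptstadt des Vereinigten Königreichs"}]
print(to_language_map(definition))
```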

acka47 commented 7 years ago

"I updated the context accordingly, but we will also have to take this into account during transformation."

Looks fine already, so there is nothing more to do. (I also adjusted the context for biographicalOrHistoricalInformation.)

We will have to find out which other properties use language tags.

acka47 commented 7 years ago

+1 Did some adjustments to the context and I am satisfied for now. Will open issues for the other things.

fsteeg commented 7 years ago

I don't think we need a separate beta/prod system yet. The context is used directly from GitHub, so there is nothing to deploy. Closing this.

Opened #5 for indexing.