hbz / digitalisiertedrucke

Implements http://digitalisiertedrucke.de/

Index data into embedded Elasticsearch index #1

Closed fsteeg closed 8 years ago

fsteeg commented 8 years ago

Converted the data to JSON using the existing morphs from @dr0i and indexed into Elasticsearch.

I don't quite understand what the three morphs are about (collection, title-print, title-digital). I've transformed the data with each of the morphs, resulting in 3 documents (of type collection, title-print, title-digital) for each of the 491032 source records. See:

http://quaoar1.hbz-nrw.de:6001/ (hbz-internal link)

This certainly isn't right. What's the idea here? Are we trying to extract the collections from records? And what are the different types of title data (print, digital) about?

acka47 commented 8 years ago

We have three types in this data: digital items are the base type. They are part of a collection of digitized items. Furthermore, each digital item is the result of digitizing a print item (which has a different creation date and creator than the digital item). There is a wiki page visualizing an example at https://wiki1.hbz-nrw.de/display/SEM/Visualizing+hbz-zvdd+LOD+transformation.

Thus, it is OK that we have the same number of print and digital items. But there shouldn't be as many collections.

acka47 commented 8 years ago

In lobid we currently list both publication dates (of original and digitized resource) in one property. There is still https://github.com/lobid/lodmill/issues/730 open for doing this in another way.

fsteeg commented 8 years ago

I've changed the collection morph to generate the document ID from field 992, subfield a (if that matches collection.*), resulting in 147 documents of type collection:

http://quaoar1.hbz-nrw.de:6011/digitalisiertedrucke/collection/_search?q=*
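
The approach can be illustrated outside of the morph with a minimal Python sketch. It is not the actual morph: the record layout (field number mapped to a list of subfield dicts) and the sample 992$a values are assumptions made for illustration only. It shows how taking the document ID from 992$a collapses many source records into one document per distinct collection.

```python
import re

# Pattern from the comment above: only 992$a values matching collection.* count.
COLLECTION_PATTERN = re.compile(r"collection.*")


def collection_id(record):
    """Return the first 992$a value matching the pattern, or None.

    `record` uses a hypothetical layout: {"992": [{"a": "..."}, ...]}.
    """
    for subfields in record.get("992", []):
        value = subfields.get("a", "")
        if COLLECTION_PATTERN.fullmatch(value):
            return value
    return None


def collect_collections(records):
    """Keep one document per distinct collection ID."""
    collections = {}
    for record in records:
        doc_id = collection_id(record)
        if doc_id is not None:
            # Later records with the same ID do not create new documents.
            collections.setdefault(doc_id, {"id": doc_id})
    return collections


# Hypothetical sample records; real source data will look different.
records = [
    {"992": [{"a": "collection:zvdd.hbz.k.de"}]},
    {"992": [{"a": "collection:zvdd.hbz.k.de"}]},
    {"992": [{"a": "something else"}]},
]
print(sorted(collect_collections(records)))
```

With the ID derived from the field value, indexing the same collection again overwrites the existing document instead of adding a new one, which is why the count drops to the number of distinct collections.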

acka47 commented 8 years ago

This sounds better. Looking at the data, I see that periods are replaced by underscores in the property URIs, e.g. http://purl_org/dc/terms/language. This should be simple keys mapped to URIs in a JSON context.

Also, I will have to take a look at the properties used. Made an issue for this at #3.

acka47 commented 8 years ago

Just take the suffix as key, I will adjust this as needed when doing #3.
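
A rough Python sketch of what "take the suffix as key" could look like, assuming the property URIs are available in their original form (with periods). The helper names `suffix_key`, `build_context` and `compact` are made up for this example and are not part of the project.

```python
import json


def suffix_key(uri):
    """Return the part of a URI after its last '#' or '/'."""
    return uri.rstrip("/#").replace("#", "/").rsplit("/", 1)[-1]


def build_context(property_uris):
    """Map each simple key to its full property URI."""
    context = {}
    for uri in property_uris:
        key = suffix_key(uri)
        # Two vocabularies may share a suffix (e.g. 'publisher'); fail loudly.
        if key in context and context[key] != uri:
            raise ValueError(f"key collision for {key!r}")
        context[key] = uri
    return context


def compact(doc, context):
    """Replace full property URIs in a document by their simple keys."""
    by_uri = {uri: key for key, uri in context.items()}
    return {by_uri.get(key, key): value for key, value in doc.items()}


doc = {
    "http://purl.org/dc/terms/isPartOf": "collection:zvdd.hbz.k.de",
    "http://purl.org/dc/terms/created": "2000-2005",
}
context = build_context(doc.keys())
print(json.dumps({"@context": context, **compact(doc, context)}, indent=2))
```

The collision check matters because plain suffixes are not unique across vocabularies: a `publisher` property exists in both http://purl.org/dc/terms/ and http://purl.org/dc/elements/1.1/, so such cases need a manually chosen key.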

acka47 commented 8 years ago

I just saw another problem: The IDs/URIs for the resources (currently ~rdf:subject) as well as the links between the resources (isFormatOf, hasFormat, isPartOf) don't work because no absolute URIs are used yet. E.g.:

{
   "~rdf:subject":"resource:D28022",
   "http://purl_org/dc/terms/isPartOf":"collection:zvdd.hbz.k.de",
   "http://www_w3_org/1999/02/22-rdf-syntax-ns#type":"http://purl.org/dc/terms/BibliographicResource",
   "http://purl_org/dc/elements/1_1/publisher":"RWTH Aachen; Universitätsbibliothek Johann Christian Senckenberg Frankfurt a. M.; Bibliothek Germania Judaica Köln",
   "http://lobid_org/vocab/lobid#fulltextOnline":"http://www.compactmemory.de/index_p.aspx?tzpid=20&ID_0=20&ID_1=392&ID_2=7453&ID_3=68191",
   "id":"28022",
   "http://www_w3_org/2004/02/skos/core#Concept":"http://iflastandards.info/ns/isbd/terms/mediatype/T1002",
   "http://purl_org/dc/terms/isFormatOf":"resource:P28022",
   "http://purl_org/dc/terms/created":"2000-2005"
}

Maybe just use http://digitalisiertedrucke.de/resource/$id for now...
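
A minimal sketch of the expansion in Python, not the actual morph change. The `resource/` base follows the suggestion above; the `collection/` base and the choice to keep the local name (`D28022`) rather than the bare `id` (`28022`) are assumptions for illustration.

```python
BASE = "http://digitalisiertedrucke.de/"

# 'resource' follows the suggestion above; 'collection' is an assumed analogue.
PREFIXES = {
    "resource": BASE + "resource/",
    "collection": BASE + "collection/",
}


def expand(value):
    """Turn 'prefix:local' into an absolute URI if the prefix is known."""
    prefix, sep, local = value.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    # Literals and values that are already absolute URIs stay unchanged.
    return value


def absolutize(doc):
    """Expand every string value of a flat document."""
    return {
        key: expand(value) if isinstance(value, str) else value
        for key, value in doc.items()
    }


doc = {
    "~rdf:subject": "resource:D28022",
    "http://purl_org/dc/terms/isPartOf": "collection:zvdd.hbz.k.de",
    "http://purl_org/dc/terms/isFormatOf": "resource:P28022",
    "http://purl_org/dc/terms/created": "2000-2005",
}
for key, value in absolutize(doc).items():
    print(key, "->", value)
```

Values such as `http://www.compactmemory.de/...` pass through untouched because `http` is not one of the known prefixes.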

fsteeg commented 8 years ago

Collecting subtasks:

fsteeg commented 8 years ago

Deployed to staging, see: http://quaoar1.hbz-nrw.de:6001/

acka47 commented 8 years ago

Transform multiple values as entity[]

The metafacture fix for this results in problems for lobid-organisations. @SBRitter will open an issue for this.

acka47 commented 8 years ago

Looks good for a first version. I already adjusted two obvious errors in the morph and will look at the details after my holidays. One thing to be corrected: Creators and their website's URL are listed in the creator array of a collection, e.g.:

"creator": [
  "Berlin-Brandenburgische Akademie der Wissenschaften",
  "http://bibliothek.bbaw.de"
]

This should rather be an object:

"creator": {
  "name": "Berlin-Brandenburgische Akademie der Wissenschaften",
  "url": "http://bibliothek.bbaw.de"
}

We might even use lobid-organisations URIs as id, e.g.:

"creator": {
  "id": "http://beta.lobid.org/organisations/DE-B4#!",
  "name": "Berlin-Brandenburgische Akademie der Wissenschaften",
  "url": "http://bibliothek.bbaw.de"
}
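
To illustrate the intended target structure, here is a small Python sketch of the array-to-object conversion. It is not how the morph does it: the function name `creator_entity`, the heuristic of recognizing the URL by its `http://` or `https://` prefix, and the assumption of one name and one URL per array are all made up for this example.

```python
def creator_entity(values, org_id=None):
    """Build a creator object from a flat [name, url] style array.

    Assumes at most one name and one URL per array; any extra values of
    the same kind are ignored in this sketch.
    """
    entity = {}
    for value in values:
        if value.startswith(("http://", "https://")):
            entity.setdefault("url", value)
        else:
            entity.setdefault("name", value)
    if org_id is not None:
        # Optional organisation URI, e.g. from lobid-organisations.
        entity = {"id": org_id, **entity}
    return entity


creator = creator_entity(
    [
        "Berlin-Brandenburgische Akademie der Wissenschaften",
        "http://bibliothek.bbaw.de",
    ]
)
print(creator)
```

Passing `org_id="http://beta.lobid.org/organisations/DE-B4#!"` would produce the third variant shown above, provided the organisation URI is known from somewhere.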

fsteeg commented 8 years ago

Depends on https://github.com/hbz/metafacture-core/issues/9, additional task in https://github.com/hbz/digitalisiertedrucke/issues/1#issuecomment-235875854, moving to ready.

fsteeg commented 8 years ago

The metafacture-core issue is fixed, and creators (as well as contributors) are transformed to entities with name and url literals, see:

http://quaoar1.hbz-nrw.de:6001/

To add the IDs to these entities, we need the ISILs, which are not part of the source data. If we really want this, we should open a separate issue for it.

acka47 commented 8 years ago

+1

fsteeg commented 8 years ago

Deployed to a separate internal instance: http://quaoar1.hbz-nrw.de:5000/ (Firefox blocks port 6000).