hbz / lobid-resources

Transformation, web frontend, and API for the hbz catalog as LOD
http://lobid.org/resources
Eclipse Public License 2.0

Enrich with ToCs #457

Open dr0i opened 7 years ago

dr0i commented 7 years ago

A desideratum is to enrich a resource's data with keywords extracted from its table of contents (ToC). See e.g. http://digitool.hbz-nrw.de:1801/webclient/DeliveryManager?pid=7240453&custom_att_2=simple_viewer (source: http://lobid.org/resources/HT019337607). Because the extraction is unsupervised and non-normative ("just plain literals"), it should be possible to exclude the enrichment when querying the index. One way would be to subsume the data under the subject field and give it a different type. Since we also carry the provenance information (source), that should be sufficient to include or exclude the enriched data in queries at will. The provenance information also makes it clear to consumers how reliable the data is.
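A minimal sketch of that idea, assuming a simplified record shape: the `type` and `source` values, the labels, and the helper `subject_labels` are all hypothetical and not the actual lobid-resources schema. It only illustrates how a provenance marker would let a consumer include or exclude the extracted keywords.

```python
# Hypothetical record: field names, type values and labels are illustrative only.
record = {
    "id": "http://lobid.org/resources/HT019337607",
    "subject": [
        {"label": "Curated subject heading", "type": "Concept", "source": "catalog"},
        {"label": "Januaraufstand", "type": "Keyword", "source": "toc"},
        {"label": "Märzaufstand", "type": "Keyword", "source": "toc"},
    ],
}


def subject_labels(record, include_toc=False):
    """Return subject labels, skipping ToC-derived entries unless requested."""
    return [
        subject["label"]
        for subject in record["subject"]
        if include_toc or subject["source"] != "toc"
    ]


curated_only = subject_labels(record)
with_toc = subject_labels(record, include_toc=True)
```

Filtering on `source` rather than on `type` keeps the decision tied to provenance, which is what tells consumers how much to trust the data.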

dr0i commented 7 years ago

What about not extracting keywords at all, but simply adding the ToC as a blob in a toc field? This would be braindead simple to do. By default, this field should then be excluded.
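A rough sketch of what "excluded by default" could mean, assuming it refers to searching: the record, the placeholder title, and the helper `matches` are hypothetical, and a real index would handle this in its query configuration rather than in application code.

```python
# Fields that are skipped unless the caller opts in (assumption: only "toc").
DEFAULT_EXCLUDED_FIELDS = {"toc"}

# Hypothetical record; the title is a placeholder, the ToC is a shortened excerpt.
record = {
    "title": "Example title",
    "toc": "Inhalt\nEinleitung\nDer Liebknecht-Mythos\nBlutiger Freitag",
}


def matches(record, term, include_toc=False):
    """Case-insensitive substring match over the record's string fields."""
    needle = term.lower()
    for field, value in record.items():
        if field in DEFAULT_EXCLUDED_FIELDS and not include_toc:
            continue  # the ToC blob is ignored by default
        if isinstance(value, str) and needle in value.lower():
            return True
    return False


hit_by_default = matches(record, "Liebknecht")
hit_with_toc = matches(record, "Liebknecht", include_toc=True)
```

With this shape, a term that occurs only in the ToC does not match unless the caller explicitly asks for the ToC to be searched.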

acka47 commented 7 years ago

Regarding the concrete addition, I propose something like this for the example you already mentioned (I took the OCR text and replaced newlines with spaces):

{
   "tableOfContents":[
      {
         "id":"http://d-nb.info/1132221323/04",
         "label":"http://d-nb.info/1132221323",
         "ocr":"Inhalt  Einleitung 1. 2.  9  3. 4. 5.  6. 7. 8. 9. 10. 11. 12.  13. 14.  15.  Kein Endkampf 1918 18 Gewalt und die große Angst vom November 1918 Der Liebknecht-Mythos 74 Blutiger Freitag 95 Blutweihnacht 117 Der 29. Dezember 1918 136 Der Januaraufstand 149 »Die Stunde der Abrechnung naht« 176 Die ersten Gräuel: der 11. Januar 1919 190 Karl Liebknecht und Rosa Luxemburg 212 Der Märzaufstand 237 Schießbefehl 254 Gustav Noske, der Held 276 Geiselmord in München 293 Gesellenmord in München 314 331  Epilog  Danksagung 344 Abkürzungsverzeichnis 348 Karten von Kiel, München und Berlin Anmerkungen 353 Bibliographie 414 Personenregister 430  349"
      }
   ]
}

fsteeg commented 7 years ago

(I took the ocr text and replaced newlines by spaces) [...] "ocr" : "Inhalt..."

Just to mention two minor details at this point: we should not lose the newline information, and I think we should not call it "ocr" (but e.g. "text"), since the content may come from other sources (in the future).
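A small sketch of both points, assuming the ToC arrives as a string with line breaks: the excerpt and its line breaks are illustrative, and `text` is the field name suggested above. JSON can carry the newlines as `\n` escapes, so there is no need to flatten them.

```python
import json

# Illustrative excerpt; the line breaks are assumed, not taken from the real OCR output.
raw_toc = "Blutiger Freitag 95\nBlutweihnacht 117\nDer 29. Dezember 1918 136"

entry = {
    "id": "http://d-nb.info/1132221323/04",
    "label": "http://d-nb.info/1132221323",
    "text": raw_toc,  # "text" instead of "ocr", newlines kept as-is
}

# json.dumps escapes each newline as the two characters backslash + n.
serialized = json.dumps({"tableOfContents": [entry]}, ensure_ascii=False)

# Parsing the JSON again restores the original line structure.
restored = json.loads(serialized)["tableOfContents"][0]["text"]
toc_lines = restored.splitlines()
```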