Open dr0i opened 7 years ago
What about not extracting keywords at all, but simply adding the ToC as a blob in a toc
field? This would be dead simple to do. By default, this field should then be excluded when querying.
Regarding the concrete addition, I propose something like this for the example you already mentioned (I took the ocr text and replaced newlines by spaces):
{
"tableOfContents":[
{
"id":"http://d-nb.info/1132221323/04",
"label":"http://d-nb.info/1132221323",
"ocr":"Inhalt Einleitung 1. 2. 9 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. Kein Endkampf 1918 18 Gewalt und die große Angst vom November 1918 Der Liebknecht-Mythos 74 Blutiger Freitag 95 Blutweihnacht 117 Der 29. Dezember 1918 136 Der Januaraufstand 149 »Die Stunde der Abrechnung naht« 176 Die ersten Gräuel: der 11. Januar 1919 190 Karl Liebknecht und Rosa Luxemburg 212 Der Märzaufstand 237 Schießbefehl 254 Gustav Noske, der Held 276 Geiselmord in München 293 Gesellenmord in München 314 331 Epilog Danksagung 344 Abkürzungsverzeichnis 348 Karten von Kiel, München und Berlin Anmerkungen 353 Bibliographie 414 Personenregister 430 349"
}
]
}
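To illustrate the "excluded by default" idea, here is a minimal Python sketch. It assumes a record shaped like the JSON above; the `matches` and `iter_text` helpers, the `include_toc` flag, and the placeholder title are hypothetical and only stand in for whatever the real index does.

```python
# Fields that a plain search skips unless the caller opts in (assumption).
DEFAULT_EXCLUDED = {"tableOfContents"}


def iter_text(value):
    """Yield every string nested anywhere inside a JSON-like value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_text(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_text(item)


def matches(record, term, include_toc=False):
    """Case-insensitive substring search that ignores the ToC blob by default."""
    term = term.lower()
    for field, value in record.items():
        if field in DEFAULT_EXCLUDED and not include_toc:
            continue
        if any(term in text.lower() for text in iter_text(value)):
            return True
    return False


record = {
    "title": "Placeholder title",  # hypothetical, not real data
    "tableOfContents": [
        {
            "id": "http://d-nb.info/1132221323/04",
            "label": "http://d-nb.info/1132221323",
            "ocr": "Inhalt Einleitung Der Januaraufstand 149",
        }
    ],
}

print(matches(record, "Januaraufstand"))                    # ToC skipped by default
print(matches(record, "Januaraufstand", include_toc=True))  # ToC searched on request
```

A consumer would have to opt in explicitly to get hits from the ToC text, which keeps the noisy data out of ordinary queries.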
(I took the ocr text and replaced newlines by spaces) [...] "ocr" : "Inhalt..."
Just to mention two minor details at this point: we should not lose the newline information, and I think we should not call it "ocr" (but e.g. "text"), since the content may come from other sources (in the future).
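Both details could look roughly like this in Python. This is only a sketch: `make_toc_entry` is a hypothetical helper, the key name "text" is the suggestion from the comment above, and the positions of the line breaks in the sample are assumed for illustration.

```python
import json


def make_toc_entry(toc_id, label, raw_text):
    """Build one tableOfContents entry, keeping the original line breaks."""
    # Only normalise line endings; do not collapse newlines into spaces.
    text = raw_text.replace("\r\n", "\n").replace("\r", "\n")
    # "text" instead of "ocr", since the content may come from other sources.
    return {"id": toc_id, "label": label, "text": text}


entry = make_toc_entry(
    "http://d-nb.info/1132221323/04",
    "http://d-nb.info/1132221323",
    "Inhalt\r\nEinleitung\r\nKein Endkampf 1918 18",  # line breaks assumed
)

# JSON escapes the newlines as \n, so the line structure survives serialisation.
print(json.dumps({"tableOfContents": [entry]}, ensure_ascii=False, indent=2))
```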
A desideratum is to enrich the resource's data with keywords extracted from ToCs (tables of contents). See e.g. http://digitool.hbz-nrw.de:1801/webclient/DeliveryManager?pid=7240453&custom_att_2=simple_viewer (source: http://lobid.org/resources/HT019337607). Because the extraction is unsupervised and non-normative ("just plain literals"), the enrichment should be excludable when querying the index. One way would be to subsume the data under the subject field and type it differently. As we also transport the provenance information (source), this should also be sufficient to include or exclude the data at will when querying. Because of the provenance information, it is transparent to consumers how reliable the data is.
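A rough Python sketch of filtering subject entries by provenance follows. The entry layout (`label`, `type`, `source`), the value "toc-extraction", and the `visible_subjects` helper are assumptions made for illustration, not the actual lobid data model.

```python
def visible_subjects(subjects, exclude_sources=frozenset()):
    """Return the subject entries whose provenance is not excluded."""
    return [s for s in subjects if s.get("source") not in exclude_sources]


subjects = [
    # Hypothetical curated subject with its own provenance.
    {"label": "Placeholder subject", "type": "Concept", "source": "catalogue"},
    # Hypothetical plain literal extracted from a ToC, typed differently.
    {"label": "Januaraufstand", "type": "Keyword", "source": "toc-extraction"},
]

# Consumers who distrust the unsupervised extraction drop it by provenance.
curated_only = visible_subjects(subjects, exclude_sources={"toc-extraction"})
print([s["label"] for s in curated_only])

# Without an exclusion list, every subject is returned.
print(len(visible_subjects(subjects)))
```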