hbz / lobid

Linking Open Bibliographic Data
https://lobid.org/
Eclipse Public License 2.0

Get GND labels from base data #139

Closed acka47 closed 9 years ago

acka47 commented 9 years ago

Currently, we enrich the title data with GND labels using a Hadoop job. There are at least two problems with this approach: #84 and an undocumented problem that appeared after the last morph adjustment.

To avoid these problems and reduce transformation time, we will get the labels directly out of the Aleph XML using morph.

Amongst others, we need to know:

acka47 commented 9 years ago

Subject headings and preferred labels are in 902, alternate labels in 952. To find out which type a GND entity has, look at the indicator of 902. From the MAB documentation:

902       ELEMENT OF THE 1ST SUBJECT HEADING CHAIN

          Indicator:
          p     = personal name subject heading (Personenschlagwort)
          g     = geographic-ethnographic subject heading
          s     = topical subject heading (Sachschlagwort)
          k     = corporate body subject heading: entry under the
                  individual name
          c     = corporate body subject heading: entry under the
                  place of its seat
          z     = chronological subject heading (Zeitschlagwort)
          f     = form subject heading (Formschlagwort)
          t     = work title as subject heading
          blank = subordinate heading of a heading chain
acka47 commented 9 years ago

Example 1 (without contributor and with only one subject heading type): http://lobid.org/resource/HT010726584

The desired outcome is to have the preferred names associated with the GND objects as usual, and the alternate names along with the preferred names in the field subjectLabel to allow querying by all labels:

{
  "@graph" : [ {
    "@id" : "http://d-nb.info/gnd/4046259-6",
    "preferredName" : "Plasmaphysik",
    "preferredNameForTheSubjectHeading" : "Plasmaphysik"
  }, {
    "@id" : "http://d-nb.info/gnd/4067488-5",
    "preferredName" : "Zeitschrift",
    "preferredNameForTheSubjectHeading" : "Zeitschrift"
  }, {
    "@id" : "http://d-nb.info/gnd/4511937-5",
    "preferredName" : "Online-Publikation",
    "preferredNameForTheSubjectHeading" : "Online-Publikation"
  }, {
    "@id" : "http://dewey.info/class/530/",
    "prefLabel" : [ {
      "@language" : "en",
      "@value" : "Physics"
    }, {
      "@language" : "de",
      "@value" : "Physik"
    } ]
  }, {
    "@id" : "http://lobid.org/resource/HT010726584",
    ...
    "subject" : [ "http://d-nb.info/gnd/4067488-5", "http://dewey.info/class/530/", "http://d-nb.info/gnd/4046259-6", "http://d-nb.info/gnd/4511937-5" ],
    "subjectLabel" : [ "On-line-Dokument", "Online-Dokument", "On-line-Publikation", "Online-Ressource", "Computerdatei im Fernzugriff (Formschlagwort)", "Netzpublikation", "Zeitschriften", "Online-Datenbank (Formschlagwort)", "Periodikum", "On-line-Datenbank (Formschlagwort)" ],
   ...
   } ]
...
}

Aleph XML (snippet):

...
<datafield tag="902" ind1="-" ind2="1">
    <subfield code="s">Plasmaphysik</subfield>
    <subfield code="9">(DE-588)4046259-6</subfield>
</datafield>
<datafield tag="902" ind1="-" ind2="1">
    <subfield code="s">Zeitschrift</subfield>
    <subfield code="9">(DE-588)4067488-5</subfield>
</datafield>
<datafield tag="902" ind1="-" ind2="1">
    <subfield code="s">Online-Publikation</subfield>
    <subfield code="9">(DE-588)4511937-5</subfield>
</datafield>
...
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="s">Computerdatei im Fernzugriff</subfield>
    <subfield code="h">Formschlagwort</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="s">Online-Datenbank</subfield>
    <subfield code="h">Formschlagwort</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="s">Online-Dokument</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="s">On-line-Datenbank</subfield>
    <subfield code="h">Formschlagwort</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="s">On-line-Dokument</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="s">Online-Ressource</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="s">On-line-Publikation</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="s">Netzpublikation</subfield>
</datafield>

The implementation looks quite straightforward. For subjectLabel take all entries for 902 and 952, for preferredName only take 902.
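
A minimal Python sketch of that rule, not the actual morph definition. It assumes that the datafields are wrapped in a hypothetical `<record>` root element, that the heading text sits in the subfield named after the heading type (p, g, s, k, c, z, f, t), and that a subfield h is appended in parentheses, as in the desired subjectLabel values above. The names `heading_label` and `extract_labels` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Abridged from Example 1 and wrapped in a hypothetical <record> root element.
ALEPH_XML = """
<record>
  <datafield tag="902" ind1="-" ind2="1">
    <subfield code="s">Plasmaphysik</subfield>
    <subfield code="9">(DE-588)4046259-6</subfield>
  </datafield>
  <datafield tag="952" ind1="-" ind2="1">
    <subfield code="s">Online-Datenbank</subfield>
    <subfield code="h">Formschlagwort</subfield>
  </datafield>
  <datafield tag="952" ind1="-" ind2="1">
    <subfield code="s">Netzpublikation</subfield>
  </datafield>
</record>
"""

HEADING_CODES = set("pgskczft")  # subfields assumed to carry the heading text
GND_PREFIX = "(DE-588)"
GND_BASE = "http://d-nb.info/gnd/"


def heading_label(field):
    """Return the heading text, with subfield h appended in parentheses."""
    name = next(
        (sf.text for sf in field.findall("subfield")
         if sf.get("code") in HEADING_CODES),
        None,
    )
    if name is None:
        return None
    qualifier = field.findtext("subfield[@code='h']")
    return f"{name} ({qualifier})" if qualifier else name


def extract_labels(xml_text):
    """Collect preferredName per GND ID (902 only) and subjectLabel (902 and 952)."""
    preferred_names = {}
    subject_labels = []
    for field in ET.fromstring(xml_text).iter("datafield"):
        if field.get("tag") not in ("902", "952"):
            continue
        label = heading_label(field)
        if label is None:
            continue
        subject_labels.append(label)
        gnd_ref = field.findtext("subfield[@code='9']", default="")
        if field.get("tag") == "902" and gnd_ref.startswith(GND_PREFIX):
            preferred_names[GND_BASE + gnd_ref[len(GND_PREFIX):]] = label
    return preferred_names, subject_labels


preferred_names, subject_labels = extract_labels(ALEPH_XML)
print(preferred_names)
print(subject_labels)
```

Keeping the preferred names in a dictionary keyed by GND URI mirrors the `@graph` nodes of the desired outcome, while the flat list corresponds to subjectLabel on the resource itself.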

acka47 commented 9 years ago

Example 2 (with a corporate body as contributor and three different types of subject headings): http://lobid.org/resource/HT013077595/about

Desired outcome:

{
  "@graph" : [ {
    "@id" : "http://d-nb.info/gnd/109490312",
    "preferredName" : "Boer, Hans-Peter",
    "preferredNameForThePerson" : "Boer, Hans-Peter"
  }, {
    "@id" : "http://d-nb.info/gnd/11079267X",
    "preferredName" : "Balke, Kirsten",
    "preferredNameForThePerson" : "Balke, Kirsten"
  }, {
    "@id" : "http://d-nb.info/gnd/128755-2",
    "preferredName" : "Kreisheimatverein <Coesfeld>",
    "preferredNameForTheCorporateBody" : "Kreisheimatverein <Coesfeld>"
  }, {
    "@id" : "http://d-nb.info/gnd/4010355-9",
    "preferredName" : "Coesfeld",
    "preferredNameForThePlaceOrGeographicName" : "Coesfeld"
  }, {
    "@id" : "http://d-nb.info/gnd/4010356-0",
    "preferredName" : "Kreis Coesfeld",
    "preferredNameForThePlaceOrGeographicName" : "Kreis Coesfeld"
  }, {
    "@id" : "http://d-nb.info/gnd/4024116-6",
    "preferredName" : "Heimatkundeunterricht",
    "preferredNameForTheSubjectHeading" : "Heimatkundeunterricht"
  }, {
    "@id" : "http://lobid.org/resource/HT013077595",
    "contributorLabel" : [ "Balke, Kirsten", "Boer, Hans Peter", "Boer, Hans-Peter" ],
    "subjectLabel" : [ "Coesfeld. Hauptamt", "Landkreis Coesfeld", "Kreis Coesfeld. Kreistag", "Kreis Coesfeld. Hauptamt", "Kosfel'd", "Kreis Coesfeld. Oberkreisdirektor", "Coesfeld (Kreis)", "Kreis Coesfeld. Landrat", "Landrat (Kreis Coesfeld)", "Oberkreisdirektor (Kreis Coesfeld)", "Kreisverwaltung (Kreis Coesfeld)", "Kreistag (Kreis Coesfeld)", "Heimatkunde (Unterricht)", "Hauptamt (Kreis Coesfeld)", "Heimatkundedidaktik", "Stadtdirektor (Coesfeld)", "Pressestelle (Coesfeld)", "Hauptamt (Coesfeld)", "Coesfeld. Pressestelle", "Coesfeld. Stadtdirektor", "Heimatkunde / Didaktik", "Stadt Coesfeld", "Kreis Coesfeld. Kreisverwaltung" ],
    "contributor" : [ "http://d-nb.info/gnd/11079267X", "http://d-nb.info/gnd/128755-2", "http://d-nb.info/gnd/109490312" ],
    "subject" : [ "http://d-nb.info/gnd/4010355-9", "http://d-nb.info/gnd/4024116-6", "http://d-nb.info/gnd/4010356-0" ],
"subjectChain" : [ "Coesfeld | Heimatkundeunterricht | Lehrmittel", "Kreis Coesfeld | Heimatkundeunterricht | Lehrmittel (213)", "Kreis Coesfeld | Heimatkundeunterricht | Lehrmittel", "Coesfeld | Heimatkundeunterricht | Lehrmittel (213)" ],
   ...
   }]
...
}

Source data (snippet):

<datafield tag="104" ind1="b" ind2="1">
    <subfield code="p">Boer, Hans-Peter</subfield>
    <subfield code="d">1949-</subfield>
    <subfield code="b">[Red.]</subfield>
    <subfield code="9">(DE-588)109490312</subfield>
</datafield>
<datafield tag="105" ind1="-" ind2="1">
    <subfield code="p">Boer, Hans Peter</subfield>
    <subfield code="d">1949-</subfield>
</datafield>
<datafield tag="200" ind1="b" ind2="1">
    <subfield code="k">Kreisheimatverein</subfield>
    <subfield code="h">Coesfeld</subfield>
    <subfield code="9">(DE-588)128755-2</subfield>
</datafield>
<datafield tag="331" ind1="-" ind2="1">
    <subfield code="a">Geschichte hier</subfield>
</datafield>
...
<datafield tag="902" ind1="-" ind2="1">
    <subfield code="g">Coesfeld</subfield>
    <subfield code="9">(DE-588)4010355-9</subfield>
</datafield>
<datafield tag="902" ind1="-" ind2="1">
    <subfield code="s">Heimatkundeunterricht</subfield>
    <subfield code="9">(DE-588)4024116-6</subfield>
</datafield>
<datafield tag="902" ind1="-" ind2="1">
    <subfield code="f">Lehrmittel</subfield>
</datafield>
...
<datafield tag="902" ind1="-" ind2="1">
    <subfield code="s">Heimatkundeunterricht</subfield>
    <subfield code="9">(DE-588)4024116-6</subfield>
</datafield>
<datafield tag="902" ind1="-" ind2="1">
    <subfield code="f">Lehrmittel</subfield>
</datafield>
<datafield tag="903" ind1="-" ind2="1">
    <subfield code="a">213</subfield>
</datafield>
<datafield tag="907" ind1="-" ind2="1">
    <subfield code="g">Kreis Coesfeld</subfield>
    <subfield code="9">(DE-588)4010356-0</subfield>
</datafield>
<datafield tag="907" ind1="-" ind2="1">
    <subfield code="s">Heimatkundeunterricht</subfield>
    <subfield code="9">(DE-588)4024116-6</subfield>
</datafield>
<datafield tag="907" ind1="-" ind2="1">
    <subfield code="f">Lehrmittel</subfield>
</datafield>
<datafield tag="908" ind1="-" ind2="1">
    <subfield code="a">213</subfield>
</datafield>
<controlfield tag="SYS">011404221</controlfield>
<datafield tag="LOW" ind1="-" ind2="1">
    <subfield code="a">M0001</subfield>
</datafield>
<datafield tag="LOW" ind1="-" ind2="1">
    <subfield code="a">M1168</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="k">Coesfeld</subfield>
    <subfield code="b">Hauptamt</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="k">Hauptamt</subfield>
    <subfield code="h">Coesfeld</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="k">Coesfeld</subfield>
    <subfield code="b">Stadtdirektor</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="k">Stadtdirektor</subfield>
    <subfield code="h">Coesfeld</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="k">Coesfeld</subfield>
    <subfield code="b">Pressestelle</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="k">Pressestelle</subfield>
    <subfield code="h">Coesfeld</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="g">Kosfel'd</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="g">Stadt Coesfeld</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="s">Heimatkunde</subfield>
    <subfield code="h">Unterricht</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="s">Heimatkunde</subfield>
    <subfield code="x">Didaktik</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
    <subfield code="s">Heimatkundedidaktik</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
    <subfield code="k">Kreis Coesfeld</subfield>
    <subfield code="b">Oberkreisdirektor</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
    <subfield code="k">Oberkreisdirektor</subfield>
    <subfield code="h">Kreis Coesfeld</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
    <subfield code="k">Kreis Coesfeld</subfield>
    <subfield code="b">Kreistag</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
    <subfield code="k">Kreistag</subfield>
    <subfield code="h">Kreis Coesfeld</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
    <subfield code="k">Kreis Coesfeld</subfield>
    <subfield code="b">Hauptamt</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
    <subfield code="k">Hauptamt</subfield>
    <subfield code="h">Kreis Coesfeld</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
    <subfield code="k">Kreis Coesfeld</subfield>
    <subfield code="b">Landrat</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
    <subfield code="k">Landrat</subfield>
    <subfield code="h">Kreis Coesfeld</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
    <subfield code="k">Kreis Coesfeld</subfield>
    <subfield code="b">Kreisverwaltung</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
    <subfield code="k">Kreisverwaltung</subfield>
    <subfield code="h">Kreis Coesfeld</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
    <subfield code="g">Landkreis Coesfeld</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
    <subfield code="g">Coesfeld</subfield>
    <subfield code="h">Kreis</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
    <subfield code="s">Heimatkunde</subfield>
    <subfield code="h">Unterricht</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
    <subfield code="s">Heimatkunde</subfield>
    <subfield code="x">Didaktik</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
    <subfield code="s">Heimatkundedidaktik</subfield>
</datafield>
acka47 commented 9 years ago

As we currently do, we should record the preferred name in the RDF using both the general and the more specific property, e.g.:

    "@id" : "http://d-nb.info/gnd/4076769-3",
    "preferredName" : "Römerzeit",
    "preferredNameForTheSubjectHeading" : "Römerzeit"

Mapping of the subfields from https://github.com/hbz/lobid/issues/139#issuecomment-94394197 to RDF properties and their JSON object keys:

p: preferredNameForThePerson
g: preferredNameForThePlaceOrGeographicName
s: preferredNameForTheSubjectHeading
k: preferredNameForTheCorporateBody
c: :question:
z: No specific property, as these aren't GND entities; they are not linked and only occur as part of a subject chain in RDF.
f: Same as for z.
t: preferredNameForTheWork (:exclamation: We have to be careful here, as subfield t co-occurs with subfield p, see e.g. http://lobid.org/resource?id=HT018312899&format=source. For a start, we should map to preferredNameForTheWork if t occurs and prefix the creator name followed by a colon and a space; see e.g. http://193.30.112.134/F/?func=find-c&ccl_term=IDN%3DHT018312899 for an implementation.)
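
The mapping can be sketched as a lookup table in Python. This is an illustration under assumptions, not the morph implementation: the key for p follows the JSON of Example 2 (preferredNameForThePerson), c is left out because it is still an open question, and the helper `gnd_entity` is a hypothetical name.

```python
# Subfield code -> specific property, as listed above.
# c is still an open question; z and f are not GND entities.
SPECIFIC_PROPERTY = {
    "p": "preferredNameForThePerson",
    "g": "preferredNameForThePlaceOrGeographicName",
    "s": "preferredNameForTheSubjectHeading",
    "k": "preferredNameForTheCorporateBody",
    "t": "preferredNameForTheWork",
}


def gnd_entity(gnd_id, code, name):
    """Build a JSON-LD node carrying the general and the specific property.

    Returns None for codes without a mapping (c, z, f, blank).
    """
    specific = SPECIFIC_PROPERTY.get(code)
    if specific is None:
        return None
    return {
        "@id": f"http://d-nb.info/gnd/{gnd_id}",
        "preferredName": name,
        specific: name,
    }


node = gnd_entity("4076769-3", "s", "Römerzeit")
unmapped = gnd_entity("", "f", "Lehrmittel")
print(node)
print(unmapped)
```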

acka47 commented 9 years ago

Regarding subfield c, can you point me to an example, @dr0i?

dr0i commented 9 years ago

http://193.30.112.134/F/?func=find-c&ccl_term=IDN%3DHT014280388

acka47 commented 9 years ago

t: preferredNameForTheWork (:exclamation: We have to be careful here, as subfield t co-occurs with subfield p, see e.g. http://lobid.org/resource?id=HT018312899&format=source. For a start, we should map to preferredNameForTheWork if t occurs and prefix the creator name followed by a colon and a space; see e.g. http://193.30.112.134/F/?func=find-c&ccl_term=IDN%3DHT018312899 for an implementation.)

At the NWBib meeting, customers asked for GND work titles having the author name in the label (see https://wiki1.hbz-nrw.de/x/DQBEB). Example: http://lobid.org/resource?id=HT018312899&format=full

Instead of:

{

    "@id": "http://d-nb.info/gnd/7683386-0",
    "preferredName": "Der Cid",
    "preferredNameForTheWork": "Der Cid"

}

it should look like this:

{

    "@id": "http://d-nb.info/gnd/7683386-0",
    "preferredName": "Grabbe, Christian Dietrich: Der Cid",
    "preferredNameForTheWork": "Der Cid"

}
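
A short Python sketch of this labelling rule, assuming the creator name comes from subfield p and the title from subfield t of the same field. The function name `work_node` is hypothetical.

```python
def work_node(gnd_id, title, creator=None):
    """Build the node for a GND work.

    The creator name only prefixes the general preferredName;
    preferredNameForTheWork keeps the bare title.
    """
    label = f"{creator}: {title}" if creator else title
    return {
        "@id": f"http://d-nb.info/gnd/{gnd_id}",
        "preferredName": label,
        "preferredNameForTheWork": title,
    }


node = work_node("7683386-0", "Der Cid", creator="Grabbe, Christian Dietrich")
plain = work_node("7683386-0", "Der Cid")
print(node)
```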
dr0i commented 9 years ago

Ready for testing, e.g. compare http://lobid.org/resource/HT007496264 with http://test.lobid.org/resource/HT007496264. Transformation and indexing of all 20M docs (resulting in 66M docs) took 14h (formerly, with Hadoop: 35h). Still missing: enrichment with Open Library, DBpedia and Gutenberg. I made a ticket for this: lobid/lodmill#667.

literarymachine commented 9 years ago

I believe that restricting results by resource type is now broken, e.g. http://test.lobid.org/resource?name=Tom%2BSawyer&from=0&size=10&type=http%3A%2F%2Fpurl.org%2Fontology%2Fbibo%2FBook returns resources that are not bibo:Book (e.g. http://lobid.org/resource/HT016678345).

dr0i commented 9 years ago

@literarymachine The last commits (fixing the index config) seem to fix this problem. Please test it, ideally with the following API call, which results in 15 hits: http://test.lobid.org/resource?name=Tom%2BSawyer%20detective&from=0&size=50&type=http%3A%2F%2Fpurl.org%2Fontology%2Fbibo%2FBook It yields the same results as the lobid production system.
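
For repeated tests, the query URL can be assembled with the Python standard library. This is only a sketch: the parameter names are copied from the URL above, the decoded value of `name` is assumed to be "Tom+Sawyer detective", and `search_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://test.lobid.org/resource"


def search_url(name, resource_type, start=0, size=50):
    """Build a resource query URL with percent-encoded parameters."""
    params = {"name": name, "from": start, "size": size, "type": resource_type}
    # quote_via=quote encodes spaces as %20 and a literal "+" as %2B
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


url = search_url("Tom+Sawyer detective", "http://purl.org/ontology/bibo/Book")
print(url)
```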

acka47 commented 9 years ago

edit dr0i: made a new issue hbz/lobid#150.

acka47 commented 9 years ago

edit dr0i: put that comment into new issue hbz/lobid#150.

acka47 commented 9 years ago

EDIT dr0i: made new issue #149.

dr0i commented 9 years ago

Deployed to staging and production. @acka47 please have a look. Please also note the comment in lobid/lodmill#669.

acka47 commented 9 years ago

We can close this one, as we have this in production and there will probably be only minor adjustments in the future.