hbz / lobid

Linking Open Bibliographic Data
https://lobid.org/
Eclipse Public License 2.0

Take 064 into account for transformation to RDF #266

Closed acka47 closed 7 years ago

acka47 commented 8 years ago

Sub-issue of #161. In RDA records, the "Formschlagwörter" (form subject headings) are in field 064 instead of being listed with the other subject headings.

Examples

http://lobid.org/resource/HT017458093 which has Formschlagwort "Zeitung" but isn't typed as such yet:

          <datafield ind2="1" ind1="a" tag="064">
            <subfield code="a">Zeitung</subfield>
            <subfield code="9">(DE-588)4067510-5</subfield>
          </datafield>

http://lobid.org/resource/HT018781721 (snippet) which has Formschlagwort "Zeitschrift" and is already typed as bibo:Journal:

          <datafield ind2="1" ind1="a" tag="064">
            <subfield code="a">Zeitschrift</subfield>
            <subfield code="9">(DE-588)4067488-5</subfield>
          </datafield>
          <datafield in

http://lobid.org/resource/HT018772904 (Formschlagwort "Monografische Reihe" and already typed as bibo:Series):

          <datafield ind2="1" ind1="a" tag="064">
            <subfield code="a">Monografische Reihe</subfield>
            <subfield code="9">(DE-588)4179998-7</subfield>
          </datafield>

acka47 commented 8 years ago

We might consider aligning the RDF for pre-RDA and RDA records by removing "Formschlagwörter" from the subject array for pre-RDA records. See also https://github.com/hbz/lobid-rdf-to-json/issues/23#issuecomment-243483195.

acka47 commented 8 years ago

As nobody asked for this, I'd say it is sufficient to do this in API 2.0. Thus, adding the label.

acka47 commented 8 years ago

Here is the core list of GND Formschlagwörter: http://access.rdatoolkit.org/document.php?id=nlgpschp7&target=nlgps07-27

Here is the extended list with all GND Formschlagwörter (PDF): https://wiki.dnb.de/download/attachments/106042227/AH-007.pdf

acka47 commented 8 years ago

There is redundant information in MAB/Aleph fields 051/052. I wonder whether the information in 051 is generated automatically from the 064 information or not (if not, the two fields might even contradict each other). From the MAB documentation:

051     VEROEFFENTLICHUNGSSPEZIFISCHE ANGABEN ZU BEGRENZTEN
        WERKEN

          Indikator:
          blank = nicht definiert

          Datenelemente:
            0  Erscheinungsform
               a = unselbstaendig erschienenes Werk
               f = Fortsetzung
               m = einbaendiges Werk - nicht Teil eines
                   Gesamtwerks
               n = mehrbaendiges begrenztes Werk - nicht Teil
                   eines Gesamtwerks
               s = einbaendiges Werk  u n d  Teil (mit
                   Stuecktitel) eines Gesamtwerks
               t = mehrbaendiges begrenztes Werk  u n d
                   Teil (mit Stuecktitel) eines Gesamtwerks

          1-3  Veroeffentlichungsart und Inhalt
               a = Abstract (Referat)
               b = Bibliographie
               c = Katalog
               d = Woerterbuch
               e = Enzyklopaedie
               f = Festschrift
               g = Datenbank
               h = Biographie
               i = Registerwerk
               j = Fortschrittsbericht
               k = Konferenzschrift
               l = Gesetz
               m = Musikalia
               n = Normschrift
               o = Loseblattausgabe
               p = Patentdokument
               q = Lieferungswerk
               r = Report
               s = Statistik
               t = Aufsatz
               u = Universitaetsschrift
               v = Sonderdruck
               x = Schulbuch
               z = sonstige Veroeffentlichungsart/-inhalt

Some examples to take a closer look at: http://lobid.org/hbz01/HT019025947, http://lobid.org/hbz01/HT019025943, http://lobid.org/hbz01/HT018814546, http://lobid.org/hbz01/HT018913029, http://lobid.org/hbz01/HT018909174

ChristophEwertowski commented 7 years ago

By testing I found out that fields 051/052 aren't automatically generated from 064. For each of the core Formschlagwörter I took the first five hits from lobid.org/resource and looked them up on lobid.org/hbz01. For Autobiografie, Bibliografie, Biografie, Comic, Festschrift, Hochschulschrift, Hörbuch, Schulbuch, Website and Zeitschrift I found no record with a field 064.

acka47 commented 7 years ago

@ChristophEwertowski We have to check the RDA titles to see whether 051/052 are automatically generated from 064. RDA records are those with a creation date after 2015-10-01. You can limit a query to those using the Elasticsearch query DSL, see https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-uri-request.html

acka47 commented 7 years ago

@dr0i showed me how to limit the queries to those created after a specific point in time using the URL. E.g. http://lobid.org/resources?q=describedby.dateCreated:%3E20151001
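
A minimal Python sketch of how such a date-limited URL could be built. The endpoint and the field name `describedby.dateCreated` are taken from the example above; the helper name `created_after_url` is hypothetical and not part of lobid.

```python
from urllib.parse import quote

# Endpoint as given in the comment above.
BASE_URL = "http://lobid.org/resources"


def created_after_url(date_yyyymmdd):
    """Build a search URL limited to records created after the given date.

    Hypothetical helper for illustration only.
    """
    query = f"describedby.dateCreated:>{date_yyyymmdd}"
    # Keep the colon readable; ">" is percent-encoded as %3E.
    return f"{BASE_URL}?q={quote(query, safe=':')}"


print(created_after_url("20151001"))
```

Passing a different date string (in the same `YYYYMMDD` form) moves the cut-off, which is handy for sampling records around the RDA switch.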

ChristophEwertowski commented 7 years ago

I confined my search to October 2015 and onwards and looked at it again. There are still cases where 064 doesn't exist but 051 does, so for these cases 051 isn't automatically generated from 064. Example: http://lobid.org/hbz01/HT018979011

In other cases both fields exist but contain different information. Example: http://lobid.org/hbz01/HT018976920 <controlfield tag="051">at||||||</controlfield> which means "unselbstaendig erschienenes Werk, Aufsatz" (a dependently published work, article).

<datafield tag="064" ind1="a" ind2="1">
    <subfield code="a">Biografie</subfield>
    <subfield code="9">(DE-588)4006804-3</subfield>
    <subfield code="y">1921-1978</subfield>
</datafield>

Since a biography could also exist in other forms, e.g. books, for this case 051 couldn't be generated from 064.
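
To make the reading of the 051 value easier to reproduce, here is a rough Python sketch that decodes position 0 and positions 1-3 using the code lists from the MAB documentation quoted earlier in this thread. The function name `decode_051` is hypothetical, and treating `|` as an unused fill character is an assumption based on the examples here.

```python
# Position 0 (Erscheinungsform), copied from the MAB documentation excerpt.
ERSCHEINUNGSFORM_051 = {
    "a": "unselbstaendig erschienenes Werk",
    "f": "Fortsetzung",
    "m": "einbaendiges Werk - nicht Teil eines Gesamtwerks",
    "n": "mehrbaendiges begrenztes Werk - nicht Teil eines Gesamtwerks",
    "s": "einbaendiges Werk und Teil (mit Stuecktitel) eines Gesamtwerks",
    "t": "mehrbaendiges begrenztes Werk und Teil (mit Stuecktitel) eines Gesamtwerks",
}

# Positions 1-3 (Veroeffentlichungsart und Inhalt), same source.
ART_UND_INHALT_051 = {
    "a": "Abstract (Referat)", "b": "Bibliographie", "c": "Katalog",
    "d": "Woerterbuch", "e": "Enzyklopaedie", "f": "Festschrift",
    "g": "Datenbank", "h": "Biographie", "i": "Registerwerk",
    "j": "Fortschrittsbericht", "k": "Konferenzschrift", "l": "Gesetz",
    "m": "Musikalia", "n": "Normschrift", "o": "Loseblattausgabe",
    "p": "Patentdokument", "q": "Lieferungswerk", "r": "Report",
    "s": "Statistik", "t": "Aufsatz", "u": "Universitaetsschrift",
    "v": "Sonderdruck", "x": "Schulbuch",
    "z": "sonstige Veroeffentlichungsart/-inhalt",
}


def decode_051(value):
    """Return (Erscheinungsform, [Veroeffentlichungsart, ...]) for a 051 value.

    Assumption: "|" (or any unknown code) means "not set" and is skipped.
    """
    form = ERSCHEINUNGSFORM_051.get(value[:1])
    kinds = [ART_UND_INHALT_051[c] for c in value[1:4] if c in ART_UND_INHALT_051]
    return form, kinds


print(decode_051("at||||||"))
```

Note that "Biographie" would be code `h` in positions 1-3, which the example record does not carry; that is the mismatch with its 064 entry described above.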

ChristophEwertowski commented 7 years ago

The Formschlagwörter are still kept apart from the other keywords, in field 064 rather than 952 (see the first post from Nov. 2015; example: http://lobid.org/hbz01/HT019016389). As a result, they are also missing from subjectLabels (http://lobid.org/resource/HT019016389/about).

And if you look closer at the first example you can see that in the hbz01 file it's described as a newspaper (http://lobid.org/hbz01/HT017458093, field 064) and in the lobid-resource as a journal (http://lobid.org/resource/HT017458093, type:bibo/Journal) which are two different publication types.

acka47 commented 7 years ago

> And if you look closer at the first example you can see that in the hbz01 file it's described as a newspaper (http://lobid.org/hbz01/HT017458093, field 064) and in the lobid-resource as a journal (http://lobid.org/resource/HT017458093, type:bibo/Journal) which are two different publication types.

The example you point to has p at position 0 in 052, which is (correctly) transformed to type "Journal". Thus, this seems to be a cataloging error rather than a transformation error.

Source data:

<controlfield tag="052">pag||||aw||||||</controlfield>

From the MAB documentation:

052       VEROEFFENTLICHUNGSSPEZIFISCHE ANGABEN ZU FORTLAUFENDEN
          SAMMELWERKEN

          Indikator:
          blank = nicht definiert

          Datenelemente:
            0  Erscheinungsform
               a = unselbstaendig erschienenes Werk
               f = Fortsetzung
               j = zeitschriftenartige Reihe
               p = Zeitschrift
               r = Schriftenreihe (Serie)
               z = Zeitung
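
Mismatches like this one could be spotted with a small check that compares position 0 of 052 with the labels found in 064. The Python sketch below is only an illustration: the function name `conflicting_form_headings` is hypothetical, and it compares labels by exact string match, so it only catches cases where the 064 label is literally one of the 052 form names.

```python
# Position 0 of MAB field 052, copied from the documentation excerpt above.
FORM_052 = {
    "a": "unselbstaendig erschienenes Werk",
    "f": "Fortsetzung",
    "j": "zeitschriftenartige Reihe",
    "p": "Zeitschrift",
    "r": "Schriftenreihe (Serie)",
    "z": "Zeitung",
}


def conflicting_form_headings(field_052, labels_064):
    """Return the 052 form and any 064 labels that name a different form.

    Hypothetical helper; exact label matching is a simplifying assumption.
    """
    form = FORM_052.get(field_052[:1])
    known_forms = set(FORM_052.values())
    conflicts = [
        label for label in labels_064
        if label in known_forms and label != form
    ]
    return form, conflicts


# Values from HT017458093: 052 says "Zeitschrift", 064 says "Zeitung".
print(conflicting_form_headings("pag||||aw||||||", ["Zeitung"]))
```

A record with `p` in 052 and "Zeitschrift" in 064 yields no conflicts, so only contradictory records are flagged for review.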

ChristophEwertowski commented 7 years ago

To get back on track, here is a summary of the open points:

1. Do we really need Formschlagwörter?

2. Are the fields 051/052 derived from 064 for RDA? (Probably not.) @acka47, who would be the right contact person?

I'm going to tackle the first question by checking which and how many Formschlagwörter are already covered by the mapping of 050-052.

ChristophEwertowski commented 7 years ago

1. As acka47 said: They contain different content and should be kept in our data.

acka47 commented 7 years ago

R.D. (Edoweb) just asked for the 064 in an email:

We have only just noticed that MARC category 064 is not carried over into the lobid API and therefore not into Edoweb either. Example: image001 It contains important information for subject indexing. Can you tell us whether this is an oversight and whether it can be made up for?

Here is a link to the example from the screenshot: http://lobid.org/resources/HT019149667

acka47 commented 7 years ago

I think it will be hard to align 064 ("Nature of Content"/"Art des Inhalts") with the information we already have about a resource from other fields (including Formschlagwörter). Thus, the easiest way might be to just add 064 independently to the RDF. The fitting property from the RDA registry is http://rdaregistry.info/Elements/u/P60584 "has nature of content". I couldn't find controlled vocabulary for the values. The controlled value list appears to be DACH-specific, so this is not surprising.

acka47 commented 7 years ago

> I couldn't find controlled vocabulary for the values.

As there are GND URIs given (I already linked to the PDF above that also lists the GND URIs), we will just use these along with the label given in subfield a, e.g. for the example:

{
   "@context":"http://lobid.org/resources/context.jsonld",
   "id":"http://lobid.org/resources/HT019149667#!",
   "natureOfContent":[
      {
         "id":"http://d-nb.info/gnd/4048476-2",
         "label":"Ratgeber"
      },
      {
         "id":"http://d-nb.info/gnd/4142300-8",
         "label":"Amtliche Publikation"
      }
   ]
}
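
As a rough illustration of this mapping (not the actual lobid transformation), the Python sketch below turns 064 datafields into `natureOfContent` entries. It assumes that the GND URI is obtained by replacing the `(DE-588)` prefix in subfield 9 with `http://d-nb.info/gnd/`, and that the XML carries no namespace, as in the snippets in this thread; the function name `nature_of_content` is hypothetical.

```python
import json
import xml.etree.ElementTree as ET

GND_PREFIX = "(DE-588)"                # prefix as seen in subfield 9
GND_BASE = "http://d-nb.info/gnd/"     # assumed URI base, as in the JSON above

# Sample taken from the first example in this issue.
SAMPLE_064 = """
<datafield ind2="1" ind1="a" tag="064">
  <subfield code="a">Zeitung</subfield>
  <subfield code="9">(DE-588)4067510-5</subfield>
</datafield>
"""


def nature_of_content(datafields_xml):
    """Map 064 datafields to a list of {"id", "label"} dicts (sketch only)."""
    # Wrap in a dummy root so several sibling datafields can be parsed at once.
    root = ET.fromstring(f"<record>{datafields_xml}</record>")
    entries = []
    for field in root.iter("datafield"):
        if field.get("tag") != "064":
            continue
        subfields = {sf.get("code"): (sf.text or "") for sf in field.findall("subfield")}
        entry = {}
        gnd_id = subfields.get("9", "")
        if gnd_id.startswith(GND_PREFIX):
            entry["id"] = GND_BASE + gnd_id[len(GND_PREFIX):]
        if "a" in subfields:
            entry["label"] = subfields["a"]
        if entry:
            entries.append(entry)
    return entries


print(json.dumps({"natureOfContent": nature_of_content(SAMPLE_064)}, ensure_ascii=False))
```

Fields without a `(DE-588)` identifier still yield an entry with only a label, so no heading is silently dropped.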

ChristophEwertowski commented 7 years ago

NatureOfContent is added. Example (production) / example (test).

acka47 commented 7 years ago

Looks good. +1

dr0i commented 7 years ago

Deployed to production, closing.