hbz / lobid

Linking Open Bibliographic Data
https://lobid.org/
Eclipse Public License 2.0

Adjust transformation rules to RDA changes #161

Closed acka47 closed 7 years ago

acka47 commented 9 years ago

From 1 October 2015, people will be cataloging in the hbz union catalog according to the RDA rules as documented here. We will have to adjust the transformation, i.e. the hbz01-to-lobid morph file, accordingly.

After a first cursory look at the documents, I suggest the following approach:

1. Identifying RDA records: RDA is applied only to newly catalogued resources, which get an RDA marker r in field 030, indicator=blank, position 4 of the Aleph sequentials (aseq); see the documentation. Thus, we will have to add the RDA transformation rules only for these records.

2. Checking fields that will be omitted: Several fields won't be used anymore with RDA cataloging. You can see the list here. We will check whether and how we currently transform these to RDF.

3. Find out how to transform the new data to lobid's current RDF data model: After having identified the data fields where RDA means change, we will have to find out how to integrate the new RDA data into the current lobid RDF.

4. Discuss how to handle breaks in the cataloging practice: While we will be able to make a seamless transformation for some of the RDA cataloging, so that lobid customers won't even notice that things have changed, this may not be possible for all of the changes. E.g., regarding IMD (Inhaltstyp, Medientyp, Datenträgertyp) / CMC (content type, media type, carrier type), we will get better and more coherent information (see here for details).

In cases where cataloging practice breaks significantly, we will have to consider whether to map the data both to the old/current data model and according to RDA.
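The first step, identifying RDA records, could look roughly like the following Python sketch. It rests on assumptions that are not confirmed in this issue: that field 030 is exported as a datafield whose coded value is the text of its first subfield, that "position 4" is counted from zero, and that the indicator does not need to be checked. The function name `is_rda_record` and the sample record are made up for illustration.

```python
import xml.etree.ElementTree as ET

RDA_MARKER = "r"
MARKER_POSITION = 4  # assumption: zero-based; use 3 if the documentation counts from 1


def local_name(tag):
    """Strip a namespace prefix such as '{uri}datafield' down to 'datafield'."""
    return tag.rsplit("}", 1)[-1]


def is_rda_record(record_xml):
    """Return True if a field 030 carries the RDA marker at MARKER_POSITION."""
    root = ET.fromstring(record_xml)
    for field in root.iter():
        if local_name(field.tag) != "datafield" or field.get("tag") != "030":
            continue
        # Assumption: the coded value sits in the first subfield of field 030.
        subfields = list(field)
        value = (subfields[0].text or "") if subfields else (field.text or "")
        # Slicing avoids an IndexError when the value is shorter than expected.
        if value[MARKER_POSITION:MARKER_POSITION + 1] == RDA_MARKER:
            return True
    return False


# Hypothetical sample record, not taken from the union catalog.
SAMPLE = """<record>
  <datafield tag="030" ind1="-" ind2="1">
    <subfield code="a">a|1ur|||||||17</subfield>
  </datafield>
</record>"""

print(is_rda_record(SAMPLE))
```

In the morph file itself, the same check would presumably become a condition that guards the RDA-specific rules.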

acka47 commented 9 years ago

Here is a list of RDA records in lobid, courtesy of @donboern: rda-ids.txt

acka47 commented 9 years ago

Most of the resources listed in rda-ids.txt seem to be periodicals. Here is a print book: http://lobid.org/resource/HT018779822

acka47 commented 9 years ago

What seems important to me for a start is field 419 with the publisher and publication place/date information.

Snippet from http://lobid.org/resource/HT018779822:

<datafield ind2="1" ind1="-" tag="419">
            <subfield code="a">New York</subfield>
            <subfield code="b">Routledge</subfield>
            <subfield code="c">2014</subfield>
</datafield>

Snippet from http://lobid.org/resource/HT018772912:

          <datafield ind2="1" ind1="-" tag="419">
            <subfield code="a">Sundern</subfield>
            <subfield code="b">Baulmann Leuchten GmbH</subfield>
            <subfield code="c">2011-</subfield>
            <subfield code="A">3</subfield>
          </datafield>
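
A minimal Python sketch of how the 419 subfields shown above could be pulled out of a record. The meaning of the subfield codes (a = place, b = publisher, c = date) is read off the two snippets and is an assumption, as are the function name `publication_statements`, the key names, and the `<record>` wrapper. Subfield A is ignored because its meaning is not clear from the snippets. The actual morph rules may look quite different.

```python
import xml.etree.ElementTree as ET

# Assumed meaning of the subfield codes, inferred from the two snippets.
SUBFIELD_NAMES = {"a": "place", "b": "publisher", "c": "date"}


def publication_statements(record_xml):
    """Collect one dict per 419 field, keyed by the assumed subfield meaning."""
    root = ET.fromstring(record_xml)
    statements = []
    # Assumes un-namespaced elements, as in the snippets.
    for field in root.iter("datafield"):
        if field.get("tag") != "419":
            continue
        statement = {}
        for subfield in field.findall("subfield"):
            name = SUBFIELD_NAMES.get(subfield.get("code"))
            if name:
                statement[name] = (subfield.text or "").strip()
        statements.append(statement)
    return statements


# The datafield is copied from HT018779822; the wrapper element is hypothetical.
SAMPLE = """<record>
  <datafield ind2="1" ind1="-" tag="419">
    <subfield code="a">New York</subfield>
    <subfield code="b">Routledge</subfield>
    <subfield code="c">2014</subfield>
  </datafield>
</record>"""

print(publication_statements(SAMPLE))
```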

dr0i commented 8 years ago

Thus, we will have to add the RDA transformation rules only for these records.

Does this mean that fields are ambiguous, e.g. that 419-1c is the publication date if the record is RDA-catalogued but something different if it is old MAB2? (In this case I can see that it isn't.) If there is no interference, it's much simpler to configure the transformation rules; that's why I ask.

acka47 commented 8 years ago

@dr0i Could you please get the Aleph XML source of all records listed in rda-ids.txt and put them in one file so that I can search for specific fields?

dr0i commented 8 years ago

for i in $(cat rda-ids.txt); do xmllint --format "$i?format=source" >> rda-ids.alephMabXmlPretty.xml; done

You can find the result at http://lobid.org/download/rda-ids.alephMabXmlPretty.xml .

acka47 commented 8 years ago

According to publisso stakeholders, they want to work with the roles of persons/corporations from RDA. We will have to consider these in the transformation. Note to self: Take a look at this and open a separate issue.

dr0i commented 8 years ago

Updated rda-ids.alephMabXmlPretty.xml. I took DE-605-aleph-baseline-marcxchange-2016011515.tar.gz as the base, which yields 16k resources marked as RDA. Hope this suffices.

acka47 commented 8 years ago

@dr0i Could you please update rda-ids.alephMabXmlPretty.xml once more?

dr0i commented 8 years ago

Around 180k docs, concatenated in one big bzipped xml file: http://lobid.org/download/rda-ids.alephMabXmlPretty.xml.bz2

acka47 commented 8 years ago

Thanks. Unwieldy as the file is getting, I won't ask for it to be created again. Now thinking about how to work with a 1.5 GB XML file...

dr0i commented 8 years ago

Depending on what you want, you can always use the friendly stream tools like less, grep, sed etc.
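
The same line-by-line idea can be sketched in Python, reading the bzip2-compressed file directly without unpacking it first. This is a minimal sketch, assuming the pretty-printed file has one element per line (which is what `xmllint --format` normally produces); `count_tags` and `count_tags_in_bz2` are hypothetical helper names, and the path has to point at a local download.

```python
import bz2
import re
from collections import Counter

# Matches the tag attribute of a datafield start tag, wherever it appears.
TAG_ATTRIBUTE = re.compile(r'<datafield\b[^>]*\btag="([^"]+)"')


def count_tags(lines):
    """Count how often each datafield tag occurs in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        match = TAG_ATTRIBUTE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_tags_in_bz2(path):
    """Stream a bzip2-compressed text file and count its datafield tags."""
    # Iterating over the handle reads line by line, so memory use stays small.
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return count_tags(handle)
```

For example, `count_tags_in_bz2("rda-ids.alephMabXmlPretty.xml.bz2").most_common(10)` would list the ten most frequent fields, which helps to see quickly which fields the RDA records actually use.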

acka47 commented 8 years ago

There seems to be a problem with rda-ids.alephMabXmlPretty.xml. When I run, for example,

cat rda-ids.alephMabXmlPretty.xml | xmllint --format - | grep --color -A 4 "<datafield tag=\"064\" ind1=\".\" ind2=\".\">"

I get:

-:103: parser error : XML declaration allowed only at the start of the document
<?xml version="1.0" encoding="UTF-8"?>
     ^
-:104: parser error : Extra content at the end of the document
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.o
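
The first parser error points at a second XML declaration in the middle of the file, which is what appending several complete documents to one file produces. One possible workaround, sketched in Python under the assumption that every concatenated document starts with its own `<?xml ...?>` declaration on a separate line: split the stream at each declaration and parse the pieces one by one. `split_documents` is a hypothetical helper, and the sample records are made up.

```python
import xml.etree.ElementTree as ET


def split_documents(lines):
    """Yield each concatenated XML document as one string."""
    current = []
    for line in lines:
        # A new declaration starts the next document; emit the previous one.
        if line.lstrip().startswith("<?xml") and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)


# Hypothetical stand-in for the concatenated file.
SAMPLE_LINES = [
    '<?xml version="1.0" encoding="UTF-8"?>\n',
    "<record><id>1</id></record>\n",
    '<?xml version="1.0" encoding="UTF-8"?>\n',
    "<record><id>2</id></record>\n",
]

for document in split_documents(SAMPLE_LINES):
    # Each piece is well-formed on its own and can be parsed separately.
    root = ET.fromstring(document.encode("utf-8"))
    print(root.findtext("id"))
```

For plain searching, grepping the file directly without piping it through xmllint should also work, since the file is already pretty-printed.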

ChristophEwertowski commented 7 years ago

As a finger exercise, I looked at morph-hbz01-to-lobid.xml to check which of the now-omitted fields are transformed and how, and to document it here.

| Number MAB2 | Field name MAB2 | If and how transformed to RDF |
| --- | --- | --- |
| 300 | Sammlungsvermerk | - |
| 304 | Einheitssachtitel | a → dc/terms/alternative |
| 310 | Hauptsachtitel in Ansetzungsform | → title |
| 333 | zu ergänzende Urheber zum Hauptsachtitel | If no title exists, set as title. Also taken as CorporateBodyTitle. |
| 334 | Allgemeine Materialbenennung | Match with Bibo/AudioDocument, bibo/AudioVisualDocument, bibo/Image, RDACarrierType/1020 (Microform Carriers). Used for checking if full text is online. |
| 340, 344, 348, 352 | Parallelsachtitel in Ansetzungsform | - |
| 342, 346, 350, 354 | zu ergänzende Urheber zum Parallelsachtitel | - |
| 361 | Beigefügte Werke | - |
| 410, 411, 412, 415, 416, 417, 418 | Alter Erscheinungsvermerk | - |
| 454, 464, 474, 484, 494 | Gesamttitel in Ansetzungsform (to be decided at union catalog level) | - |
| 502 | Einheitssachtitel eines beigefügten oder kommentierten Werkes | - |
| 504 | Angabe von Paralleltiteln | → dc/terms/alternative |
| 517 | Angaben zum Inhalt | - |
| 519 | Alter Hochschulschriftenvermerk | If existing, multiple values are combined as RDA Elements/u/P60489 |
| 532 | Hinweise auf frühere und spätere sowie zeitweise gültige Titel | - |
| 610-645 | Segment Sekundärformen | 619a (Erscheinungsjahr(e) in Vorlageform) matched with 021 (Identifikationsnummer der Primaerform) |
| 652 | Spezifische Materialbenennung und Dateityp | a (stands for RAK-NBM) → Online resource |
| 653 | Physische Beschreibung der Computerdatei auf Datenträger | - |
| 8XX | Segment Nichtstandardmäßige Nebeneintragungen | Matches with some GND-id? |
| 9XX | Bei RSWK-Schlagwörtern erstes Unterfeld $f | Matches with some GND-id? |
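
To see which of these omitted fields still occur in a given record, the table could be turned into a lookup. A minimal Python sketch, assuming un-namespaced `datafield` elements as in the snippets earlier in this thread. Only a few rows are copied from the table, and `omitted_fields_in` as well as the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

# Subset of the table: MAB2 field number -> current RDF target, None if untransformed.
OMITTED_FIELDS = {
    "300": None,
    "304": "dc/terms/alternative",
    "504": "dc/terms/alternative",
    "517": None,
    "653": None,
}


def omitted_fields_in(record_xml):
    """Return the omitted MAB2 fields present in a record, with their RDF target."""
    root = ET.fromstring(record_xml)
    found = {}
    for field in root.iter("datafield"):
        tag = field.get("tag")
        if tag in OMITTED_FIELDS:
            found[tag] = OMITTED_FIELDS[tag]
    return found


# Hypothetical record mixing an omitted field (304) with a current one (419).
SAMPLE = """<record>
  <datafield tag="304" ind1="-" ind2="1"><subfield code="a">Example</subfield></datafield>
  <datafield tag="419" ind1="-" ind2="1"><subfield code="a">New York</subfield></datafield>
</record>"""

print(omitted_fields_in(SAMPLE))
```

Run over the RDA records, a check like this would show whether any of the omitted fields still turn up in practice.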

acka47 commented 7 years ago

Closing this super-issue as the two remaining sub-issues are sufficient for future orientation (and don't need to be implemented for the launch).