uwlib-cams / MARC2RDA

mapping between MARC21 and RDA-RDF
Creative Commons Zero v1.0 Universal

RDA/LRM/RDF data produced for the MARC2RDA project #390

Closed gerontakos closed 1 year ago

gerontakos commented 1 year ago

Hello all, a dataset has been produced for the group to review. It is on GitHub at MARC2RDA/Working Documents/transformationCode/outputDataForReview. There is the input MARC dataset, from which an RDF dataset was derived and saved in two versions, one without labels and another with labels. The one with labels is probably easier to review; that's dataset-1-withLabels.rdf. You can review it before the upcoming meeting, or we can review it together at our Wednesday morning meeting on April 26. Please remember that only a few fields have been sent to output. We thought it might be good to review some data before we code more fields, to make sure we're on the right track.

pan-zhuo commented 1 year ago

Source MARC data is now placed alongside corresponding RDA data. You can search for 'Fxxx' ('xxx' for MARC field tags, F245, F500, etc.) to get to the part you'd like to review. An index to the fields represented in the 2 datasets:

F020 F043 F245 F264 F306 F336 F337 F338 F340 F380 F382 F490 F500 F502 F504 F561 F880

[Image: example of searching for 'F336' (content type)]
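
The search described above can also be scripted. This is a minimal sketch, assuming the labelled RDF file is read as plain text and that each source field appears as a literal marker such as `F336` somewhere on a line; `find_field` is a hypothetical helper, not part of the project's transformation code.

```python
import re


def find_field(rdf_text, tag):
    """Return (line number, line) pairs that mention marker 'F' + tag.

    Hypothetical helper: it only does a text search and does not parse
    the RDF/XML.
    """
    marker = re.compile(rf"\bF{re.escape(tag)}\b")
    return [
        (number, line.strip())
        for number, line in enumerate(rdf_text.splitlines(), start=1)
        if marker.search(line)
    ]
```

Passing the contents of dataset-1-withLabels.rdf and a tag such as `"336"` would list every line to review for that field.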

GordonDunsire commented 1 year ago

Here are my initial thoughts on dataset 1. Overall: looking good.

More specific comments, in order of appearance in the file:

  1. <fake:rdawP10065>880-01 ʻAbd al-Razzāq, Zaynab, author.</fake:rdawP10065>

Is this a temporary transform artefact?

  2. <rdamd:P30004>(ISBN) 9789777952569 (paperback)</rdamd:P30004><!--rdamd:P30004 = has identifier for manifestation-->

This is a result of applying this option. But the ISBN Registry is a recognized VES, so the preceding option is better. This would result in "9789777952569".

  3. <rdamd:P30134>Iḥsān ʻAbd al-Qaddūs</rdamd:P30134><!--rdamd:P30134 = has title of manifestation-->

This is not necessary because it can be entailed from the next statement:

<rdamd:P30156>Iḥsān ʻAbd al-Qaddūs</rdamd:P30156><!--rdamd:P30156 = has title proper-->

  4. <rdamd:P30111>$a al-Qāhirah : $b al-Dār al-Miṣrīyah al-Lubnānīyah, $c 2020.</rdamd:P30111><!--rdamd:P30111 = has publication statement-->

It is not necessary to include the M21 encoding; value should be: "al-Qāhirah : al-Dār al-Miṣrīyah al-Lubnānīyah, 2020"

  5. <rdamd:P30002>n</rdamd:P30002><!--rdamd:P30002 = has media type-->

Notation from id.loc.gov vocabulary: why? The data is already in the M21 record, and the code is meaningless to an end-user. If the display is to be suppressed, how does the system know which value to suppress? Why not publish mappings between the RDA and M21 vocabularies? (Ask RSC Tech WG to do it, or send them a draft).

Supplementary: why not record the RDA IRI for the RDA media type in place of, or in addition to, the preferred label:

`<rdamd:P30002>unmediated</rdamd:P30002>`

`rdamt:1007 // cannot use datatype or object element`

  6. `Includes bibliographical references (pages 236-238).`

Bibliographical references are, themselves, not distinct expressions, so the RDA element used for this example is incorrect. This is not the same situation as an index, which derives its content from another expression (that is indexed). There is sufficient ambiguity in the M21 manual to warrant mapping to rdamd:P30137 (has note on manifestation); M21 mixes "bibliography" (a separate work/expression) with "bibliographic references" (an integral part of a scholarly work/expression), and is happy to use 500 for some cases.

= = =

  7. `F245 10 $a Developing digital project delivery routines around frequent disruptions : $n numer 8 $c Hamid Abdirad.`

`Developing digital project delivery routines around frequent disruptions`

$n should be part of the title:

`Developing digital project delivery routines around frequent disruptions, numer 8`

($p is also part of the title when present, and comes after $n data if present.) See also Manifestation entity output.

  8. `F502 ## $b Ph. D. $c University of Washington $d 2020.`

Should also map to: `Ph. D., University of Washington, 2020`

  9. `F264 #1 $a [Seattle] : $b [University of Washington Libraries], $c [2020]`

RDA: "This element is a superelement composed from values of one or more of the following subelements"

So the derivation process is:

    1. Establish the values of the subelements
    2. Apply a string encoding scheme to those values to obtain the value of the superelement

For 1, remove the square brackets; this is ISBD punctuation that indicates provenance (not found in manifestation): `Seattle`

For 2, do not add square brackets: `Seattle : University of Washington Libraries, 2020`

= = =

  10. `The Uyghur genocide : a psychological perspective / Mamtimin Ala`

This is incorrect (data that follows "=" in $c; should be parsed out of the subfield and treated as a single element, but which one?)

= = =

  11. `[Place of publication not identified]`

I don't like this (reinforced by working with ISBD), and think the RDA instructions require amendment. Specifically, this option (https://access.rdatoolkit.org/en-US_ala-b569c353-c792-3617-8c83-d974466ccc02/p_mts_hh3_cmb) should be replaced by: Record "place of publication not identified" as the value of note on manifestation. Ditto similar options for date of publication and name of publisher, and distribution, manufacture, and production.

= = =

  12. `1400 [2021 or 2022]`

This conforms with [this RDA option](https://access.rdatoolkit.org/en-US_ala-26b84eaa-054e-3c8c-8933-8760b9b2046f/p_i4p_pmp_cmb), but I do not think the option is applicable because we cannot or have not fulfilled the second part of the option, to record the provenance. Although the use of brackets is the ISBD way of "indicating provenance", it is excluded from official RDA (and the draft ISBD for Manifestation discontinues such usage). I suggest adding a condition to remove bracketed data from this field when recording a date: `1400`

This would cascade upwards to has publication statement ... (see 9.)

If we want to retain the bracketed data, I suggest mapping it to a note: `Date of publication: 2021 or 2022`

The boilerplate might be more meaningful; e.g. "Date of publication in Gregorian or Julian calendar:"

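
Several of the suggestions above (items 2, 4, 9, and 12) are simple string clean-ups, so a small sketch may help when comparing expected values. It is written in Python for illustration only and is not the project's transformation code; every function name is hypothetical, the ISBN pattern assumes unhyphenated numbers, and the terminal full stop of a statement is left untouched.

```python
import re

SUBFIELD_CODE = re.compile(r"\$[a-z0-9]\s*")
BRACKETED = re.compile(r"\[([^\]]*)\]")
# Assumption: ISBNs are recorded without hyphens or spaces.
ISBN = re.compile(r"\b(?:97[89]\d{10}|\d{9}[\dXx])\b")


def strip_subfield_codes(field_value):
    """Item 4: drop codes such as '$a' but keep the ISBD punctuation."""
    without_codes = SUBFIELD_CODE.sub("", field_value)
    return re.sub(r"\s+", " ", without_codes).strip()


def bare_isbn(identifier):
    """Item 2: return the first unhyphenated ISBN-10/13 found, else None."""
    match = ISBN.search(identifier)
    return match.group(0) if match else None


def unbracket(value):
    """Item 9: remove square brackets but keep what they enclose."""
    return BRACKETED.sub(r"\1", value)


def split_bracketed_date(value):
    """Item 12: return (date, note); the note carries any bracketed data."""
    match = BRACKETED.search(value)
    date = BRACKETED.sub("", value).strip()
    if match is None or not date:
        # No brackets, or the whole value is bracketed: nothing to move.
        return unbracket(value).strip(), None
    return date, f"Date of publication: {match.group(1)}"
```

With these definitions, `split_bracketed_date("1400 [2021 or 2022]")` yields the date `1400` plus a note string, while a fully bracketed value such as `[2020]` is only unbracketed, matching the treatment in item 9.
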
AdamSchiff commented 1 year ago

Gordon wrote:

<fake:marcfield>F245 10 $a Developing digital project delivery routines around frequent disruptions : $n numer 8 $c Hamid Abdirad.</fake:marcfield> <rdawd:P10088>Developing digital project delivery routines around frequent disruptions</rdawd:P10088>

$n should be part of the title: <rdawd:P10088>Developing digital project delivery routines around frequent disruptions, numer 8</rdawd:P10088>

($p is also part of the title when present, and comes after $n data if present.)

I don't disagree with Gordon. As shown, the punctuation is not correct and should be Developing digital project delivery routines around frequent disruptions. $n Numer 8

BUT: where is Numer 8 coming from? I just looked at the OCLC record for this thesis and there is no such data:

100 1 Abdirad, Hamid, ǂe author. ǂ1 http://www.wikidata.org/entity/Q101242292
245 10 Developing digital project delivery routines around frequent disruptions : ǂb how do AEC organizations respond to disruptive information exchange requirements? / ǂc Hamid Abdirad.
264 1 [Seattle] : ǂb [University of Washington Libraries], ǂc [2020]
264 4 ǂc ©2020

It appears that the record has been corrupted somehow in your transformation. "numer 8" is not present at all in the OCLC record that we created. If this is corrupted, then there is a good chance that other data is as well.

GordonDunsire commented 1 year ago

Here are my comments on Dataset 2:

  1. `<fake:marcfield>F245 10 $a Concertino per tromba e orchestra (2015) = $b für Trompete und Orchester = for trumpet and orchestra / $c Krzysztof Penderecki.</fake:marcfield>`

`für Trompete und Orchester = for trumpet and orchestra`

Is this correctly encoded in MARC? The non-repeat of "Concertino" is a problem.

  2. `<fake:marcfield>F264 #1 $a Mainz : $b Schott, $c [2017]</fake:marcfield>`

`[2017]`

I think we should strip off the brackets. They indicate that the data was taken from outside of the manifestation being described, which is provenance that is no longer relevant to retain.

  3. `<fake:marcfield>F264 #4 $c ©2017</fake:marcfield>`

`©2017`

I think 264 2nd ind = 4 should transform to rdamd:P30007 "has copyright date". The copyright symbol, the phonogram symbol, the string "(c)", the string "(p)", the string "copyright", the string "phonogram copyright", the letter "c", or the letter "p" should be stripped from the value: <rdamd:P30007>2017</rdamd:P30007><!--rdamd:P30007 = has copyright date-->

  4. <rdaed:P20215>For trumpet; piano [includes percussion staff]. Total performers: 2.</rdaed:P20215><!--rdaed:P20215 = has medium of performance of musical content-->

I think parentheses should be used in place of brackets. Brackets have a specific, albeit legacy, meaning and we should avoid potential confusion if we can.

  5. <fake:marcfield>F264 #2 $a [Milwaukee, Wisconsin] : $b distributed in North and South America exclusively by Hal Leonard.</fake:marcfield>

`<rdamd:P30108>[Milwaukee, Wisconsin] : distributed in North and South America exclusively by Hal Leonard.</rdamd:P30108>`

`[Milwaukee, Wisconsin] distributed in North and South America exclusively by Hal Leonard.`

Remove the brackets around dates and places in timespan and place elements, but retain in "statement" elements:

`[Milwaukee, Wisconsin] : distributed in North and South America exclusively by Hal Leonard. // this is ok`

`Milwaukee, Wisconsin`

  6. `F490 1# $a Trumpet library = $a Trompeten-Bibliothek ; $v TR 30`

`Trumpet library Trompeten-Bibliothek ; TR 30`

There should be a standard "statement" transform for the whole MARC 490 field, to produce: `Trumpet library = Trompeten-Bibliothek TR 30`

I think we should remove the contents of subfields l (LC call number), y (invalid ISSN), and z (cancelled ISSN) [treat $3, $7 as per general decision ...]. Retain the punctuation; remove the subfield encoding.

  7. `F500 ## $a This library lacks distributor information. $5 WaU`

This note is odd? A general note says the distributor info is taken from a label (presumably affixed by the seller) on the back cover, so does this note really say the WaU copy lacks the book jacket?

  8. `F245 10 $a Onlinedating.teenadultdating/Adult-dating / $c Angela Genusa.`

`OnlinedatingteenadultdatingAdult-dating`

I think the embedded punctuation should not be stripped out: `Onlinedating.teenadultdating/Adult-dating` [Anomalous anyway, because the hyphen is left in.] Cf "has title of manifestation", which does not strip out the punctuation.

  9. `F264 #1 $a [Place of publication not identified] : $b [Angela Genusa], $c 2012.`

`[Place of publication not identified] : [Angela Genusa], 2012.`

`[Place of publication not identified] [Angela Genusa]`

The standard "statement" output is fine. The standard value "[Place of publication not identified]" does not map to rdamd:P30088 = has place of publication. It should only be mapped into the statement output. The same applies to other standard "not identified" values. Strip the brackets from name of publisher: `Angela Genusa`

  10. `F500 ## $a UW Libraries has an unbound copy of single leaves, starting with the title page and ending at page 121. $5 WaU`

Another strange "holdings" note. If the manifestation being described has "volume" as a carrier type, the UW note is referring to a separate manifestation, presumably a photo-reproduction or print-off, that is unbound and possibly incomplete. The transform is ok.

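
Items 3, 4, and 6 above describe rules that can be sketched as small string functions. The sketch below is illustrative Python, not the project's transformation code, and all names in it are hypothetical. Two assumptions are worth flagging: the single letters "c" and "p" are only stripped when a digit follows, and the series statement keeps every piece of punctuation found in the subfields (so the semicolon before the numbering is retained, which may differ from the target value quoted in item 6). Subfields $3 and $7 are not given any special treatment.

```python
import re

# Longer markers come first so "phonogram copyright" is matched whole.
# Assumption: bare "c" or "p" counts as a marker only before a digit.
COPYRIGHT_MARKER = re.compile(
    r"^\s*(?:©|℗|\(c\)|\(p\)|phonogram copyright|copyright|[cp](?=\s*\d))\s*",
    re.IGNORECASE,
)
SUBFIELD = re.compile(r"\$(\w)\s*([^$]*)")


def copyright_date(subfield_c):
    """Item 3: strip a leading copyright or phonogram marker from 264 #4 $c."""
    return COPYRIGHT_MARKER.sub("", subfield_c, count=1).strip()


def brackets_to_parentheses(value):
    """Item 4: replace square brackets with parentheses."""
    return value.replace("[", "(").replace("]", ")")


def series_statement(subfields, drop=("l", "y", "z")):
    """Item 6: join 490 subfield contents, omitting the dropped codes.

    Punctuation that introduced a dropped subfield is not cleaned up.
    """
    kept = [
        text.strip()
        for code, text in SUBFIELD.findall(subfields)
        if code not in drop
    ]
    return " ".join(part for part in kept if part)
```

The functions expect the field content without the tag and indicators, for example `series_statement("$a Trumpet library = $a Trompeten-Bibliothek ; $v TR 30")`.
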
briesenberg07 commented 1 year ago

GitHub Markdown information
To use the # or @ character without inserting a link to an issue or user, respectively, these can be escaped with a preceding backslash (\). Enclosing in backticks (``) as I've done here works too.
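
For anyone pasting MARC data with indicators such as `#1` into comments, the backslash escaping described above can be applied automatically. This is a minimal sketch in Python; `escape_github_refs` is a hypothetical helper, and it only covers the `#` and `@` characters mentioned here.

```python
import re


def escape_github_refs(text):
    """Put a backslash before '#' and '@' unless one is already there."""
    return re.sub(r"(?<!\\)([#@])", r"\\\1", text)
```

Running the function on already escaped text leaves it unchanged, so it is safe to apply more than once.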

CECSpecialistI commented 1 year ago

2023-05-03 discussed until #7 in Gordon's comment