uwlib-cams / MARC2RDA

mapping between MARC21 and RDA-RDF
Creative Commons Zero v1.0 Universal

264 production, publication, distribution, manufacture, and copyright notice #129

Closed CECSpecialistI closed 5 days ago

CECSpecialistI commented 2 years ago

https://docs.google.com/spreadsheets/d/14L9J5gNaThD73s5-zxkC9QVAuIkG2GhLS4r7GqNUYmo/edit#gid=2053543820

CECSpecialistI commented 2 years ago

From Crystal 2022-05-15:

The 264 is looking good, and I think your mappings make sense for native LRM/RDA/RDF descriptions, but I am uncertain about the mapping in the MARC-to-RDA context based on the mappings we've done up to this point.

We have decided (based on advice from Gordon and practices from the RSC) that in cases where a transcribed field gives a label for an intermediate object from another rda:Class, we would give the unstructured description as the direct value of the property.

So, for example, rather than:

[Manifestation]-->rdam:P30279-->[Place]--> rdap:P70001 -->[Nomen]-->rdan:P80057-->[value of $a] , for a native LRM/RDA/RDF description,

We would use the following for a MARC-generated description (because we want to avoid proliferation of blank nodes and cannot rely on the values in a transcription field to generate IRIs):

[Manifestation]-->rdam:P30279-->[value of $a]

But, in this case, the property, rdam:P30279, has a defined range for rda:Place. And then, the property rdap:P70001 has a defined range of Nomen. I think we have used literals as direct values for properties without a specified range, but I am uncertain about extending that policy to properties that do have defined ranges. If it's ok with you, I think this would be a good question to revisit at a meeting, since it will require us to decide outside the guidance provided in the RDA registry/toolkit.

CECSpecialistI commented 2 years ago

From Sofia 2022-05-15: thank you for your answer, the rdamd:P30279 is a datatype property so we can use it as it is. What we can discuss is if we should write in the mapping that we use the datatype (rdamd) or the object property (rdamo).

GordonDunsire commented 2 years ago

For linked data mappings, it is more precise, and potentially less lossy, to use the RDA datatype and object properties to distinguish between string appellations and thing IRIs as the value of a related RDA entity.

Strictly speaking, the canonical properties have no range.

Thus:

[Manifestation]-->rdamd:P30279-->[value of $a]

This statement entails:

[Manifestation]-->rdam:P30279-->[value of $a]

because of the subProperty relationship between the datatype element and the canonical element.

It is not feasible to use the object property:

[Manifestation]-->rdamo:P30279-->[Place]-->rdapd:P70001-->[value of $a]

or

[Manifestation]-->rdamo:P30279-->[Place]-->rdapo:P70001-->[Nomen]-->rdand:P80057-->[value of $a]

because, as both examples in the MARC 21 Bib Manual show, we cannot disambiguate the place string in $a:

Boston (Mass., USA) vs Boston (England)

Cambridge (Mass., USA) vs Cambridge (England)

We therefore cannot assign an IRI to [Place], so it becomes a blank node, and we want to avoid that.

RDA Toolkit gives explicit instructions about recording a place as a nomen string or as an IRI (blank or not); see https://access.rdatoolkit.org/en-US_ala-125eb55c-1dd0-31d1-a45b-2469dc36b400/div_mcb_4s4_gdb

Here, the 'value of Place: appellation of place' is the place string in $a.

szapoun commented 2 years ago

Thank you for the comments @GordonDunsire!

So, I will use datatype properties and declare this in the mapping using rdamd .

How does that sound?

In the meeting of 1st June, it was decided to use unstructured descriptions and datatype properties. Always select the property whose semantics imply the use of a literal (if we intend to use a literal as the value). As an example, "has name of publisher" versus "has publisher person". The first property implies the use of a string, while the latter implies the use of an instance of the Person entity. Of course, the datatype property "has publisher person" can be used with a string, but since another property with explicit semantics exists, use that one (the "has name of publisher" property in the example).

@GordonDunsire @AdamSchiff can you please check what I have written to be sure I have noted correctly what we talked about yesterday? Thank you

CECSpecialistI commented 2 years ago

Are there instructions somewhere on when and how to use datatype properties in RDA? A number of us (myself included) do not have a firm grasp on how to use these.

szapoun commented 2 years ago

The main difference between datatype properties and object properties is that the former relate an instance of a class (in the case of 264 an instance of the manifestation entity) to a literal/text. Object properties relate two instances of classes, e.g. an instance of the Manifestation entity to an instance of the Person entity.

example:M12345 (instance of Manifestation entity)--> rdamd:P30176 (has name of publisher) -->"John Doe"

example:M12345 (instance of Manifestation entity) --> rdamo:P30362 (has publisher person) --> example:P5678 (instance of Person entity) --> rdaad:P50111 (has name of person) --> "John Doe"

In this example the datatype properties are rdamd:P30176 & rdaad:P50111, while the object property is the rdamo:P30362.
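The two patterns above can be sketched in Python as follows. This is only an illustration, not the project's transform: the namespace IRIs are assumed to follow the rdaregistry.info pattern quoted elsewhere in this thread, the `http://example.org/` base and all class and function names are hypothetical, and the property numbers are the ones given in the example.

```python
from dataclasses import dataclass
from typing import List, Union

# Assumed namespaces, following the rdaregistry.info IRI pattern.
RDAMD = "http://rdaregistry.info/Elements/m/datatype/"
RDAMO = "http://rdaregistry.info/Elements/m/object/"
RDAAD = "http://rdaregistry.info/Elements/a/datatype/"
EX = "http://example.org/"  # hypothetical base for local entities


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class Triple:
    subject: IRI
    predicate: IRI
    obj: Union[IRI, Literal]


def publisher_name(manifestation: str, name: str) -> List[Triple]:
    """Datatype property: the value is a plain string."""
    return [Triple(IRI(manifestation), IRI(RDAMD + "P30176"), Literal(name))]


def publisher_person(manifestation: str, person: str, name: str) -> List[Triple]:
    """Object property: the value is another entity, which carries its own name."""
    return [
        Triple(IRI(manifestation), IRI(RDAMO + "P30362"), IRI(person)),
        Triple(IRI(person), IRI(RDAAD + "P50111"), Literal(name)),
    ]


def to_ntriples(triple: Triple) -> str:
    """Serialize one triple in an N-Triples-like form."""
    if isinstance(triple.obj, IRI):
        obj = f"<{triple.obj.value}>"
    else:
        escaped = triple.obj.value.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    return f"<{triple.subject.value}> <{triple.predicate.value}> {obj} ."


if __name__ == "__main__":
    for t in publisher_name(EX + "M12345", "John Doe"):
        print(to_ntriples(t))
    for t in publisher_person(EX + "M12345", EX + "P5678", "John Doe"):
        print(to_ntriples(t))
```

The datatype version needs one triple and no IRI for the person; the object version needs an IRI (or a blank node) for the person before the name can be attached.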

Hope this helps :)

pan-zhuo commented 2 years ago

If the datatype element implies the canonical element, it doesn't seem wrong to me to use the canonical one, especially when RDA explicitly says we can record a nomen string for an entity as a value of the canonical element?

Will implementing datatype/object elements be an extra layer of complexity?

GordonDunsire commented 2 years ago

The term 'canonical' only appears in the Toolkit in the context of sources of information. There are no references to canonical, datatype, or object elements. "Domain" is defined as "The RDA entity that is described by an element"; "range" is defined as "The RDA entity that is the value of a relationship element". These are not restricted to the technical RDF definitions because RDA has to accommodate all four implementation scenarios, and not just linked open data. The Toolkit is aimed at metadata creators and managers, not developers of applications.

The RDA Registry is aimed at developers. The Guide for developers explains the RDF view of canonical, datatype, and object elements in the "RDA elements" section. For further information, you will have to dive into OWL. Be aware of the general confusion between 'datatype property' (what we are talking about) and 'datatype' (the kind of string) in Google searches.

The reasons for using datatype and object elements are:

  1. It is a trivial machine process to convert statements that use datatype and object predicates to statements that use canonical predicates: remove "datatype" or "object" from the IRI of the predicate:

http://rdaregistry.info/Elements/w/datatype/P10429

-> http://rdaregistry.info/Elements/w/P10429

  2. It saves significant processing time in an application in determining if a value is a string or a potentially de-referenceable thing.

exW1 rdawd:P10429 someIRI -> exW1 rdawd:P10429 "someIRI" (the thing is stringified)

exW2 rdawo:P10429 "someIRI" -> exW2 rdawo:P10429 someIRI (the string is thingified)

[cf the problem with quotes and spreadsheets ...]

  3. Using an object predicate allows an application to entail the class of the related entity, but this cannot be done without additional processing if the canonical predicate is used.

exW2 rdawo:P10429 someIRI => someIRI isA rdac:C10004 (RDA Person) => exW2 isA rdac:C10001 (RDA Work)
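Points 1 and 2 above can be sketched in Python. This is a minimal illustration that assumes every RDA predicate IRI follows the pattern of the P10429 example; the regular expression and the function names are hypothetical and are not part of the RDA Registry or of the project's transform.

```python
import re

# Assumed IRI shape: <base>/Elements/<entity letter>/(datatype|object)/P<digits>
_TYPED_PREDICATE = re.compile(
    r"^(https?://(?:www\.)?rdaregistry\.info/Elements/[a-z]+)/(datatype|object)/(P[0-9]+)$"
)


def to_canonical(predicate_iri: str) -> str:
    """Drop the 'datatype' or 'object' path segment; other IRIs pass through unchanged."""
    match = _TYPED_PREDICATE.match(predicate_iri)
    if match is None:
        return predicate_iri
    return f"{match.group(1)}/{match.group(3)}"


def expected_value_kind(predicate_iri: str) -> str:
    """Tell from the predicate alone whether the value should be a string or an IRI."""
    match = _TYPED_PREDICATE.match(predicate_iri)
    if match is None:
        return "unknown"  # canonical predicate: the value itself must be inspected
    return "literal" if match.group(2) == "datatype" else "iri"


if __name__ == "__main__":
    p = "http://rdaregistry.info/Elements/w/datatype/P10429"
    print(to_canonical(p))
    print(expected_value_kind(p))
```

With a canonical predicate the second function can only answer "unknown", which is the extra processing cost that point 2 refers to.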

@pan-zhuo: If an object predicate is used for this purpose (3.), it is consistent to use a datatype predicate when "someIRI" is a literal (string). If the mapping 'knows' that the value is a string, then using a datatype element embeds this knowledge. There is no additional layer of complexity (1.).

I suspect that this topic will be important when mapping $0, which may be a string or a thing. For $1, it is definitely a thing, so why not say so?

@szapoun: structured descriptions and identifiers are also accommodated in datatype elements. In particular, be careful about using only unstructured descriptions: they are primarily used for manifestation statements and notes, and it is assumed that only general keyword indexing can be applied to the values. They are the 'dumbest' (least 'smart') values that are accommodated by RDA. Of course, if the source data isn't 'smart', this is the best you can do.

CECSpecialistI commented 2 years ago

Thanks for your comments, everyone! I understand the issue much more clearly now, and am in favor of using datatype and object properties when they are appropriate. The work we have done so far in recording method identification should make any retrospective changes easier to make than they would have been, and seem worthwhile if they will reduce processing time in implementation. Let's discuss at the meeting next week?

CECSpecialistI commented 2 years ago

See decision: https://github.com/uwlib-cams/MARC2RDA/wiki/Decisions-Index#ic3-datatypeobject-properties

gerontakos commented 1 year ago

Field is fully coded; however, we did not follow the spreadsheet: the spreadsheet used soft-deprecated properties (see https://www.rdaregistry.info/Aligns/alignSoft2Rec.html) when $a or $b contained an = sign (i.e. had parallel statements). Coders went ahead and used the "RecommendedLabel" in place of the "RedundantLabel." The mapping (i.e. the spreadsheet) was not changed. As a result, we will not close this issue yet. We are also tagging this issue "meeting discussion needed." Also, this was added to the agenda for our meeting scheduled 2022-09-28.

CECSpecialistI commented 1 year ago

Discussed during meeting 2022-09-28: Zhuo and Theo are right, and soft-deprecated properties should be updated according to https://www.rdaregistry.info/Aligns/alignSoft2Rec.html

@szapoun volunteered to do this spreadsheet update via email on 2022-09-28

szapoun commented 1 year ago

@CECSpecialistI I have updated the spreadsheet for both 260 and 264. Added in the comments the change. I also changed status to first pass.

CECSpecialistI commented 1 year ago

Thank you @szapoun ! @gerontakos @pan-zhuo does the spreadsheet now match the transform code? If so, I can mark this one "done".

gerontakos commented 1 year ago

My claim is yes, yes they match. I will now "Close with comment"; I believe that will auto-mark it as "Done."

CECSpecialistI commented 1 year ago

@szapoun we discovered during 008 mapping review today that we used a soft-deprecated property for copyright date, which should be manifestation copyright statement for the 264-4 $c. You mentioned you wanted to go back in and fix it. This comment is just a reminder <3 Thank you for your work!

szapoun commented 1 year ago

@CECSpecialistI @gerontakos Thanks for the reminder Crystal! I checked both 260 and 264 and I saw that I had already changed the mapping in late October. The coding though from Theo was done in late September. So, an update must be done at the coding level. Thank you, S.

CECSpecialistI commented 1 year ago

@gerontakos @pan-zhuo, @szapoun has just double-checked to make sure copyright dates for 264-4 and 260 are mapped to manifestation copyright statement rather than the soft-deprecated property "copyright date" in the mapping spreadsheets. Would one of you check to make sure the transformation (which was created when the spreadsheet was still using the soft-deprecated property) reflects the change? Thank you all for your work!

gerontakos commented 1 year ago

264-4 is accurately coded. 260 has not been coded yet, it is still "in progress" in the mapping.

CECSpecialistI commented 2 months ago

@lake44me The mappings marked "first pass" can be updated to "reviewed" based on conversations had in 2022

cspayne commented 1 month ago

@GordonDunsire

I have updated the mapping and the code based on what you have described in the tiny dataset and what I could find in old meeting notes and issues.

264 Test input and output are available, if you could take a look when you have time.

The changes made were:

I did not include any test data that used '='. There is code to account for this case, but I could not find any example MARC and did not make any changes to that aspect of the code.

GordonDunsire commented 1 month ago

@cspayne: I think it's better to use the MARC 21 manual text 'Copyright notice date' as the prefix in note on manifestation. This will preserve a distinction with the label of the RDA element, and I presume it will be consistent with using MARC 21 manual terminology for note prefixes.

I think the approach of bullet two should be revised. The examples include $c Haziran 2021, which is a month-year in Turkish, and more generally, the current approach would not record a date triple for 'January 2024', etc. Instead, is it possible for the transform to look for a 'naked' 4-digit year in the value, and if found, discard everything else? If not found, strip surround brackets and repeat the test, so that '[January 2024]' transforms to '2024'? This will result in what appear to be 'false' dates: Zamistān-i 1400 [2021 or 2022] transforms to '1400', but in pure RDA cataloguing the calendar would be a separate data provenance triple (which is beyond the scope of the transform). As with the other values, the original is preserved in the statement or note triples.

cspayne commented 1 month ago

@GordonDunsire

We can revise the approach for bullet two, but I think it may require more thought and consideration. For example, what do we do with: 264 #1 $a[Pullman, Washington] : $bCenter for Northwest Anthropology, $c[1995 or 1996]

If we strip the surrounding brackets, we would end up with two dates. Is something like this an outlier or do we need to account for it?

Looking at 264 $c values, I also came across some with multiple copyright dates in one subfield $c. Is this an error (a garbage-in-garbage-out thing), or something we should account for?

e.g. 264 #4 $c©2004, 2001, 1998

GordonDunsire commented 1 month ago

@cspayne: I think we need a wider group discussion on this, but these examples are ok if the transform removes all non-numeric characters from the value and then finds the first occurrence of a 4-digit year.

[1995 or 1996] => 1995 1996 => 1995

(c)2004, 2001, 1998 => 2004 2001 1998 => 2004

1400 [2021 or 2022] => 1400 2021 2022 => 1400

The year can then be output as the object string of the appropriate RDA element.
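A minimal Python sketch of this extraction, under one reading of the description: every run of non-digit characters is treated as a separator, and the first run of exactly four digits is returned. The function name is hypothetical, and this is not the project's transform code.

```python
import re
from typing import Optional


def first_four_digit_year(value: str) -> Optional[str]:
    """Return the first run of exactly four ASCII digits in value, or None."""
    # Replace every run of non-digit characters with a single space,
    # so that only the digit runs remain as tokens.
    digit_runs = re.sub(r"[^0-9]+", " ", value).split()
    for run in digit_runs:
        if len(run) == 4:
            return run
    return None


if __name__ == "__main__":
    samples = [
        "[1995 or 1996]",
        "©2004, 2001, 1998",
        "Zamistān-i 1400 [2021 or 2022]",
        "Haziran 2021",
        "[January 2024]",
        "1994-1996",
    ]
    for sample in samples:
        print(sample, "=>", first_four_digit_year(sample))
```

Digit runs of any other length are skipped, so a value such as '19--?' yields no year; in that case only the statement or note triples would carry the original value.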

Does this work for values that are estimated, etc.? Note that date ranges such as 1994-1996 are transformed to the first year of the range, but that is not incorrect.

I don't think the multiple copyright dates is an error. It is a typical 'copyright notice' that indicates a copyright history or copyright for different (sub) units of the manifestation. The MARC 21 semantics for subfield $c are incoherent: either a date or a notice that covers multiple dates. RDA has an option to record '(c)2004, 2001, 1998' as an unstructured description of 'has copyright date', so that is another option for the transform. Note that it is ok in RDA to record an earlier copyright date, so if the notice listed the dates in inverse order, the value '1998' for 'has copyright date' is not incorrect.

In short, the transform options for subfield $c and similar subfields are:

More than one option can be applied. Structured and unstructured values are for RDA datatype properties, but a structured year date can have an xsd date/time qualifier added to make the distinction.

It would be great to transform a year date to an object, say using wikibase IRIs, but this is thwarted by the use of different calendar systems.

GordonDunsire commented 1 month ago

However, this may all be subsumed by the values of dates in field 008. If there is a year date in 008, there may be no need to extract it from field 264.

cspayne commented 1 month ago

@GordonDunsire @CECSpecialistI @lake44me @szapoun @tmqdeborah

It looks like 008 provides date of manifestation or date of publication, but can't be mapped to any more specific properties. With 264, ind2 provides more specific detail on what the date in $c is for (production, publication, distribution, manufacture, copyright notice date). If 008 provides a date of publication, it makes sense to not map 264 ind2 = 1 $c values, because they will already be in 008, but for other ind2 values, won't 264 provide more specific information if we can extract the dates?

CECSpecialistI commented 2 weeks ago

I think we answered these questions during a recent meeting. Remove "code on hold" tag? Are we good to go @cspayne ?

cspayne commented 2 weeks ago

@CECSpecialistI

I believe the last question I asked still hasn't been answered. Once that is determined, then I can go in and update the code.

GordonDunsire commented 2 weeks ago

@cspayne: If this is the unanswered question: 'If 008 provides a date of publication, it makes sense to not map 264 ind2 = 1 $c values, because they will already be in 008, but for other ind2 values, won't 264 provide more specific information if we can extract the dates?'

Yes, we should prefer the field 264 data over the field 008 data for this and other reasons, but only when the 264 data is a plain 4-digit year that matches the 008 data. If the 264 data is complex, as discussed above, then the 008 data may contain information that is not recorded in 264, such as the intellectual extraction of the date year by the cataloguer.

I think this implies that the transform should process all of the 008 and 264 data without trying to avoid duplicate or contradictory statements.
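As a rough Python sketch of that conclusion: the function below reads Date 1 from 008 positions 07-10 and a year from each 264 $c, labels the 264 years by second indicator, and emits every statement without deduplication. The output tuples use descriptive labels rather than RDA property IRIs, and all names are hypothetical; the real transform may differ.

```python
import re
from typing import List, Optional, Tuple

# MARC 21 field 264, second indicator: function of entity.
IND2_DATE_KIND = {
    "0": "production",
    "1": "publication",
    "2": "distribution",
    "3": "manufacture",
    "4": "copyright notice",
}

# A run of exactly four digits, not adjacent to other digits.
_YEAR = re.compile(r"(?<![0-9])[0-9]{4}(?![0-9])")


def year_from_008(field_008: str) -> Optional[str]:
    """Return Date 1 (positions 07-10) if it is a plain 4-digit year."""
    date1 = field_008[7:11]
    return date1 if re.fullmatch(r"[0-9]{4}", date1) else None


def date_statements(
    field_008: str, fields_264: List[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Emit (source field, date kind, year) for every date found.

    fields_264 holds (second indicator, $c value) pairs. Duplicate or
    contradictory statements are deliberately kept.
    """
    statements = []
    year = year_from_008(field_008)
    if year is not None:
        statements.append(("008", "unspecified", year))
    for ind2, subfield_c in fields_264:
        kind = IND2_DATE_KIND.get(ind2)
        found = _YEAR.search(subfield_c)
        if kind is not None and found is not None:
            statements.append(("264", kind, found.group(0)))
    return statements


if __name__ == "__main__":
    f008 = "950101" + "s" + "1995" + " " * 4 + "wau" + " " * 17 + "eng" + " " + "d"
    for statement in date_statements(f008, [("1", "[1995 or 1996]"), ("4", "©1995")]):
        print(statement)
```

Keeping all statements leaves any reconciliation between 008 and 264 to the consumer of the data.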

dchen077 commented 6 days ago

@cspayne Reproduction conditions have been added for 264!

cspayne commented 5 days ago

Code is updated for reproduction conditions.