uwlib-cams / MARC2RDA

mapping between MARC21 and RDA-RDF
Creative Commons Zero v1.0 Universal

257 country of producing entity #123

Open CECSpecialistI opened 2 years ago

CECSpecialistI commented 2 years ago

https://github.com/uwlib-cams/MARC2RDA/blob/main/Working%20Documents/2XX.csv

cspayne commented 3 months ago

@pennylenger Hi Penny! I'm looking at row 4 in the mapping, for $0 with the condition of $1. The decision for $0s and $1s seems to indicate that $0s should not be mapped when a $1 is present. Do you think this is an exception to the rule and we should keep Zhuo's mapping of $1 rdap:P70057 [is place described with metadata by] $0, or should we change this row to not mapped?

pennylenger commented 3 months ago

Hi Cypress, you are right. According to the decision, if $0 exists alongside a $1 in the same field, ignore $0. I have changed it to not mapped.

CECSpecialistI commented 3 months ago

We are going to reverse that decision because it was a bad decision.

CECSpecialistI commented 3 months ago

We just haven't gotten our ducks (or...$0's and $1's) in a row yet.

GordonDunsire commented 3 months ago

Zhuo's mapping is fine. It applies to the case where both subfields $0 and $1 are present in the record. Subfield $1 is a rwo, and subfield $0 is an authority record (i.e. structured description about the rwo). The context of the decision is the preservation of as much data in the record as possible, rather than its utility. We might want to revisit the context :-)

The use of the canonical version of the relationship element sidesteps the issue of what kind of value is recorded in subfield $0. This avoids the transform having to determine if the value is an IRI or identifier for the authority record (which is always assumed to be a manifestation).

cspayne commented 3 months ago

> The use of the canonical version of the relationship element sidesteps the issue of what kind of value is recorded in subfield $0. This avoids the transform having to determine if the value is an IRI or identifier for the authority record (which is always assumed to be a manifestation).

RDF does not allow that kind of ambiguity; we need a way to determine whether a value is an IRI or an identifier. Can we assume that IRIs will begin with http: or https:, or are there other IRI schemes we are likely to encounter in a MARC record?
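A minimal sketch of that check in Python, assuming (as asked above, not yet decided) that only http: and https: values count as IRIs. The names `ASSUMED_IRI_SCHEMES`, `looks_like_iri`, and `classify_subfield` are hypothetical and not part of the transform.

```python
from urllib.parse import urlsplit

# Assumption: only these schemes are treated as IRIs; extend if other
# schemes turn up in real MARC data.
ASSUMED_IRI_SCHEMES = {"http", "https"}


def looks_like_iri(value: str) -> bool:
    """Return True if a $0/$1 value parses as an absolute http(s) IRI."""
    value = value.strip()
    # An IRI cannot contain whitespace, so reject such values early.
    if not value or any(ch.isspace() for ch in value):
        return False
    try:
        parts = urlsplit(value)
    except ValueError:
        # urlsplit raises ValueError on some malformed authorities.
        return False
    return parts.scheme.lower() in ASSUMED_IRI_SCHEMES and bool(parts.netloc)


def classify_subfield(value: str) -> str:
    """Label a subfield value as 'iri' or 'identifier' (plain string)."""
    return "iri" if looks_like_iri(value) else "identifier"
```

Anything that fails the check falls back to being treated as a string identifier, which leaves room for local post-processing to convert known identifier patterns later.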

lake44me commented 3 months ago

I'm inclined to agree with Gordon but want to point out that both $0 and $1 are repeatable (although $2 Source is not!). Does the line 4 mapping "work" in the case where there are two or more of each (unlikely for $0 but not impossible)? What about when sources are different between the $0(s) and $1(s)?

cspayne commented 3 months ago

This mapping is based on the decision made on 2022-03-30, which was overwritten 2023-03-01 with the current decision in the index (which we now want to revisit).

GordonDunsire commented 2 months ago

The thinking is that the default treatment of subfield $0 should be as a string (identifier); using the canonical (no range) form of the element should automatically default the object value to a quoted string. The string can then be converted to an object IRI by any local application; it is just a matter of whether the transform attempts this, or it is left to local post-processing. Use of the canonical form in the transform output is an indicator of this situation; if we used the datatype form of the element a local application might assume the value has already been processed (and who really reads the accompanying notes? ;-)

I vaguely recall that the original decision was overwritten because of the repeatability issue raised by @lake44me, and maybe information (from @AdamSchiff?) that not all targets of subfield $0 are structured metadata according to the RDA definition.

I think the essential point, not yet fully discussed, is the detection of IRIs as values of subfield $0. We know that practice includes the recording of subfield $0 instead of $1, and that conversion from 'authority' to rwo for known cases like LCNAF might be feasible.

cspayne commented 2 months ago

@AdamSchiff @GordonDunsire @pennylenger @CECSpecialistI

Based on our discussion on 2024-07-10 in the M2R meeting and updated based on Adam's and Crystal's replies:

Does this look right? Does this account for everything?

AdamSchiff commented 2 months ago

Do we not have a problem if the $1 is from id.loc.gov? Because that is an agent RWO. Then again, the MARC coding of authority records for geographic places, including jurisdictions, is 151 (geographic place), not 110 (corporate body), so now I'm questioning everything we discussed this morning. An entity coded 151 in the NAF can never be an RDA place?

Cypress, there is something left out I think: after you get the Geographic Area Code from the name authority for a place, you must delete the trailing hyphens when you append it to http://id.loc.gov/vocabulary/geographicAreas/ to get the geographic area code IRI. For example, the GAC for Germany is e-gx--- but the IRI is http://id.loc.gov/vocabulary/geographicAreas/e-gx
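The hyphen-stripping step can be sketched as below. This is only an illustration of the rule described above; `GAC_BASE` and `gac_to_iri` are hypothetical names, and retrieving the GAC from the name authority record is assumed to have happened already.

```python
# Base IRI for LC Geographic Areas, as given in the comment above.
GAC_BASE = "http://id.loc.gov/vocabulary/geographicAreas/"


def gac_to_iri(gac: str) -> str:
    """Build a Geographic Area Code IRI by dropping the trailing hyphens."""
    # Only trailing hyphens are removed; internal hyphens are part of the code.
    return GAC_BASE + gac.strip().rstrip("-")


print(gac_to_iri("e-gx---"))
```

For the Germany example, `e-gx---` becomes `http://id.loc.gov/vocabulary/geographicAreas/e-gx`.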

cspayne commented 2 months ago

> Cypress, there is something left out I think: after you get the Geographic Area Code from the name authority for a place, you must delete the trailing hyphens when you append it to http://id.loc.gov/vocabulary/geographicAreas/ to get the geographic area code IRI. For example, the GAC for Germany is e-gx--- but the IRI is http://id.loc.gov/vocabulary/geographicAreas/e-gx

Thank you Adam! I've updated it to reflect this.

For a $1 value from id.loc.gov that is an agent RWO (if we choose to stick to what we previously discussed), we could potentially move backwards from the RWO IRI to the NAF and retrieve the Geographic Area Code from there.

CECSpecialistI commented 2 months ago

Since $0 and $1 are repeatable, we should create triples for each occurrence of each $0 regardless of whether a $1 exists. And always map the $a's. CEY, CP, PS 2024-07-11

GordonDunsire commented 2 months ago

Example: Afghanistan

LCNAF: https://id.loc.gov/authorities/names/n79063030

LC rwo: http://id.loc.gov/rwo/agents/n79063030

LC rwo type: http://www.loc.gov/standards/mads/rdf/v1#Geographic

This has the definition 'Describes a resource whose label represents a geographic place or feature, especially when a more precise geographic determination (City, Country, Region, etc.) cannot be made.' The MADS supertype leads nowhere (pun intended :-) but the definition appears to be compatible with rdac:C10009 'A given extent of space'. It is safe to use the LC 'name' rwo IRI as an instance of RDA Place. The IRI subfolder structure is 'misleading', but IRIs do not have intrinsic meaning.

LC Countries: http://id.loc.gov/vocabulary/countries/af has type http://id.loc.gov/vocabulary/countries/MARC_Country This has a circular definition 'Resources that are instances of a MARC Country are described with this Class'. However, the instance IRI is subtyped as a MADS Geographic, so again this is compatible with RDA and it is safe to use the LC country IRI as an instance of RDA Place.

The outstanding question is: what is the subject of a triple generated from field 257? The spreadsheet has rdam:P30086 'has place of production' as the property, so the subject is an RDA Manifestation. But this is wrong, because the RDA definition, 'Relates a manifestation to a place that is associated with the inscription, fabrication, construction, or other method of production of an unpublished manifestation', is very different from the MARC field 257 definition, 'Name or abbreviation of the name of the country(s), area(s), etc. where the principal offices of the producing entity(s) of a resource are located. Entity(s) in this instance is the production company(s) or individual that is named in the statement of responsibility (subfield $c) of field 245 (Title Statement)'. The MARC definition refers to an agent who is responsible for a work/expression, but the RDA definition refers to an agent who is responsible for a manifestation.

Pre-3R RDA struggled to clarify the distinction between 'production' associated with radio, tv, film, and recorded performance productions and 'production' associated with the processes of creating a manifestation. This was additionally compounded by 'production' as a generic term for publication, distribution, and manufacture and as a separate term for artisanal creations such as paintings, manuscripts, drawings, etc. Official RDA now restricts the term to this last meaning.

I suspect much of the MARC 21 data does not make this distinction, unless the cataloguer realises the implication of a separate field 260 'Publication, distribution, etc.' for place of creation of the manifestation. The definition of 260 'Information relating to the publication, printing, distribution, issue, release, or production of a work' only makes sense if 'work' is synonymous with 'manifestation', but in that case the usual MARC 21 terminology is 'resource'. I think the MARC 21 definition reflects the conflation noted above, which ultimately stems from AACR2 usage.

I think field 257 is quite messy ...

AdamSchiff commented 2 months ago

We were not suggesting using the LC countries, but the LC Geographic Areas: http://id.loc.gov/vocabulary/geographicAreas. There are places used in 257 (e.g. Hong Kong, http://id.loc.gov/vocabulary/geographicAreas/a-cc-hk) that are not countries.

cspayne commented 2 months ago

We have also changed the property from rdam:P30086 'has place of production' to rdaw:P10316 'has related place of work'

GordonDunsire commented 1 month ago

Instances of LC Geographic Areas are typed as skos:Concepts and treated as an (attribute) authority file. In most cases, there is no semantic issue in saying that an instance that is a concept is also an instance of a place because Place can be modelled as a sub-type of Concept. However, instances include 'cold regions', 'Commonwealth countries', 'French community', etc. and these are not places. If we use the LC vocabulary, we must filter out the non-places from the transform.
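One way the filter might look, as a rough sketch only. The exclusion set below contains just the three examples named above and is certainly incomplete; `ASSUMED_NON_PLACE_LABELS` and `is_place_label` are hypothetical names, and matching on labels rather than codes is an assumption made for illustration.

```python
# Assumption: a hand-maintained list of LC Geographic Area labels that are
# not places. Only the three examples from this thread are included.
ASSUMED_NON_PLACE_LABELS = {
    "cold regions",
    "commonwealth countries",
    "french community",
}


def is_place_label(label: str) -> bool:
    """Return False for labels known not to denote an RDA Place."""
    return label.strip().lower() not in ASSUMED_NON_PLACE_LABELS


labels = ["Germany", "Cold regions", "Hong Kong", "French community"]
places_only = [label for label in labels if is_place_label(label)]
print(places_only)
```

A real filter would need the full list of non-place instances agreed by the group before it could be used in the transform.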

cspayne commented 3 days ago

When a source for $a is provided in $2, can we use this to mint an IRI, or should we stick with an opaque IRI?

GordonDunsire commented 2 days ago

@cspayne: I think we can extend the de-duplication method. If a source is known, the subfolder structure of the minted IRI should reflect the entity and the source, e.g. '../place/lcsh/...', and the final local part can be the heading stripped of punctuation and spaces. (Of course, this is only if there is no subfield $1 ...) Let's try it and see.
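A sketch of that minting pattern, to make the proposal concrete. The base `http://example.org/` is a placeholder, not the project's namespace, and `mint_place_iri` is a hypothetical helper; only the '../place/{source}/{heading}' shape comes from the suggestion above.

```python
import string

# Placeholder base; the project's real base IRI is not stated in this thread.
ASSUMED_BASE = "http://example.org/"


def mint_place_iri(heading: str, source: str) -> str:
    """Mint '<base>place/<source>/<heading without punctuation or spaces>'."""
    local = "".join(
        ch for ch in heading
        if ch not in string.punctuation and not ch.isspace()
    )
    return f"{ASSUMED_BASE}place/{source.strip().lower()}/{local}"


print(mint_place_iri("Hong Kong", "naf"))
```

Because the local part is derived only from the heading and the source, the same heading from the same source always yields the same IRI, which is what lets duplicates collapse later. Note that `string.punctuation` covers ASCII punctuation only, so headings with other punctuation marks would need extra handling.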

cspayne commented 8 hours ago

Test data is here

AdamSchiff commented 7 hours ago

257-test1wor

F257 ## $a Germany $0 http://id.loc.gov/authorities/names/n80125931 $1 http://id.loc.gov/rwo/agents/n80125931 $2 naf

F257 ## $a Italy ; France.

Italy ; France.

F257 ## $a Wisconsin $0 http://id.loc.gov/authorities/names/n79022855-781 $1 https://www.wikidata.org/wiki/Q1537

Wisconsin

257-test1exp

Germany

Some comments:

1. Why is this repeated?:

2. The semicolon indicates that there are two place entities that are the countries of production: F257 ## $a Italy ; France. Italy ; France. Can we use the presence of the semicolon to split these up and then do a lookup in the NAF for Italy and for France?

3. This has retrieved the wrong record in id.loc.gov. It has retrieved the geographic subdivision record for Wisconsin used in $z of subjects, not the authority record for Wisconsin, which would be http://id.loc.gov/authorities/names/n79022855. The geographic code could be grabbed from there: n-us-wi. F257 ## $a Wisconsin $0 http://id.loc.gov/authorities/names/n79022855-781 $1 https://www.wikidata.org/wiki/Q1537 Wisconsin

AdamSchiff commented 7 hours ago

In more recent cataloging instead of 257 Italy ; France a cataloger would have recorded 257 Italy $a France $2 naf (or it might also be 257 Italy ; $a France $2 naf) so the presence of two subfield $a's would indicate that there are two places for country of production. The semicolon could be thrown out if it's present in front of a $a.

I know that we are using the RDA property "has related place of work", but that is very general. Could we also add some kind of note specifying the nature of the relationship (i.e., that it refers to country of production)?

AdamSchiff commented 7 hours ago

Oh and a P.S.: 257 should never have a state like Wisconsin. It is supposed to be a country. All American productions should have 257 United States, not a particular state. That record should be cleaned up. Can you tell me what OCLC record this came from, and I can fix it?

cspayne commented 6 hours ago

> 1. Why is this repeated?: http://id.loc.gov/vocabulary/geographicAreas/e-gx http://id.loc.gov/vocabulary/geographicAreas/e-gx

It is repeated because the field F257 ## $a Germany $0 http://id.loc.gov/authorities/names/n80125931 $1 http://id.loc.gov/rwo/agents/n80125931 $2 naf has both a $0 and a $1 that are resolved to that IRI. We have agreed not to be concerned about duplicates, since they will automatically be de-duplicated in post-processing. If I serialized this into Turtle or re-serialized it into RDF using Python, the duplicates would disappear.
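The reason the duplicates disappear is that an RDF graph is a set of triples, so a repeated statement is stored only once. A library-free illustration follows; the subject `ex:work1` is a made-up placeholder, and a plain Python `set` stands in for whatever RDF library does the re-serialization.

```python
# Two identical statements, as produced when $0 and $1 resolve to the
# same geographic area IRI. "ex:work1" is a placeholder subject.
triples = [
    ("ex:work1", "rdaw:P10316",
     "http://id.loc.gov/vocabulary/geographicAreas/e-gx"),
    ("ex:work1", "rdaw:P10316",
     "http://id.loc.gov/vocabulary/geographicAreas/e-gx"),
]

# Loading into a set mirrors what parsing into an RDF graph does:
# identical triples collapse into one.
graph = set(triples)

for subject, predicate, obj in sorted(graph):
    print(subject, predicate, obj)
```

The list holds two entries but the set holds one, so only a single statement is written back out.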

> 2. The semicolon indicates that there are two place entities that are the countries of production: <fake:marcfield>F257 ## $a Italy ; France.</fake:marcfield> <rdawd:P10316>Italy ; France.</rdawd:P10316> Can we use the presence of the semicolon to split these up and then do a lookup in the NAF for Italy and for France?

We could split them up, but in this case we could not do a lookup in NAF. An IRI is retrieved based on the $0 or $1 value and this field does not have those.

> 3. This has retrieved the wrong record in id.loc.gov. It has retrieved the geographic subdivision record for Wisconsin used in $z of subjects, not the authority record for Wisconsin, which would be http://id.loc.gov/authorities/names/n79022855. The geographic code could be grabbed from there: n-us-wi <fake:marcfield>F257 ## $a Wisconsin $0 http://id.loc.gov/authorities/names/n79022855-781 $1 https://www.wikidata.org/wiki/Q1537</fake:marcfield> <rdawd:P10316>Wisconsin</rdawd:P10316>

> Oh and a P.S.: 257 should never have a state like Wisconsin. It is supposed to be a country. All American productions should have 257 United States, not a particular state. That record should be cleaned up. Can you tell me what OCLC record this came from, and I can fix it?

I think I may have grabbed this example from a different field to test how the lookup functions were working and did not focus on the accuracy of the values; that was my mistake. I don't think these values were taken directly from a record. They were put together just for testing.

> In more recent cataloging instead of 257 Italy ; France a cataloger would have recorded 257 Italy $a France $2 naf (or it might also be 257 Italy ; $a France $2 naf) so the presence of two subfield $a's would indicate that there are two places for country of production. The semicolon could be thrown out if it's present in front of a $a.

If there were two $a's present (and a $2 naf) the result would be two minted places, one for each subfield $a, where the nomen source is 'http://id.loc.gov/vocabulary/subjectSchemes/naf'. If there is no $2, it would result in two separate properties with string values.

> I know that we are using the RDA property "has related place of work", but that is very general. Could we also add some kind of note specifying the nature of the relationship (i.e., that it refers to country of production)?

Yes! This could be added to the mapping and the transform if it would be useful!

I can adjust the code and test input and re-run this with more/better examples :)

cspayne commented 4 hours ago

Here is the updated output I coded to account for a semicolon separating countries within one subfield, as well as ending in a semicolon or period.
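For readers following along, the splitting rule described here could be sketched in Python as below. This is an illustration of the rule only, not the transform's actual code; `split_places` is a hypothetical name.

```python
from typing import List


def split_places(subfield_a: str) -> List[str]:
    """Split a 257 $a value on semicolons and trim trailing ';' or '.'."""
    parts = subfield_a.split(";")
    # Trim whitespace, then any trailing period or semicolon, per part.
    cleaned = [part.strip().rstrip(".;").strip() for part in parts]
    # Drop empty pieces, e.g. from a value that ends in a semicolon.
    return [part for part in cleaned if part]


print(split_places("Italy ; France."))
print(split_places("Italy ;"))
```

One caveat with this approach: stripping a trailing period would also shorten a value that legitimately ends in an abbreviation, such as "U.S.", so such cases may need an exception list.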

If we implement a note on work that says something like "Country of production: [subfield value]", should this only be implemented for the $a values or for the IRIs that are produced as well?