lcnetdev / marc2bibframe2

Convert MARC records to BIBFRAME2 RDF
http://www.loc.gov/bibframe/
Creative Commons Zero v1.0 Universal

Field 362 and firstIssue/lastIssue #74

Closed kiegel closed 5 years ago

kiegel commented 6 years ago

Field 362 (Dates of Publication and/or Sequential Designation) contains beginning and/or ending designations for serial issues. This is a well-known problem: splitting field 362 into beginning and ending designations at the first hyphen leads to bad results.

For example:

362 0_ |a al-Sanah 1., al-ʻadad 1. (Kānūn al-Thānī 1953)-al-sanah 60, al-ʻadad kharīf 2012.

becomes:

bf:firstIssue "al"
bf:lastIssue "Sanah 1., al-ʻadad 1. (Kānūn al-Thānī 1953)-al-sanah 60, al-ʻadad kharīf 2012"
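
The failure above is easy to reproduce outside the converter. The following is a minimal Python sketch of a split at the first hyphen, not the converter's actual XSLT; the function name `naive_split` and the stripping of the final period are assumptions made to match the output shown.

```python
from typing import Tuple


def naive_split(statement: str) -> Tuple[str, str]:
    """Split a 362 statement at its FIRST hyphen (the problematic behavior).

    Hypothetical helper for illustration only; it mimics the observed
    output and is not taken from the marc2bibframe2 source.
    """
    # partition() cuts at the first "-" it finds, wherever that is.
    first, _, last = statement.partition("-")
    # Assumption: the trailing period is dropped, as in the output above.
    return first.strip(), last.strip().rstrip(".")


statement = (
    "al-Sanah 1., al-ʻadad 1. (Kānūn al-Thānī 1953)"
    "-al-sanah 60, al-ʻadad kharīf 2012."
)
first_issue, last_issue = naive_split(statement)
print("bf:firstIssue", repr(first_issue))
print("bf:lastIssue", repr(last_issue))
```

The first hyphen belongs to the romanized article "al-", so the first issue collapses to "al" and everything else lands in the last issue.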

There is no obvious solution for an algorithm that would split field 362 correctly. Perhaps for converted records BIBFRAME needs a property representing start/end dates together, in other words, an option for a description that is not as fine-grained as firstIssue/lastIssue.

kiegel commented 6 years ago

The statements in field 362 have been created using different rules over time and are quite varied. No algorithm is going to split them into bf:firstIssue and bf:lastIssue elements with complete success until artificial intelligence can be used to imitate human analysis. However, I have examined a number of cases and think it is possible to do a better job than the current version.

Field 362 has two values for the first indicator. Value 1 (unformatted note) is easily handled with bf:note, as is currently done by the converter. Value 0 (formatted style) is the problem and everything below applies to this indicator value.

Instead of splitting at the first hyphen, I suggest using an xsl:choose element and testing for a number of cases. This approach can deal with multiple hyphens, which improves support for internationalization by reducing errors in romanized text from languages such as Arabic and the CJK languages. The first three cases described below handle special situations, the next four handle hyphens in various positions, and the final one is for everything left over. I am not a professional programmer so I cannot supply finished code, but I have included the tests I used in my analysis. I did not examine 880 fields; if they follow the same patterns they may be okay.

No Hyphen

Some 362 fields have first indicator 0 and an unformatted note (this probably should not happen, but it does). These go in bf:note, as with first indicator 1.

362 0_ |a Began in 1989 (OCLC #700325835)

Semicolon

Some pre-AACR2 statements use one or more semicolons. There is a common pattern (first example), where a range of numbering precedes the semicolon and a range of chronology follows it, but there are many exceptions (second example). There is no way to reliably split these statements and reconfigure them, so bf:note is probably the best approach.

362 0_ |a no. 1-3; 1975-1977. (OCLC #03498320)

362 0_ |a 3d ser., v. 1-v. 54; 4th ser., v. 1-v. 5 (1937). (OCLC #01564067)

Equal Sign

An equal sign is used to record parallel statements, usually full parallelism (first example) but sometimes partial parallelism (second example). The statement should be split at the equal sign and then each part processed separately (recursion).

362 0_ |a Vol. 62 (May 1959)-v. 73 (Jan. 1970) = No. 768-896. (OCLC #09419184)

362 0_ |a Sāl-i 1., shumārah-ʼi-i 1. va 2. (pāyīz va zamistān 1374 [fall and winter 1995])-sāl-i 4., shumārah-ʼi 4. (zamistān 1377 [winter 1998]) = shumārah-ʼi payāpay 14. (OCLC #48254118)

Trailing Hyphen

Many statements end in a hyphen, and these go in bf:firstIssue. The second example shows how a problem with additional hyphens is avoided.

362 0_ |a No. 1- (OCLC #08872407)

362 0_ |a al-ʻAdad 1.- (OCLC #26491387)

Leading Hyphen

Less frequently, a statement begins with a hyphen, and these go in bf:lastIssue.

362 0_ |a -6th (Sept. 28, 1972). (OCLC #04965237)

362 0_ |a -al-taqrīr al-sanawī 5. (1998). (OCLC #49360510)

Single Hyphen

When a statement contains a single hyphen, it can be split into bf:firstIssue and bf:lastIssue, although a few errors remain: the second example contains a first issue with a hyphen used in romanization.

362 0_ |a July 1, 1964-June 30, 1965. (OCLC #13650114)

362 0_ |a al-ʻAdad 4. (Fabrāyir 1985). (OCLC #16153008)

Close Parenthesis/Hyphen

A close parenthesis immediately followed by a hyphen is a reliable indicator of a split point; the two parts go in bf:firstIssue and bf:lastIssue. Note that the closing parenthesis of the first issue needs to be restored because it is removed by substring-before.

362 0_ |a Vol. 1 (1865-1866)-v. 62 (1926-1927). (OCLC #01460739)

Unresolved Cases

At this point, not much is left. These statements contain two or more hyphens with no clear marker to split them into first and last issue, if a split is needed at all. I suggest putting them in bf:note, which avoids introducing errors through incorrect splitting. Here are some examples.

Hyphens in the chronology:

362 0_ |a Jan. 1962-Mar. 1965-Sept. 1966-Dec. 1967. (OCLC #01796312)

Pre-AACR2 statements using a comma or period instead of a semicolon:

362 0_ |a n.F., Heft 1-25, 1929-45. (OCLC #06547795)

362 0_ |a v. 1-3. Mar. 1967-Mar. 1973. (OCLC #173327995)

First issue with hyphens used in romanization:

362 0_ |a [Dai 1-gō] (Shōwa 41-nen 12-gatsu [Dec. 1966]) (OCLC #42589608)

First and last issue with hyphens used in romanization:

362 0_ |a al-ʻadad .1- Yanāyir 1974-[al-ʻadad 10. (Mārs 2002)]. (OCLC #02914532)

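
The eight cases above can be sketched as a first-match-wins chain, the same way xsl:choose evaluates its xsl:when branches in order. This is a rough Python illustration of the proposed logic, not the XSLT in the converter. The function name `classify_362`, the (property, value) tuple shape, and the choice to drop the dangling hyphen in the trailing and leading cases are all assumptions. Input is the |a value only, without the OCLC annotations used in the examples, and no cleanup of final punctuation is attempted.

```python
from typing import List, Tuple

Statement = Tuple[str, str]  # (BIBFRAME property, value); hypothetical shape


def classify_362(statement: str) -> List[Statement]:
    """Classify a formatted 362 $a value (first indicator 0) by the cases above."""
    s = statement.strip()

    # No hyphen: treated as an unformatted note.
    if "-" not in s:
        return [("bf:note", s)]
    # Semicolon: pre-AACR2 pattern, not reliably splittable.
    if ";" in s:
        return [("bf:note", s)]
    # Equal sign: parallel statements; process each part recursively.
    if "=" in s:
        results: List[Statement] = []
        for part in s.split("="):
            if part.strip():
                results.extend(classify_362(part))
        return results
    # Trailing hyphen: open-ended run, first issue only.
    if s.endswith("-"):
        return [("bf:firstIssue", s[:-1].rstrip())]
    # Leading hyphen: last issue only.
    if s.startswith("-"):
        return [("bf:lastIssue", s[1:].lstrip())]
    # Single hyphen: split there (can still mis-split romanized text).
    if s.count("-") == 1:
        first, _, last = s.partition("-")
        return [("bf:firstIssue", first.strip()), ("bf:lastIssue", last.strip())]
    # Close parenthesis next to a hyphen: split at ")-" and restore the ")".
    if ")-" in s:
        first, _, last = s.partition(")-")
        return [("bf:firstIssue", first + ")"), ("bf:lastIssue", last.strip())]
    # Everything left over: keep the statement whole.
    return [("bf:note", s)]


for prop, value in classify_362(
    "al-Sanah 1., al-ʻadad 1. (Kānūn al-Thānī 1953)"
    "-al-sanah 60, al-ʻadad kharīf 2012."
):
    print(prop, repr(value))
```

For the Arabic statement from the original report, this sketch splits at ")-", so the first issue ends in "(Kānūn al-Thānī 1953)" instead of being truncated to "al". The known single-hyphen error is still present: a statement such as "al-ʻAdad 4. (Fabrāyir 1985)." is split at its romanization hyphen.
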
wafschneider commented 5 years ago

Addressed in commit a99dc139e9669628297af492569de87cc8840754, to be included in v1.4.0.