rism-digital / muscat

🗂️ A Rails application for the inventory of handwritten and printed music scores
http://muscat-project.org

Add 008 field for machine-processable dates #1151

Open ahankinson opened 3 years ago

ahankinson commented 3 years ago

Edit: The decision (10.11.2021) is to implement this as an 008 field; see the discussion below.


For RISM Online, we are parsing the 260 $c statement to try and extract numeric dates for sources so that we can do proper date range searches (e.g., "Find me sources between year XXXX and year YYYY"). In order to do ranges, it is a requirement that we use numeric data so that arithmetic can be done in the search system.

The 260 $c field is uncontrolled, which makes it difficult to extract dates when attempting to parse this to numbers. We are using some advanced heuristics, but problems remain, and we are reaching the point where correcting some dates causes other corrections to start failing.

Some examples of various systemic problems include:

... and many, many others. This is leading to many problems in RISM Online, where we have extreme dates (e.g., 1871 BC, or 171784 CE).
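One way such extreme values can arise is sketched below. This is only an illustration, not the RISM Online parser: the function name `naive_year` and the sample statements are hypothetical, and the digits-only approach is an assumed failure mode rather than a description of the actual heuristics.

```python
import re
from typing import Optional

def naive_year(statement: str) -> Optional[int]:
    """Hypothetical naive extraction: drop every non-digit and read the rest as one year."""
    digits = re.sub(r"\D", "", statement)
    return int(digits) if digits else None

print(naive_year("1784"))     # 1784
print(naive_year("1717-84"))  # 171784, an "extreme" year
print(naive_year("s.d."))     # None
```

A controlled numeric field avoids this class of problem because the value never has to be guessed from free text.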

It would be really useful if we had a field that could handle numeric-only dates, and validate these values. I recognize that this is likely not the goal for 260 $c, since cataloguers will need the flexibility to capture a date statement as written.

I would like to request that we re-introduce the 033 field (Date/Time and Place of an Event), but with the requirement that it contains a standard formatted date. The MARC specification includes a format for how incomplete dates should be encoded as well. This would be restricted to allowing only yyyymmdd values.

Validation on this field would restrict the values to only allow digits and the - character. While this field would be optional, cataloguers would be encouraged to fill this field in with a value if any datable evidence for the source allows.

The $a on the 033 field is repeatable, allowing for a single date or a range of dates. We should restrict this to no more than two -- a "start" (or a single date) and an optional "end" for dates that may be a range.

In the case of a single date, the first field indicator should be a 0. In the case that the optional end date is provided, the first field indicator should be a 2.

A few examples:

Single date, no month or day
033 0#$a1879----

November, 1973
033 0#$a197311--

November 11, 1973
033 0#$a19731111

1685 to 1750
033 2#$a1685----$a1750----

March, 1685 to 1750
033 2#$a168503--$a1750----

Sometime in the 1860s
033 2#$a1860----$a1869----

Sometime in the 1700s
033 2#$a1700----$a1799----

A span of 300 years
033 2#$a1200----$a1500----

None of the hyphens should be omitted in the MARC source, but they could optionally be omitted in the data entry field.
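The validation and hyphen padding described above could look roughly like the following. This is a minimal sketch under stated assumptions: the names `is_valid` and `normalize_entry` are hypothetical, and the rule that a day may only appear together with a month is my reading of the examples, not something Muscat implements.

```python
import re

# Four-digit year, then month and day as two digits or "--" each.
DATE_RE = re.compile(r"(\d{4})(\d{2}|--)(\d{2}|--)")

def is_valid(value: str) -> bool:
    """Check an 8-character yyyymmdd value where unknown parts are hyphens."""
    match = DATE_RE.fullmatch(value)
    if match is None:
        return False
    month, day = match.group(2), match.group(3)
    if month == "--":
        return day == "--"  # assumed rule: no day without a month
    if not 1 <= int(month) <= 12:
        return False
    return day == "--" or 1 <= int(day) <= 31

def normalize_entry(entry: str) -> str:
    """Pad a data-entry value such as '1879' with hyphens, then validate it."""
    padded = entry.ljust(8, "-")
    if not is_valid(padded):
        raise ValueError(f"not a valid yyyymmdd value: {entry!r}")
    return padded

print(normalize_entry("1879"))      # 1879----
print(normalize_entry("197311"))    # 197311--
print(normalize_entry("19731111"))  # 19731111
```

The check does not verify that a day exists in the given month (for example, 30 February would pass); a stricter version would need a calendar lookup.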

jenniferward commented 3 years ago

I agree that an encoded date field is necessary, but 033 isn't exactly correct for this. The 033 is not for the date of creation of an item but rather it encodes what is stated in the 518 field, which is for performance notes - in libraries, generally when a CD was recorded, in our context generally when a score was used in a performance (somewhat rare). See here for an explanation of standard MARC practice from Yale: https://web.library.yale.edu/cataloging/music/033field

The sort of encoded date that we need is recorded in the 008 and this information is generally derived from the 260 (264), so a connection between the two fields is an established one in MARC. https://www.loc.gov/marc/bibliographic/bd008a.html We would need position 06 to tell us what kind of date(s) it is, then position 07-10 for the first year and 11-14 for the second year.

There are several options for position 06, but I think we can limit it to:
s = single date
q = questionable date (for estimates)
i = known range of dates for a collection

Position 06 includes the possibility of detailed dates that include months and days (e, "detailed date"), but I don't think this level of granularity is needed for retrieval - as Andrew says, we just want "between year XXXX and year YYYY". So I suggest skipping months and days.

Unknowns are recorded with u.

So adapting and expanding on the examples above, in the 008 it would look like this (with some links to the Princeton catalog as examples of usage in libraries):

Single year, 1739
008/06: s 008/07-10: 1739
https://catalog.princeton.edu/catalog/9935454763506421/staff_view

Circa 1745
008/06: s 008/07-10: 1745
https://catalog.princeton.edu/catalog/9935507213506421/staff_view

Possibly 1740?
008/06: s 008/07-10: 1740
https://catalog.princeton.edu/catalog/9933418353506421/staff_view

Before 1748
008/06: q 008/07-10: 1uuu 008/11-14: 1748

After 1748
008/06: q 008/07-10: 1748 008/11-14: 1uuu

Note: Princeton encodes just a single year in before/after statements; not sure if we want that:
"Not before 1771": s1771
https://catalog.princeton.edu/catalog/9971601333506421/staff_view
"after 1774": s1774
https://catalog.princeton.edu/catalog/99107566203506421/staff_view

Middle of the 18th century using RISM standard 1740-1760
008/06: q 008/07-10: 1740 008/11-14: 1760

November 11, 1973
008/06: s 008/07-10: 1973

1750 to 1799 (estimated by the cataloger)
008/06: q 008/07-10: 1750 008/11-14: 1799
https://catalog.princeton.edu/catalog/9935470083506421/staff_view

1738 to 1743 (based on evidence in the source)
008/06: i 008/07-10: 1738 008/11-14: 1743
https://catalog.princeton.edu/catalog/9935399813506421/staff_view

Sometime in the 1740s
008/06: s 008/07-10: 174u

Between 1740 and 1749
008/06: q 008/07-10: 1740 008/11-14: 1749
https://catalog.princeton.edu/catalog/9980042393506421/staff_view

Sometime in the 1800s
008/06: s 008/07-10: 18uu
https://catalog.princeton.edu/catalog/99104392823506421/staff_view

No date
008/06: q 008/07-10: 1600 008/11-14: 1900 (see Comments)

A span of 300 years
008/06: q 008/07-10: 12uu 008/11-14: 15uu (see Comments)

Comments:
The span of 300 years does not come out as clearly in the last example. I'd be open to having a local RISM practice that uses 1200 / 1500 instead of 12uu / 15uu.
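To make the difference between 12uu / 15uu and 1200 / 1500 concrete, here is a small sketch of how a search system might turn the three 008 pieces into a numeric year range. The function name `year_bounds` is hypothetical, and reading `u` as 0 for the earliest year and as 9 for the latest year is an assumption for illustration, not an agreed RISM rule.

```python
from typing import Tuple

ALLOWED_TYPES = {"s", "q", "i"}  # the reduced list proposed for 008/06

def year_bounds(date_type: str, date1: str, date2: str = "    ") -> Tuple[int, int]:
    """Return (earliest, latest) year from 008/06, 008/07-10 and 008/11-14.

    Assumption: 'u' counts as 0 in the earliest year and as 9 in the latest.
    """
    if date_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported 008/06 value: {date_type!r}")
    earliest = int(date1.replace("u", "0"))
    # With no second date, the upper bound is derived from the first date.
    latest_source = date1 if date2.strip() == "" else date2
    latest = int(latest_source.replace("u", "9"))
    return earliest, latest

print(year_bounds("s", "1739"))          # (1739, 1739)
print(year_bounds("s", "174u"))          # (1740, 1749)
print(year_bounds("q", "1uuu", "1748"))  # (1000, 1748)
print(year_bounds("q", "12uu", "15uu"))  # (1200, 1599)
print(year_bounds("q", "1200", "1500"))  # (1200, 1500)
```

Under this reading, 12uu / 15uu covers 400 years rather than 300, which is one argument for the local 1200 / 1500 practice.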

In my opinion this field should be required, otherwise people won't fill it out. We really need to encourage people to date their sources more often. It is comfortable to say "s.d." but surely we can figure out at least reasonable centuries based on archival context, institutional history, composer life dates, etc. Perhaps we can come up with standardized estimates that people can apply to make it feel less rigid, for example 1600-1900 or so.

ahankinson commented 3 years ago

008 looks OK. It also allows BCE dates, which the 033 does not. 008 has the disadvantage that it also encodes a lot of other material (Place of publication, Language, etc.) while 033 is specifically date and time.

I'm not completely convinced by the Yale application note; the MARC21 description just says it's a date/time of an event, with no additional semantics. The examples in the documentation seem to indicate that it can be for any material. There are also at least two examples where a 033 does not have a corresponding 518. So I don't think those should be hard-and-fast reasons to not use it.

For 008, I'm not particularly crazy about using 9999. I can't see a case for encoding this sort of date -- we have no serials or things that have not ceased publication. So I don't think that should be an option. The latest possible date should be the current year.

I'm also not particularly keen on showing uu to the users for unknown parts of a date. A dash is a neutral space indicator; u implies some semantics (e.g., "unknown"), which may give some cataloguers pause. ("Is it really unknown? How do I know if it's unknown?")

I envision a field with a fixed number of spaces indicated by dashes. Typing in the field would fill the spaces from left to right; no more, no less. The field would not allow any more than four digits; backspacing would clear a digit and replace it with a dash.
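The entry behaviour described here can be modelled in a few lines. This is only a sketch of the idea, not Muscat's editor code; the class `YearField` and its method names are hypothetical.

```python
class YearField:
    """Hypothetical model of a four-slot year field shown as dashes."""

    WIDTH = 4

    def __init__(self) -> None:
        self.digits = ""

    def type_char(self, ch: str) -> None:
        # Accept a single ASCII digit, and only while a slot is still free.
        if len(ch) == 1 and ch in "0123456789" and len(self.digits) < self.WIDTH:
            self.digits += ch

    def backspace(self) -> None:
        # Clearing the last digit turns its slot back into a dash.
        self.digits = self.digits[:-1]

    def display(self) -> str:
        return self.digits.ljust(self.WIDTH, "-")

field = YearField()
print(field.display())  # ----
for ch in "12":
    field.type_char(ch)
print(field.display())  # 12--
for ch in "345":
    field.type_char(ch)  # the fifth digit is ignored
print(field.display())  # 1234
field.backspace()
print(field.display())  # 123-
```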

We can, of course, transform the dashes to u behind the scenes if we wanted to stick with the MARC spec when we store it. (Tangentially, I wish the MARC people would sort out their date-time format specs... it seems like there's a different standard for each field!).

We also need to be clear that 12-- does not mean 1300-1399.

There is a difference between 12-- and 1200 -- the latter is a certain date, the former makes no claims on the fixed year. Numerically we would transform it to 1200, but we could also analyze it for extra data like "is an uncertain date" if it contains the uncertainty characters.
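A sketch of the two transformations mentioned above: storing dashes as `u`, and deriving a number plus an uncertainty flag. The function names are hypothetical, and mapping an unknown digit to 0 follows the "12-- becomes 1200" example in this comment.

```python
from typing import Tuple

UNCERTAIN_CHARS = "-u"

def to_marc(displayed: str) -> str:
    """Convert the dashes shown to the cataloguer into MARC-style 'u'."""
    return displayed.replace("-", "u")

def to_numeric(value: str) -> Tuple[int, bool]:
    """Return (year, is_uncertain) for a value such as '12--', '12uu' or '1200'."""
    is_uncertain = any(ch in UNCERTAIN_CHARS for ch in value)
    year = int(value.replace("-", "0").replace("u", "0"))
    return year, is_uncertain

print(to_marc("12--"))     # 12uu
print(to_numeric("12--"))  # (1200, True)
print(to_numeric("1200"))  # (1200, False)
```

Both values sort and filter as 1200, but the flag keeps the distinction between a certain year and an open-ended one.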

We could make it a guideline policy that the field is required, but it would still need to be optional when saving a record, because people editing a record need to be able to save it without filling in a date for an item they don't have to hand.

ahankinson commented 2 years ago

At the 10 November meeting we agreed to use the 008 field for storing machine-readable dates, so I think we can move forward with this. I'll change the title and description above to avoid confusion.

One thing left to decide is how to enter the value for position 06 of the 008 field.

jenniferward commented 2 years ago

I was thinking the cataloger would enter the 06 (see above for my reduced list) from a dropdown list, but now I wonder if we could simplify this in the Muscat application of 008 and agree to just 2 choices, perhaps:
s = single date
i = range of dates

Would it be possible for an 06 to be added automatically by Muscat, based on whether a single year or a range of years is entered?
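In principle this looks straightforward. The sketch below derives position 06 from whether a second year is present and returns the characters for 008/06-14. The function name `date_segment` is hypothetical and this is not existing Muscat code; leaving 008/11-14 blank for a single date follows the MARC 21 definition of code s.

```python
import re
from typing import Optional

YEAR_RE = re.compile(r"[0-9u]{4}")  # four characters, digits or 'u'

def date_segment(first_year: str, second_year: Optional[str] = None) -> str:
    """Build 008/06-14: type code, first year, second year (or four blanks)."""
    for year in (first_year, second_year):
        if year is not None and not YEAR_RE.fullmatch(year):
            raise ValueError(f"expected four digits or 'u' characters: {year!r}")
    if second_year is None:
        return "s" + first_year + "    "  # single date: 008/11-14 left blank
    return "i" + first_year + second_year

print(repr(date_segment("1739")))          # 's1739    '
print(repr(date_segment("1738", "1743")))  # 'i17381743'
```

A two-code scheme like this cannot express q (questionable date), so estimated ranges would be stored the same way as ranges based on evidence.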

ahankinson commented 1 year ago

I recently came across the 046 field, which seems like it might be a better fit for this.

The advantages of 046 over 008 are:

HirschSt commented 1 year ago

BTW 008 is already used for export, see https://github.com/rism-international/muscat-maintenance/blob/8fb007f63560f2ab0a9ad4c0cacbc5e5b8e104ab/export/sources_to_bsb.rb#L30

ahankinson commented 1 year ago

Yes; the idea would be to actually add the dates of the record in there, rather than the 'created' date.

jenniferward commented 1 year ago

Sure! Thanks for spotting it. I don't see the 046 applied in library catalogs as widely as the 008 (I checked Princeton above, Northwestern, and the DNB and found only a few, so I can't generalize), but the 046 is clear enough, and our application of it would be clear enough, that I don't foresee any misunderstandings.

ahankinson commented 1 year ago

Keeping well-structured data in 046 could mean that we could automatically add it (or an extract of it) to the 008, but I don't think we could go the other way around.