sul-dlss / cocina-models

Cocina repository data model (implemented in Ruby)
https://sul-dlss.github.io/cocina-models/

Date validation #392

Closed justinlittman closed 2 years ago

justinlittman commented 2 years ago

From sul-dlss/argo#3375: validate dates (per @arcadiafalcone).

Date encodings

w3cdtf, edtf, iso8601
Year:
YYYY (eg 1997)
Year and month:
YYYY-MM (eg 1997-07)
Complete date:
YYYY-MM-DD (eg 1997-07-16)
Complete date plus hours and minutes:
YYYY-MM-DDThh:mm (eg 1997-07-16T19:20)
Complete date plus hours, minutes and seconds:
YYYY-MM-DDThh:mm:ss (eg 1997-07-16T19:20:30)
Complete date plus hours, minutes, seconds and a decimal fraction of a second:
YYYY-MM-DDThh:mm:ss.s (eg 1997-07-16T19:20:30.45)
Complete date plus hours and minutes with time zone:
YYYY-MM-DDThh:mmTZD (eg 1997-07-16T19:20+01:00)
Complete date plus hours, minutes and seconds with time zone:
YYYY-MM-DDThh:mm:ssTZD (eg 1997-07-16T19:20:30+01:00)
Complete date plus hours, minutes, seconds and a decimal fraction of a second with time zone:
YYYY-MM-DDThh:mm:ss.sTZD (eg 1997-07-16T19:20:30.45+01:00)

edtf
-YYYY (eg -3999)

iso8601
YYYYMMDD
YYYYMMDDThhmm
YYYYMMDDThhmmss
YYYYMMDDhhmm
YYYYMMDDhhmmss
YYYYMMDDThhmmss.s+
YYYYMMDDhhmmss.s+
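
As a rough illustration, the patterns listed above could be checked with regular expressions along these lines. This is a minimal sketch, not the cocina-models implementation: the constant and method names are hypothetical, the time zone designator is assumed to be `Z` or `+hh:mm`/`-hh:mm`, and only the format is checked (so `2022-02-30` would pass).

```ruby
# Hypothetical sketch: format-only checks for the patterns listed in this issue.

# Patterns shared by w3cdtf, edtf and iso8601 (YYYY through YYYY-MM-DDThh:mm:ss.sTZD).
SHARED_PATTERN =
  /\A\d{4}(?:-\d{2}(?:-\d{2}(?:T\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?(?:Z|[+-]\d{2}:\d{2})?)?)?)?\z/

# Additional edtf pattern: -YYYY.
EDTF_EXTRA_PATTERN = /\A-\d{4}\z/

# Additional iso8601 patterns: YYYYMMDD, optionally followed by (T)hhmm, (T)hhmmss, (T)hhmmss.s+.
ISO8601_BASIC_PATTERN = /\A\d{8}(?:T?\d{4}(?:\d{2}(?:\.\d+)?)?)?\z/

PATTERNS_BY_ENCODING = {
  'w3cdtf' => [SHARED_PATTERN],
  'edtf' => [SHARED_PATTERN, EDTF_EXTRA_PATTERN],
  'iso8601' => [SHARED_PATTERN, ISO8601_BASIC_PATTERN]
}.freeze

# Returns true when the value matches one of the listed patterns for the encoding.
# Unknown encodings match nothing.
def date_format_listed?(value, encoding)
  PATTERNS_BY_ENCODING.fetch(encoding, []).any? { |pattern| pattern.match?(value) }
end
```

For example, `date_format_listed?('-3999', 'edtf')` is true while `date_format_listed?('-3999', 'w3cdtf')` is false, because the `-YYYY` pattern is listed only for edtf.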
justinlittman commented 2 years ago

arcadiafalcone commented 2 years ago

All EDTF dates should be remediated by the end of the week. I need to request updated reports on W3CDTF and ISO8601, but those will probably take longer. H2 data should be coming in clean. MARC is more complicated because the error is in the MARC record, which the person working in Argo may not be able to edit. Currently it looks like all the MARC-derived invalid dates are from records provided by the same vendor, so this may not be a common occurrence.

The re-upload would be treated the same as any other, and require valid dates to pass. (Hopefully remediation will minimize this issue.)

justinlittman commented 2 years ago

Since dates are going to be remediated, I'm moving this to cocina models for validation.

arcadiafalcone commented 2 years ago

Note pattern YYYYMM-- removed from iso8601.

arcadiafalcone commented 2 years ago

Additional ISO 8601 date patterns:

YYYYMMDDThhmmss.s+
YYYYMMDDhhmmss.s+

jcoyne commented 2 years ago

@arcadiafalcone do we want to validate the semantics (e.g. 2022-02-30) of the dates too? Should we permit BC dates? Are there any parts of ISO8601 that we want to disallow?
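
To make the semantic question concrete, here is a minimal sketch of how a complete date could be checked against the calendar with Ruby's standard library. The helper name is hypothetical, it handles only the `YYYY-MM-DD` form (optionally with a leading minus sign), and it is not necessarily how cocina-models does it.

```ruby
require 'date'

# Hypothetical helper: true only when a (-)YYYY-MM-DD string names a real calendar day.
def real_calendar_date?(value)
  match = /\A(-?\d{4})-(\d{2})-(\d{2})\z/.match(value)
  return false unless match

  # Parse as base 10 so zero-padded parts such as "08" are not read as octal.
  year, month, day = match.captures.map { |part| Integer(part, 10) }
  Date.valid_date?(year, month, day)
end
```

With this sketch, `real_calendar_date?('2022-02-30')` is false and `real_calendar_date?('2024-02-29')` is true. Ruby's `Date` accepts negative years, so a value such as `-0044-03-15` is also accepted here; whether BC dates should be permitted remains the policy question asked above.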

arcadiafalcone commented 2 years ago

@jcoyne Yes, that would be great.

mjgiarlo commented 2 years ago

@arcadiafalcone I've run these values through a couple different EDTF validators and they show as invalid for EDTF:

What are your thoughts on how we should proceed?

arcadiafalcone commented 2 years ago

According to the LC EDTF specification (https://www.loc.gov/standards/datetime/) those all appear acceptable. I'm curious why they're not passing, but they should be considered valid.

mjgiarlo commented 2 years ago

@arcadiafalcone The following W3CDTF values also appear to be invalid:

Looking over https://www.w3.org/TR/NOTE-datetime, these look like they should be marked invalid.

mjgiarlo commented 2 years ago

@arcadiafalcone

> According to the LC EDTF specification (loc.gov/standards/datetime) those all appear acceptable. I'm curious why they're not passing, but they should be considered valid.

I'll have a look.

At a glance, I'm not sure I see these in the LC spec:

arcadiafalcone commented 2 years ago

@mjgiarlo Now I'm thinking it's better to start with the built-in validation and see if we have any data that matches these questionable patterns. Remediation may be the preferable route if we have inconsistencies.

mjgiarlo commented 2 years ago

@arcadiafalcone

> @mjgiarlo Now I'm thinking it's better to start with the built-in validation and see if we have any data that matches these questionable patterns. Remediation may be the preferable route if we have inconsistencies.

ok! I'll make sure the DSA reports I run use the same validation so you have solid numbers on this. Thank you. :)
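
One way to keep report numbers consistent with the validation is to pass the same validator into the report code. The sketch below is only an illustration of that idea: the record shape (`value` and `encoding` keys), the method name, and the stand-in validator are all assumptions, not the actual DSA report code.

```ruby
# Hypothetical sketch: count invalid date values per encoding using a shared validator.
# `records` is assumed to be an array of hashes like { value: '1997-07', encoding: 'edtf' }.
# `validator` is any callable taking (value, encoding) and returning true or false.
def tally_invalid_dates(records, validator)
  records
    .reject { |record| validator.call(record[:value], record[:encoding]) }
    .group_by { |record| record[:encoding] }
    .transform_values(&:count)
end

# Stand-in validator for demonstration only: accepts four-digit years.
year_only = ->(value, _encoding) { /\A\d{4}\z/.match?(value) }

sample = [
  { value: '1997', encoding: 'edtf' },
  { value: '199x', encoding: 'edtf' },
  { value: 'July', encoding: 'w3cdtf' }
]

counts = tally_invalid_dates(sample, year_only)
```

Because the validator is injected, the report and the model validation cannot drift apart as long as both are given the same callable.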

mjgiarlo commented 2 years ago

@arcadiafalcone Now that this is in the cocina-models gem, when should we hook it up/turn it on? e.g., did you want to take a look at the three bad date reports first? (two of which are still being run now...)

arcadiafalcone commented 2 years ago

@mjgiarlo I'd like to review the reports first. I'll give you the all-clear when it's ok to turn on.