the-human-colossus-foundation / oca-spec

Overlay Capture Architecture Specification
European Union Public License 1.2

Change OCA-spec for RegEx format rules for DateTime #52

Closed carlyh-micb closed 1 month ago

carlyh-micb commented 8 months ago

Suggested change to the OCA specification: give an example of a RegEx format rule and explain how it contrasts with the ISO standard for DateTime format rules. Right now the only example uses the ISO standard, which creates significant issues downstream when you have date attributes that aren't in ISO format. https://github.com/agrifooddatacanada/oca-spec/tree/master/docs/specification#format-overlay

What happens when a data set has a date that isn't in ISO format for DateTime? The data may not be easy to change if it comes from an instrument.

If DateTime formats can only be expressed in ISO notation, then in the data example the date attribute's DataType must be "Text", because the example date cannot be expressed in ISO notation. We would then create format rules for text dates (such as a RegEx for dates with slashes) in the format overlay.
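To make the "Text plus RegEx" workaround concrete, here is a minimal Python sketch of a format rule for dates with slashes. The pattern and the `matches_format` helper are illustrative assumptions, not taken from the OCA spec or from ADC's verification code. The rule only checks the shape of the value and the day and month ranges, so an impossible date such as 31/02/2024 would still pass.

```python
import re

# Hypothetical format rule for a Text attribute holding DD/MM/YYYY dates.
# It validates shape and day/month ranges only, not calendar validity.
DDMMYYYY_SLASH = re.compile(r"(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/[0-9]{4}")


def matches_format(value: str) -> bool:
    """Return True if the whole value matches the DD/MM/YYYY rule."""
    return DDMMYYYY_SLASH.fullmatch(value) is not None
```

With this rule, a value like `25/12/2023` is accepted while the ISO form `2023-12-25` is rejected, which is the mismatch this issue describes.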

In our data verification code at ADC, DateTime is only allowed to be expressed in ISO notation, and it calls a library that converts the ISO notation into a RegEx rule for data verification.
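As a rough illustration of what turning an ISO-style notation into a RegEx rule could look like, the Python sketch below maps a small set of format tokens to patterns. The token table and the `format_to_regex` function are assumptions made for this example; the library actually used in the ADC verification code is not named in this thread and may work differently.

```python
import re

# Assumed token set for this sketch; real ISO 8601 notation covers more cases
# (ordinal dates, week dates, time zones) that are not handled here.
TOKENS = {
    "YYYY": r"[0-9]{4}",
    "MM": r"(?:0[1-9]|1[0-2])",
    "DD": r"(?:0[1-9]|[12][0-9]|3[01])",
    "hh": r"(?:[01][0-9]|2[0-3])",
    "mm": r"[0-5][0-9]",
    "ss": r"[0-5][0-9]",
}


def format_to_regex(fmt: str) -> re.Pattern:
    """Translate a format string such as 'YYYY-MM-DD' into a compiled regex."""
    parts = []
    i = 0
    while i < len(fmt):
        for token, pattern in TOKENS.items():
            if fmt.startswith(token, i):
                parts.append(pattern)
                i += len(token)
                break
        else:
            # Anything that is not a known token is treated as a literal.
            parts.append(re.escape(fmt[i]))
            i += 1
    return re.compile("".join(parts))
```

Under these assumptions, `format_to_regex("YYYY-MM-DD")` accepts `2023-12-25` but not `25/12/2023`, so an instrument-generated date with slashes would fail verification unless its datatype is Text.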

However, this is a pain for users: their schema changes if they first specify the date as Text (which is burned into the capture base) and later convert the data to the ISO standard and switch the datatype to DateTime. At that point the capture base is different.

image

pknowl commented 8 months ago

@mitfik See the highlighted date format in the screenshot above. I would naturally format that as DD/MM/YYYY. However, that is not ISO 8601 compliant. Would it make sense to use a RegEx format in this case? The spec is accurate as it stands. However, we could add a note regarding non-ISO date formats. Your thoughts?

blelump commented 8 months ago

Unfortunately, the format overlay introduced ambiguities that we observe across different use cases. See https://github.com/the-human-colossus-foundation/oca-spec/issues/38 or https://github.com/the-human-colossus-foundation/oca-spec/issues/44 . Formatting covers a broad area of topics and occurs in various contexts. What we have observed so far is that formatting issues occur when presenting or capturing data. Neither is related to semantics; both belong to presentation and/or business requirements. Whether the input date is DD/MM/YYYY or YYYY-MM-DD does not matter from the semantics perspective because the common denominator is the DateTime type, currently (also implicitly) used as ISO8601. Any further formatting (contextual tailoring) required for presentation and/or business requirements is secondary to that and must be addressed differently and separately. It is worth noting that date formatting is also culture-dependent, which adds more complexity.
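One way to picture this separation is to normalise the captured value to ISO 8601 at the boundary and keep the source layout as a capture-side detail. The Python sketch below assumes a hypothetical `to_iso_date` helper and a `%d/%m/%Y` source layout; neither is part of the OCA spec.

```python
from datetime import datetime


def to_iso_date(raw: str, source_format: str = "%d/%m/%Y") -> str:
    """Parse a date in the given source layout and return it as YYYY-MM-DD.

    Raises ValueError if the value does not fit the layout or is not a
    real calendar date.
    """
    return datetime.strptime(raw, source_format).date().isoformat()
```

Here both `to_iso_date("25/12/2023")` and `to_iso_date("2023-12-25", "%Y-%m-%d")` yield the same stored value, so the DateTime semantics stay identical regardless of how the date was captured or will later be displayed.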

Invariants, mentioned in a separate issue and described in more depth here, partially address formatting: for example, the business may state that it only cares about the year, while presentation focuses on cultural differences.

Semantics + invariants + presentation constitute a trio that must always be considered when thinking of digitally stored information.

mitfik commented 1 month ago

The problem is resolved by enforcing ISO 8601, which captures the essence of the structural part of the semantics. The rest, as @blelump pointed out, is addressed on other layers.