microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

revisit storage and validation of temporal data #384

Open turbomam opened 2 years ago

turbomam commented 2 years ago

This is a component of https://github.com/microbiomedata/sample-annotator/issues/90

Background

nmdc-schema has a TimestampValue class, based on the AttributeValue class.

In fact the only real data slot for TimestampValue is the very generic, inherited has_raw_value, whose range is string.

TimestampValue's description does say:

A value that is a timestamp. The range should be ISO-8601

But that is not enforced anywhere in the schema.

Objective

In my understanding, NMDC submitters should be able to enter partial datetimes for things like collection_date. I.e., 2022-08 should be accepted as meaning that the sample was collected some time in August of 2022. The day-of-month is not known, and should not be fudged as 2022-08-01.

Current solution

So we have configured our DH templates to validate values like 2022-08 in slots like collection_date with heavyweight regular expressions like ^[12]\d{3}(?:(?:-(?:0[1-9]|1[0-2]))(?:-(?:0[1-9]|[12]\d|3[01]))?)?$

We also provide examples like "2021-04-15; 2021-04; 2021"

(BTW: https://github.com/GenomicsStandardsConsortium/mixs/issues/446)

You can check those examples at regexr
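The same check can be sketched in Python with the standard `re` module. This is only an illustration of what the pattern above accepts and rejects; the helper name `matches_partial_date` is made up for this example and is not part of nmdc-schema or the DH templates.

```python
import re

# Pattern quoted above: YYYY, YYYY-MM, or YYYY-MM-DD.
PARTIAL_DATE_RE = re.compile(
    r"^[12]\d{3}(?:(?:-(?:0[1-9]|1[0-2]))(?:-(?:0[1-9]|[12]\d|3[01]))?)?$"
)


def matches_partial_date(value: str) -> bool:
    """Hypothetical helper: True if the value has an accepted shape."""
    return PARTIAL_DATE_RE.match(value) is not None


if __name__ == "__main__":
    for candidate in ["2021-04-15", "2021-04", "2021", "2021-13", "2021-4", "2022-02-31"]:
        print(candidate, matches_partial_date(candidate))
```

Note that 2022-02-31 has a valid shape, so the pattern alone accepts it even though that date does not exist.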

(BTW see #385)

Proposed solution

It should be possible to at least validate these has_raw_values of TimestampValues against a proper datetime parser. Most Python datetime parsers will silently fill in the missing datetime parts with 1s. We don't have to use that parsed value, but it should at least parse. I think that will rule out dates that match the regular expression but don't exist, like 2022-02-31.

iso8601 seems to require pretty strict templates, but I think some of these other ones don't
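As a rough sketch of the "it should at least parse" idea, the snippet below uses only the standard-library `datetime.strptime` rather than arrow or pendulum, so it says nothing about how those libraries behave. The function name `is_valid_partial_date` and the simplified shape pattern are assumptions made for this example, not anything defined in nmdc-schema.

```python
import re
from datetime import datetime

# Simplified shape check (an assumption for this sketch): YYYY, YYYY-MM, YYYY-MM-DD.
SHAPE_RE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")

# Pick the strptime format from the length of the raw value.
FORMATS_BY_LENGTH = {4: "%Y", 7: "%Y-%m", 10: "%Y-%m-%d"}


def is_valid_partial_date(raw_value: str) -> bool:
    """Hypothetical validator: the value must have a known shape AND parse."""
    fmt = FORMATS_BY_LENGTH.get(len(raw_value))
    if fmt is None or SHAPE_RE.fullmatch(raw_value) is None:
        return False
    try:
        # The parsed value is discarded; only parse success matters here.
        datetime.strptime(raw_value, fmt)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    # strptime fills the missing day with 1, which is why the parsed
    # value should not be stored in place of the raw value.
    print(datetime.strptime("2022-08", "%Y-%m"))
    for candidate in ["2022-08", "2022", "2024-02-29", "2023-02-29", "2022-02-31"]:
        print(candidate, is_valid_partial_date(candidate))
```

The shape check is kept alongside the parser because `strptime` on its own tolerates inputs such as a non-zero-padded month.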

I'll consult with LinkML colleagues and most likely try arrow and pendulum. Will post conclusions here.

mslarae13 commented 2 years ago

@turbomam what is the to do on this?

ssarrafan commented 2 years ago

@turbomam moving this to Sept but please let me know if you're not actively working on it for the next 2 weeks

ssarrafan commented 2 years ago

Checked in with @turbomam and moving this out of the sprint and adding the backlog label.

mslarae13 commented 1 year ago

@ssarrafan This will start on the sprint from Dec26-6th & has a due date of Jan20th for the submission portal squad. Can you plan to add this to those Sprint boards?

ssarrafan commented 1 year ago

I don't think the next sprint will start till January since LBL is closed for the holidays Dec 23-Jan 2. I can add it to that sprint. Are you planning to work the week between December 26 and January @mslarae13?

mslarae13 commented 1 year ago

> I don't think the next sprint will start till January since LBL is closed for the holidays Dec 23-Jan 2. I can add it to that sprint. Are you planning to work the week between December 26 and January @mslarae13?

I am working that week! PNNL doesn't close :( So it'll be a sprint of 1 ;) but you can just put it in the sprint starting after the 2nd & it'll (hopefully) be done fast :)

mslarae13 commented 1 year ago

@turbomam I think working on this today would be helpful, given the updates I've made to the slots relevant to the soil package. Does the validation still hold, or do we need to add additional validation rules anywhere?

ssarrafan commented 1 year ago

Due date is Jan 20th so moving to next sprint @mslarae13 @turbomam

ssarrafan commented 1 year ago

Looks like this is in the backlog now so I'll remove from the sprint. @mslarae13 if you plan to work on this next sprint let me know. Thanks.