isamplesorg / metadata

Collation of metadata examples and notes for the project
https://isamplesorg.github.io/metadata/

Seemingly futuristic dates in SESAR records #30

Closed dannymandel closed 1 year ago

dannymandel commented 2 years ago

I've been looking at the date formats in the SESAR records, and there are some pretty strange dates in the data, for example:

6198-11-09
5196-08-09
2312
22010-11-29
2069-12-06
"2049-05-19"
"2049-05-18"
"2049"
"2048-10-27"
"2048-10-21"
"2048-10-19"
"2048-09-02"
"2048-09-01"
"2048-08-30"
"2048-08-28"
"2048-08-26"
"2048-08-24"
"2048-08-15"
"2048-08-14"
"2048-08-13"
"2047-09-11"

Any idea on how we should interpret these dates?

dannymandel commented 2 years ago

@datadavev FYI

datadavev commented 2 years ago

Record as provided. Missing days and months default to "1" where necessary to support date types.
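As an illustration of the defaulting rule above, here is a minimal Python sketch, not the actual iSamples code. The helper name `parse_partial_date` is hypothetical, and it assumes values arrive as `YYYY`, `YYYY-MM`, or `YYYY-MM-DD` strings.

```python
from datetime import date


def parse_partial_date(text: str) -> date:
    """Parse 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD'; a missing month or day becomes 1."""
    # Hypothetical helper for illustration only.
    parts = [int(p) for p in text.strip().strip('"').split("-")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unsupported date string: {text!r}")
    # Pad with 1s so that [year] -> (year, 1, 1) and [year, month] -> (year, month, 1).
    year, month, day = (parts + [1, 1])[:3]
    # datetime.date only supports years 1..9999 and raises ValueError otherwise.
    return date(year, month, day)


print(parse_partial_date("2312"))
print(parse_partial_date("6198-11-09"))
```

Note that a value such as 22010-11-29 cannot be stored in a Python `date` at all, because the year exceeds `datetime.MAXYEAR` (9999), so "record as provided" would need a string fallback for such cases.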

ramdeensarah commented 2 years ago

@dannymandel can you verify which fields these dates are coming from? Is it 'collection_start_date' only, or does it include other date related fields (collection_end_date, publish_date, etc.)?

I had assumed these errors were in the publish_date field, but I found a large cluster in 'collection_start_date' that all stem from one submission. I have the source file, and it looks like an ingestion error (11/9/67 converted to 2067-11-08 23:00:00-06): years were given as two digits and the system prefixed them with 20 instead of 19 (in addition to some other wonky stuff).

I will work on cleaning these up but I want to confirm I get everything.

Thanks! Sarah
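The two-digit-year problem described above can be reproduced in Python, whose `strptime` follows the POSIX convention for `%y` (values 69-99 map to 1969-1999, values 0-68 map to 2000-2068). This is only a sketch: the SESAR ingestion system may use different rules, the helper `parse_two_digit_year` and its `not_after` parameter are hypothetical, and the one-hour timezone shift seen in the stored value is not modeled here.

```python
from datetime import date, datetime

# "67" falls in the 0-68 range, so %y resolves it to 2067 rather than 1967.
naive = datetime.strptime("11/9/67", "%m/%d/%y").date()
print(naive)  # 2067-11-09


def parse_two_digit_year(text: str, not_after: date) -> date:
    """Parse M/D/YY and move the result back a century if it lands after `not_after`.

    `not_after` is an assumed upper bound, for example the date the record
    was registered, since a sample cannot be collected after registration.
    """
    parsed = datetime.strptime(text, "%m/%d/%y").date()
    if parsed > not_after:
        # Not handled: replace() raises ValueError for Feb 29 when the
        # earlier year is not a leap year.
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


print(parse_two_digit_year("11/9/67", not_after=date(2010, 1, 1)))  # 1967-11-09
```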

ramdeensarah commented 2 years ago

Following up on this. I have corrected the collection_start_date values for ~1070 records. We have 8 more that need review/correction (including a few records that have issues with collection_end_date values). These may take longer to address, as they are older samples that were ingested before we started rigorous record keeping (pre-2010).

SESAR does not have a policy about including dates in the future for collection_start_date. We define collection_start_date as "Date when the sample was collected" and we do not encourage using it as part of the pre-registration process. We (SESAR) can discuss as a team how to handle pre-registered samples in the future (and include this as part of the batch/curatorial review).

dannymandel commented 2 years ago

@ramdeensarah I'll confirm which field these came from. It looks like I didn't include that information in my notes but it should be easy to look it up.

dannymandel commented 2 years ago

OK, here's an example:

"igsn": "HRV002EG4",
"collectionEndDate": "6198-11-09",
"collectionStartDate": "6198-11-09"

dannymandel commented 2 years ago

Here's another:

"igsn": "ARF00057C",
"collectionStartDate": "2312",

dannymandel commented 2 years ago

"igsn": "DSR00050E",
"collectionStartDate": "2069-12-06",

dannymandel commented 2 years ago

So yeah, it looks like collectionStartDate was the one I was running into.
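A check for this kind of value could look roughly like the following Python sketch. The record layout mirrors the fragments quoted above, but the helpers `date_key` and `is_future` are hypothetical and not part of any iSamples or SESAR code. Values are compared as integer tuples rather than `datetime.date` objects so that out-of-range years such as 22010 can still be flagged.

```python
from datetime import date


def date_key(value: str) -> tuple:
    """Turn 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD' into a (year, month, day) tuple."""
    parts = [int(p) for p in value.strip().strip('"').split("-")]
    # A missing month or day defaults to 1.
    return tuple((parts + [1, 1])[:3])


def is_future(value: str, today: date = None) -> bool:
    """Return True if the date string is later than `today` (default: the current date)."""
    today = today or date.today()
    return date_key(value) > (today.year, today.month, today.day)


# The three examples quoted in this thread.
records = [
    {"igsn": "HRV002EG4", "collectionStartDate": "6198-11-09"},
    {"igsn": "ARF00057C", "collectionStartDate": "2312"},
    {"igsn": "DSR00050E", "collectionStartDate": "2069-12-06"},
]

flagged = [r["igsn"] for r in records if is_future(r["collectionStartDate"])]
print(flagged)
```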

ramdeensarah commented 2 years ago

Got it. Thanks!

For background, we have three public fields with dates: "Release Date", "Collection Start Date", and "Collection End Date". We also have internal administrative fields with dates, but I am not sure whether the APIs have access to those fields, so I wanted to double check.

Summary

Release date: We have a policy to allow "Release Date" to be set 2 years into the future to align with NSF guidelines. I checked, though, and we have about 6700 records with release dates that are more than 2 years into the future. The furthest out is 50 years from its registration date. I need to investigate these further, but we may not be able to resolve them quickly, and we may end up making some exceptions. I will bring it up at the next SESAR team meeting.
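For illustration, the two-year release rule could be checked with something like the Python sketch below. It assumes the two years are measured from the registration date, which the comment above implies but does not state outright. The function name `exceeds_release_policy` and its parameters are hypothetical.

```python
from datetime import date

MAX_RELEASE_YEARS = 2  # policy limit mentioned above


def exceeds_release_policy(registered: date, release: date,
                           max_years: int = MAX_RELEASE_YEARS) -> bool:
    """Return True if `release` is more than `max_years` after `registered`."""
    # Compare as tuples so a Feb 29 registration date needs no special
    # handling when the limit year is not a leap year.
    limit = (registered.year + max_years, registered.month, registered.day)
    return (release.year, release.month, release.day) > limit


# A release date 50 years after registration is well outside the policy.
print(exceeds_release_policy(date(2015, 6, 1), date(2065, 6, 1)))  # True
# A release date exactly 2 years out is still allowed.
print(exceeds_release_policy(date(2015, 6, 1), date(2017, 6, 1)))  # False
```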

Collection Start Date: As noted above, we do not have a policy about using this for pre-registered samples. I will look into creating one. I cleaned up most of the problems and we are down to 10, which were registered before we started record keeping and so are not easily fixed. I will report back after the next SESAR meeting (in 2 weeks). (RLJ000002; HRV001TR9; RLJ000003; HRV002EG4; ARF00057C; BDCUK01HH; RIV000002; JMA000001; GEG000001; JEA0522KR)

Collection End Date: There are 2 with this issue and they also fall into the category above (HRV001TR9; HRV002EG4).

dannymandel commented 2 years ago

In theory these should fix themselves in iSamples, as they'll get their modified dates updated, and we'll just automatically pull in the updated records.