microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

support slot for when document was added #1916

Closed aclum closed 5 months ago

aclum commented 6 months ago

For some classes, like workflow activity records, we have information about when data was processed. I would like similar information about all classes, especially Study, Biosample, OmicsProcessing, and DataObject. We are routinely asked how much data has been added to the data portal on a quarterly or annual basis, and we need to make this easier to answer. I frequently have to go from memory or restore from backups. I believe @eecavanna has had similar experiences. I would like to discuss options, either in the schema or in the runtime, for recording when a document was added to a collection.

@turbomam @dwinston @PeopleMakeCulture @shreddd

PeopleMakeCulture commented 6 months ago

@dwinston and I can update the runtime to support analytics queries on when data was created or updated in Mongo. There are a couple of ways to go about this, one of which would not require updates to nmdc-schema.

@aclum could you:

  1. Give a sense of the urgency of the request, to help us decide on the best approach?
  2. If we decide to move forward with a runtime approach that does not require a new slot in the schema, close this issue and open one in the runtime repo?

Approach Options

1. New created_at attribute/slot

One approach is to add created_at and updated_at fields to individual collections (e.g., Biosample, Study). However, this would cause data validation to fail unless equivalent slots are also added to the nmdc schema.
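
To illustrate the validation concern, here is a minimal sketch in Python. It is not the runtime's or the schema's actual validation code: ALLOWED_BIOSAMPLE_SLOTS is a hypothetical stand-in for whatever slots the schema permits on a class, and stamp and unknown_slots are illustrative helpers.

```python
from datetime import datetime, timezone

# Hypothetical stand-in for the slots a schema class permits.
# This is NOT the real nmdc-schema Biosample definition.
ALLOWED_BIOSAMPLE_SLOTS = {"id", "name"}


def stamp(document, now=None):
    """Return a copy of the document with created_at/updated_at set."""
    now = now or datetime.now(timezone.utc)
    stamped = dict(document)
    # Keep an existing created_at; always refresh updated_at.
    stamped.setdefault("created_at", now.isoformat())
    stamped["updated_at"] = now.isoformat()
    return stamped


def unknown_slots(document, allowed):
    """List the fields a strict validator would reject as unknown."""
    return sorted(set(document) - set(allowed))


doc = stamp({"id": "example:1", "name": "sample"})
print(unknown_slots(doc, ALLOWED_BIOSAMPLE_SLOTS))
```

Under these assumptions, the stamped document is rejected until the two fields are added to the allowed set, which mirrors why this option would need a matching schema change.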

2. New ledger collection

A second approach is to create an append-only ledger of datomic entries. This would allow a broader range of search queries and preserve update histories. However, it would make querying and maintenance more complex.
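
As a rough picture of the ledger idea, here is a minimal in-memory sketch in Python. It is only an illustration under stated assumptions: LedgerEntry, Ledger, and inserts_per_quarter are hypothetical names, and a real implementation would presumably live in its own Mongo collection rather than a Python list.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LedgerEntry:
    """One immutable record of a change (hypothetical shape)."""
    collection: str
    document_id: str
    operation: str  # "insert" or "update"
    recorded_at: datetime


class Ledger:
    """Append-only: entries can be added and read, never edited or removed."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        self._entries.append(entry)

    def entries(self):
        # Return a tuple so callers cannot mutate the stored history.
        return tuple(self._entries)


def quarter_label(moment):
    """Format a datetime as a calendar quarter, e.g. 2024-Q1."""
    return f"{moment.year}-Q{(moment.month - 1) // 3 + 1}"


def inserts_per_quarter(ledger, collection):
    """Count documents added to one collection, grouped by quarter."""
    return Counter(
        quarter_label(e.recorded_at)
        for e in ledger.entries()
        if e.collection == collection and e.operation == "insert"
    )


ledger = Ledger()
ledger.append(LedgerEntry("biosample_set", "a", "insert", datetime(2024, 2, 1, tzinfo=timezone.utc)))
ledger.append(LedgerEntry("biosample_set", "a", "update", datetime(2024, 5, 1, tzinfo=timezone.utc)))
ledger.append(LedgerEntry("biosample_set", "b", "insert", datetime(2024, 5, 2, tzinfo=timezone.utc)))
print(inserts_per_quarter(ledger, "biosample_set"))
```

Because updates are recorded as separate entries instead of overwriting earlier ones, the same ledger can answer both "how much was added this quarter" and "what changed, and when", at the cost of one more collection to query and maintain.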

aclum commented 6 months ago

I was able to get what I needed for the quarterly report from Eric restoring some of the backups so this isn't urgent but I would like to see this addressed this quarter.

eecavanna commented 6 months ago

To elaborate on the previous comment: I temporarily restored two backups into a Mongo server and @aclum then queried those databases to get the information she was interested in. I'm going to delete those temporary restorations now.

PeopleMakeCulture commented 5 months ago

> I was able to get what I needed for the quarterly report from Eric restoring some of the backups so this isn't urgent but I would like to see this addressed this quarter.

Great! That should give @dwinston and me enough time to implement the more robust append-only ledger solution, pending any larger decisions from the 4/25 database discussion.

turbomam commented 5 months ago

Good discussion. Where does this stand as a schema request?

If it is a schema request, how would the slot we're talking about relate to the add_date slot?

aclum commented 5 months ago

The plan is to discuss this at the infrastructure sync today. If no schema development is needed, I'll convert this to an nmdc-runtime issue.

aclum commented 5 months ago

Re add_date: this currently pulls from GOLD, so it is the GOLD added date. It would be good to clarify that at some point.

eecavanna commented 5 months ago

> If it is a schema request, how would the slot we're talking about relate to the add_date slot?

Here's a link to the documentation for the add_date slot: https://microbiomedata.github.io/nmdc-schema/add_date/.

Here are ways that I think I'd want the slot's specification to change if it were going to be used in the way people are talking about here:

shreddd commented 5 months ago

Also consider pulling from the Mongo ObjectId, which encodes a timestamp.

eecavanna commented 5 months ago

I just learned about that option within the past couple days (never knew that)! There is one caveat that I think exists with that option: based on what I read, the timestamp encoded in the ObjectId indicates when the ObjectId was created, not when "the [rest of the] document" was created. So, if we were to restore from a backup and not use the --preserveUUID flag when doing so, I think the ObjectIds would all describe the restoration time, not the original creation time. Note: I haven't confirmed that suspicion through testing yet—it's just something that came to mind when I was reading about the fact that the ObjectId contains a timestamp.

aclum commented 5 months ago

For example, see https://steveridout.com/mongo-object-time/. Closing for now; I will use the syntax db.comments.find({_id: {$gt: ObjectId("5272e0f00000000000000000")}}), where 5272e0f00000000000000000 encodes the target date.
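
For anyone who wants to build such boundary values without the web converter, here is a minimal Python sketch using only the standard library. It relies on the documented ObjectId layout, in which the first 4 bytes are a big-endian count of seconds since the Unix epoch. The function names are illustrative, and the returned hex strings would still need to be wrapped in ObjectId(...) before being sent to Mongo (PyMongo's ObjectId.from_datetime does the equivalent in one call).

```python
from datetime import datetime, timezone


def objectid_hex_floor(moment):
    """Smallest ObjectId hex string whose embedded timestamp equals `moment`."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime to avoid local-time surprises")
    seconds = int(moment.timestamp())
    if not 0 <= seconds < 2**32:
        raise ValueError("timestamp does not fit in the 4-byte ObjectId field")
    # 8 hex digits of timestamp, then 16 zero digits for the remaining 8 bytes.
    return f"{seconds:08x}" + "0" * 16


def objectid_generation_time(hex_id):
    """Decode the timestamp embedded in the first 4 bytes of an ObjectId."""
    return datetime.fromtimestamp(int(hex_id[:8], 16), tz=timezone.utc)


def id_range_filter(start, end):
    """Hex bounds for a half-open [start, end) query on _id.

    Wrap each value in ObjectId(...) when building the real Mongo filter.
    """
    return {"_id": {"$gte": objectid_hex_floor(start), "$lt": objectid_hex_floor(end)}}


print(objectid_generation_time("5272e0f00000000000000000").isoformat())
print(id_range_filter(
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 4, 1, tzinfo=timezone.utc),
))
```

Decoding the example value from the query above gives 2013-10-31T23:00:00 UTC. Note that this only reads the timestamp stored in the _id; whether that matches the time the document was added to the collection depends on how the _id was assigned.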

eecavanna commented 5 months ago

I want to emphasize that the Mongo docs say that the timestamp embedded in the ObjectId indicates the creation time of the ObjectId. I think the author of that converter may be making a "logical leap" by assuming it also indicates the creation time of the [rest of the] document, itself. That's something I haven't tested yet—I just want to reiterate that distinction here.
