Closed aclum closed 5 months ago
@dwinston and I can update the runtime to support analytics queries against when data was created/updated in Mongo. There are a couple of ways to go about this, one of which would not require updates in nmdc-schema.
@aclum could you:
created_at attribute/slot
One approach is to add created_at and updated_at fields for individual collections (e.g. Biosample, Study, etc.). However, this would introduce an issue with data validation if an equivalent slot is not added in the nmdc-schema.
ledger collection
A second approach is to create an append-only ledger of datomic entries. This would allow for a broader range of search queries and preserve update histories. However, this would add complexity to querying and maintenance.
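To make the second approach more concrete, here is a minimal in-memory sketch of an append-only ledger, loosely modeled on Datomic-style entries. It is not the runtime's implementation: the names LedgerEntry, Ledger, record, created_at, updated_at, and created_between are all hypothetical, and a real version would presumably write to a Mongo collection rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, List, Optional


@dataclass(frozen=True)
class LedgerEntry:
    """One immutable fact: an attribute of an entity had this value at this time."""
    entity_id: str          # hypothetical: the document's id, e.g. a Biosample id
    attribute: str
    value: Any
    recorded_at: datetime
    added: bool = True      # False would mark a retraction of the value


class Ledger:
    """Append-only store: entries are only ever appended, never edited or deleted."""

    def __init__(self) -> None:
        self._entries: List[LedgerEntry] = []

    def record(self, entity_id: str, attribute: str, value: Any,
               recorded_at: Optional[datetime] = None, added: bool = True) -> LedgerEntry:
        # Default to "now" in UTC when the caller supplies no timestamp.
        when = recorded_at or datetime.now(timezone.utc)
        entry = LedgerEntry(entity_id, attribute, value, when, added)
        self._entries.append(entry)
        return entry

    def _times(self, entity_id: str) -> List[datetime]:
        return [e.recorded_at for e in self._entries if e.entity_id == entity_id]

    def created_at(self, entity_id: str) -> Optional[datetime]:
        """Earliest entry for the entity, or None if it was never recorded."""
        times = self._times(entity_id)
        return min(times) if times else None

    def updated_at(self, entity_id: str) -> Optional[datetime]:
        """Latest entry for the entity, or None if it was never recorded."""
        times = self._times(entity_id)
        return max(times) if times else None

    def created_between(self, start: datetime, end: datetime) -> List[str]:
        """Sorted ids of entities first recorded in the half-open range [start, end)."""
        ids = {e.entity_id for e in self._entries}
        return sorted(i for i in ids if start <= self.created_at(i) < end)
```

Because created and updated times are derived from the entries, no created_at or updated_at slot has to be added to the documents themselves, which is why this route would not need a schema change. The cost is the extra query logic shown in created_between.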
I was able to get what I needed for the quarterly report from Eric restoring some of the backups so this isn't urgent but I would like to see this addressed this quarter.
To elaborate on the previous comment: I temporarily restored two backups into a Mongo server and @aclum then queried those databases to get the information she was interested in. I'm going to delete those temporary restorations now.
Great! That should give @dwinston and me enough time to implement the more robust append-only ledger solution, pending any larger decisions from the 4/25 database discussion.
Good discussion. Where does this stand as a schema request?
If it is a schema request, how would the slot we're talking about relate to the add_date slot?
The plan is to discuss at the infrastructure sync today. If no schema development is needed I'll convert this to a nmdc-runtime issue.
Re add_date: this currently pulls from GOLD, so it is the GOLD added date. It would be good to clarify that at some point.
Here's a link to the documentation for the add_date slot: https://microbiomedata.github.io/nmdc-schema/add_date/.
Here are ways that I think I'd want the slot's specification to change if it were going to be used in the way people are talking about here:
created_at (add_date, to me, sounds like a function name)
Also consider pulling from Mongo ObjectID, which encodes the timestamp.
I just learned about that option within the past couple days (never knew that)! There is one caveat that I think exists with that option: based on what I read, the timestamp encoded in the ObjectId indicates when the ObjectId was created, not when "the [rest of the] document" was created. So, if we were to restore from a backup and not use the --preserveUUID flag when doing so, I think the ObjectIds would all describe the restoration time, not the original creation time. Note: I haven't confirmed that suspicion through testing yet; it's just something that came to mind when I was reading about the fact that the ObjectId contains a timestamp.
ie https://steveridout.com/mongo-object-time/
Closing for now. For the time being I will use the following syntax, where 5272e0f00000000000000000 is the target date:
db.comments.find({_id: {$gt: ObjectId("5272e0f00000000000000000")}})
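For anyone who wants to build or decode such a boundary value without the web converter, here is a minimal Python sketch. It relies only on the documented layout of an ObjectId, whose first 4 bytes are the seconds since the Unix epoch. The helper names objectid_floor_for and objectid_timestamp are hypothetical. As noted in this thread, the decoded time is when the ObjectId was generated, which is not guaranteed to be when the document was inserted.

```python
from datetime import datetime, timezone


def objectid_floor_for(moment: datetime) -> str:
    """Return the smallest 24-character ObjectId hex string for `moment`.

    `moment` should be timezone-aware; a naive datetime would be
    interpreted as local time by datetime.timestamp().
    """
    seconds = int(moment.timestamp())
    # 8 hex characters of timestamp, then 16 zeros for the remaining 8 bytes.
    return f"{seconds:08x}" + "0" * 16


def objectid_timestamp(hex_id: str) -> datetime:
    """Decode the UTC timestamp embedded in a 24-character ObjectId hex string."""
    seconds = int(hex_id[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# The value used in the query above decodes to 2013-10-31 23:00:00 UTC.
boundary = objectid_timestamp("5272e0f00000000000000000")
```

The string returned by objectid_floor_for can be passed to ObjectId(...) in a $gt or $gte filter on _id, as in the query above.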
I want to emphasize that the Mongo docs say that the timestamp embedded in the ObjectId indicates the creation time of the ObjectId. I think the author of that converter may be making a "logical leap" by assuming it also indicates the creation time of the [rest of the] document, itself. That's something I haven't tested yet—I just want to reiterate that distinction here.
For some classes, like workflow activity records, we have information about when data was processed. I would like similar information for all classes, especially Study, Biosample, OmicsProcessing, and DataObject. We are routinely asked how much data has been added to the data portal on a quarterly or annual basis, and we need to make this easier to answer. I frequently have to go off memory or restore from backups. I believe @eecavanna has had similar experiences. I would like to discuss options, either in the schema or in the runtime, for adding information about when a document was added to a collection.
@turbomam @dwinston @PeopleMakeCulture @shreddd