ivmfnal / metacat

Metadata Catalog
BSD 3-Clause "New" or "Revised" License

Add modification metadata (user/time) to datasets #35

Closed hschellman closed 1 year ago

hschellman commented 1 year ago

MetaCat files have modification metadata (user and time) as well as creation metadata. Can that feature be added for datasets, especially since we will be modifying them during production?

ivmfnal commented 1 year ago

How exactly are you planning to modify datasets?

hschellman commented 1 year ago

Mainly by adding files.

One sets up a dataset early in production. As more files with the same characteristics come in, the dataset needs to be updated. If one automates this process, knowing that it has been done, and by whom, is very useful.

To some extent I’m requesting that datasets share many of the global attributes of files (status, updates, owners …)

One could of course implement this using the metadata fields, but using the same MetaCat structure that files have makes it more consistent.

ivmfnal commented 1 year ago

What do you mean by "as more files ... come in, the dataset needs to be updated"? I am trying to understand what you mean. Does a single file addition count as an "update", or do you mean some separate action that constitutes a dataset update? Do you mean an update of the dataset metadata?

So if we record a dataset update user and timestamp, what exactly are those the initiator and the timestamp of?

ivmfnal commented 1 year ago

Also, just to remind you, the following dataset attributes are available for queries: https://metacat.readthedocs.io/en/latest/mql.html#file-dataset-attributes

hschellman commented 1 year ago

Yes, I use those and they work, and I can in fact construct a URL that runs the query. But the query does not provide information such as the number of files, which the dataset list page does show.

ivmfnal commented 1 year ago

Because that is not a dataset attribute.

hschellman commented 1 year ago

Here I mean adding files to the dataset. Since datasets are static, something has to do that addition, and it would be good to know what.

ivmfnal commented 1 year ago

To get the file count, use the API function:

   get_dataset(did=None, namespace=None, name=None, exact_file_count=False)

Keep in mind, though, that getting the file count will take time proportional to the number of files in the dataset.
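
A minimal sketch of the call above, not a definitive recipe. The `get_dataset` signature is taken from the comment. The assumptions are that `client` is an already connected MetaCat client object, that the call returns a dict (or `None` for an unknown dataset), and that the count is stored under a `file_count` key. The helper name `dataset_file_count` is made up for illustration.

```python
from typing import Any, Optional


def dataset_file_count(client: Any, did: str) -> Optional[int]:
    """Return the exact number of files in a dataset, or None if unavailable.

    `client` is assumed to expose get_dataset() with the signature quoted
    above. The "file_count" key is an assumption about the returned dict.
    """
    # exact_file_count=True asks the server to count the files, which takes
    # time proportional to the dataset size, so call it sparingly.
    info = client.get_dataset(did=did, exact_file_count=True)
    if info is None:
        return None
    return info.get("file_count")
```

With a real client this would be called as `dataset_file_count(client, "my_namespace:my_dataset")`, where the DID is a placeholder.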

ivmfnal commented 1 year ago

I guess I do not see much point in knowing what added the last file to a dataset, and when, without knowing the complete history of additions/removals. Maintaining such a history would be very expensive, and I would like to see a good supporting use case for it.

hschellman commented 1 year ago

Most recent change is useful info. I agree that full history would not be useful.

ivmfnal commented 1 year ago

Here are the changes that can be made to a dataset (not all of them are necessarily implemented as of now):

So which of those do you think should be reflected in the update time/author?

ivmfnal commented 1 year ago

I am trying to make the point that, if you are lucky, the change you are interested in will be the most recent one. But there is a good chance that the change you are looking for is no longer the most recent one, and therefore the information is gone.

ivmfnal commented 1 year ago

If we want this feature to be useful without maintaining the whole modification history, we need to narrow the list of qualifying events as much as possible and include only rare and significant events. I would suggest a change of metadata, because that is supposed to be a very rare action and yet it can be significant.
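
The idea of keeping only the most recent qualifying event, instead of a full history, can be sketched roughly as below. This is an illustrative in-memory model, not how MetaCat implements it: the class `DatasetRecord`, the event names, and the field names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical event names; only rare, significant events qualify.
QUALIFYING_EVENTS = {"metadata_changed"}


@dataclass
class DatasetRecord:
    name: str
    updated_by: Optional[str] = None
    updated_timestamp: Optional[float] = None

    def record_event(self, event: str, user: str, timestamp: float) -> bool:
        """Overwrite the last-update fields only for qualifying events."""
        if event not in QUALIFYING_EVENTS:
            return False  # frequent events (e.g. file additions) are ignored
        self.updated_by = user
        self.updated_timestamp = timestamp
        return True


ds = DatasetRecord("example")
ds.record_event("metadata_changed", "alice", 100.0)
ds.record_event("file_added", "bot", 200.0)  # does not overwrite alice's update
```

Because frequent events are filtered out, the single stored user and timestamp are less likely to be overwritten by a change nobody cares about.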

ivmfnal commented 1 year ago

FYI: for files, qualified updates are:

As you can see, file parentage, size and checksums are not supposed to ever change. Metadata can be changed, but only on special events such as changes of the metadata namespace or error corrections.

ivmfnal commented 1 year ago

Done. Please upgrade your client to 3.40.0. A dataset now has two new attributes, updated_by and updated_timestamp, accessible via the API and the UI. The following events will trigger changes in these attributes:

hschellman commented 1 year ago

Thanks, this is good. Hopefully we don’t change those items frequently.
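
For readers who want to inspect the new attributes from Python, here is a hedged sketch. The attribute names `updated_by` and `updated_timestamp` come from the comment above. That they appear as keys of the dataset dict returned by the client, and that the timestamp is in epoch seconds, are assumptions. The helper name `describe_last_update` is made up.

```python
from datetime import datetime, timezone
from typing import Any, Mapping, Optional


def describe_last_update(dataset: Mapping[str, Any]) -> Optional[str]:
    """Format the last qualifying update of a dataset, if one was recorded."""
    updated_by = dataset.get("updated_by")
    ts = dataset.get("updated_timestamp")
    if updated_by is None or ts is None:
        return None  # dataset has not been updated since creation
    # Assumption: the timestamp is expressed in seconds since the Unix epoch.
    when = datetime.fromtimestamp(float(ts), tz=timezone.utc)
    return f"last updated by {updated_by} at {when.isoformat()}"
```

The input would typically be the dict obtained from `get_dataset(...)`; a dataset without these keys simply yields `None`.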
