ivmfnal / metacat

Metadata Catalog
BSD 3-Clause "New" or "Revised" License

Maintain list of categories - and can one add descriptions to them #30

Closed hschellman closed 1 year ago

hschellman commented 1 year ago

It is very useful to keep track of the list of categories (in samweb these were values and parameters).

Can we get functionality similar to samweb list-categories?

And in an ideal world, having the ability to add a description to those categories would be very useful.

The use case is checking what categories already exist before creating new ones.

It would also be good to restrict the ability to create new ones on the main tree to admins, with user.X.Y as an exception.

ivmfnal commented 1 year ago

We already have a list of categories (https://metacat.readthedocs.io/en/latest/ui.html#parameter-categories). There is also already a way to restrict the creation of new categories and of metadata parameters within a category.

ivmfnal commented 1 year ago

We need to clearly distinguish between categories and parameters because they are different things.

hschellman commented 1 year ago

What end users really need is a list of categories but, more importantly, a list of category+parameter combinations, i.e. all possible query fields that exist.

For example, metacat category list --all, which would include all possible fields, each being a combination of category + parameter.

ahiguera-mx commented 1 year ago

I would suggest functionality like metacat category show info where, in addition to the current info, it returns a list of parameters (or subcategories), as it does for constraints:

metacat category show info
Path:             info
Description:      None
Owner user:       admin
Owner role:       
Creator:          admin
Created at:       2022-02-22 14:36:53 UTC
Restricted:       no
Constraints:
  counter                                         int [0 - ]
  done                                        boolean
  odds                                            int (1, 3, 5, 7)
  pi                                            float [3.0 - 4.0]
  word                                           text ~ '[A-Z].*'

ivmfnal commented 1 year ago

@ahiguera-mx: This is not exactly how categories work. An unrestricted category can have any parameters under it. In other words, a category does not have a list of parameters under it. It has a list of constraints, which is already shown by the UI.

For a restricted category, the constraints actually define the list of allowed parameters.
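
To illustrate the distinction between restricted and unrestricted categories, here is a minimal Python sketch. It is not MetaCat's implementation: the CONSTRAINTS dictionary layout and the check_metadata function are hypothetical, modeled only on the constraint listing shown for the info category above, and anchoring the text pattern at the start of the value is an assumption.

```python
import re

# Hypothetical in-memory form of the constraints listed for the "info"
# category earlier in this thread; the dict layout is an assumption.
CONSTRAINTS = {
    "counter": {"type": int, "min": 0},
    "done": {"type": bool},
    "odds": {"type": int, "values": (1, 3, 5, 7)},
    "pi": {"type": float, "min": 3.0, "max": 4.0},
    "word": {"type": str, "pattern": r"[A-Z].*"},
}


def check_metadata(category, metadata, constraints, restricted):
    """Return a list of problems found for keys of the form '<category>.<name>'."""
    problems = []
    prefix = category + "."
    for key, value in metadata.items():
        if not key.startswith(prefix):
            continue  # belongs to some other category
        rule = constraints.get(key[len(prefix):])
        if rule is None:
            # No constraint: fine in an unrestricted category,
            # rejected in a restricted one.
            if restricted:
                problems.append(f"{key}: not allowed in restricted category")
            continue
        expected = rule["type"]
        # bool is a subclass of int in Python, so exclude it for non-bool types.
        if not isinstance(value, expected) or (
            expected is not bool and isinstance(value, bool)
        ):
            problems.append(f"{key}: expected {expected.__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            problems.append(f"{key}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{key}: above maximum {rule['max']}")
        if "values" in rule and value not in rule["values"]:
            problems.append(f"{key}: not one of {rule['values']}")
        if "pattern" in rule and not re.match(rule["pattern"], value):
            problems.append(f"{key}: does not match {rule['pattern']!r}")
    return problems
```

With restricted=False, a key such as info.extra passes because it has no constraint; with restricted=True the same key is reported, since only the constrained names are allowed.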

ivmfnal commented 1 year ago

I implemented the requested functionality, e.g.:

$ metacat query -s keys files from dc4:dc4      # list all the top-level metadata keys for files selected by a query
$ metacat query -s keys files                   # scans the entire database; will take a very long time

hschellman commented 1 year ago

$ which metacat
/cvmfs/dune.opensciencegrid.org/products/dune/metacat/v3_29_0/NULL/bin/metacat
$ metacat query -s keys files from dc4:dc4
MQLSyntaxError: No terminal matches 'k' in the current parser context, at line 1 col 1

keys files from dc4:dc4
^
Expected one of:
        * FILES
        * JOIN
        * QUERY
        * FIDS
        * DATASETS
        * __ANON_0
        * UNION
        * FILE
        * PARENTS
        * LBRACE
        * LSQB
        * FILTER
        * CHILDREN
        * LPAR

hschellman commented 1 year ago

It turns out the version in UPS is 3.29 and the version on GitHub is 3.35.

It works with 3.35.

hschellman commented 1 year ago

But can we save time in the long run by keeping an internal table that just holds the list and is added to whenever someone adds a parameter? It will be very bad if people keep running a full scan every time they want to add a parameter.

imandr commented 1 year ago

I do not think it is worth the effort, or the cumulative performance loss from checking every file being declared for new parameters.

Plus, what do we do if someone, say, renames some parameters by updating the metadata for multiple files?

People should not need to run a full database scan. At least, I cannot come up with a use case for that. I think in most cases limiting the scan to a dataset should be sufficient.

hschellman commented 1 year ago

There will not be new parameters in every file: new values of a parameter, yes, but not new parameters themselves. The idea is to keep users from inventing new parameters that duplicate an existing parameter in the DUNE instance.

We already have multiple examples of this. People don’t realize a parameter already exists and redefine it, slightly differently.

So I want to repeat the request that we have an official list of category/parameters that people should be expected to scan before creating new ones.

Look at the list I found. (Attached) there are multiple duplicates already.
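
As an illustration of how near-duplicate names in such a list could be flagged, here is a small Python sketch. The normalization rule (ignore case and separators other than the dot) and the example parameter names are assumptions chosen for illustration; they are not taken from the attached list.

```python
from collections import defaultdict


def normalize(name):
    """Lower-case the name and drop separators other than '.', so that
    spelling variants of the same parameter collapse to one string."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch == ".")


def near_duplicates(names):
    """Group names that differ only by case or separators."""
    groups = defaultdict(set)
    for name in names:
        groups[normalize(name)].add(name)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


# Hypothetical parameter names, for illustration only.
example = ["core.run_type", "core.runtype", "core.RunType", "core.data_tier"]
print(near_duplicates(example))
```

This only catches spelling variants; parameters with different names but the same meaning would still need a human review of the list.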

imandr commented 1 year ago

Not every file will have new parameters, but every file will need to be checked for new parameters.

I understand the idea. I do not see a way to implement this idea without performance penalty. And I still do not see the use case for this functionality.

I think that official list of categories and parameters can be implemented much more easily by a combination of:

imandr commented 1 year ago

Also, I think DUNE should start adapting to the change in the philosophy between SAM and Rucio/MetaCat. Unlike SAM, in Rucio/MetaCat, files are organized into namespaces and datasets and the meaning of metadata and metadata namespace can be different between namespaces/datasets as those groups of files could be created by different groups of people for different purposes using different metadata conventions.

That is why scanning the entire metadata database for a list of parameter names does not look like a meaningful function.

Isn't this why you use some sort of SAM metadata query to come up with the list of meaningful parameter names?

hschellman commented 1 year ago

We want these to be consistent across the experiment. We really do not want different epochs and detectors (which may well have different namespaces) to have different definitions of terms. On D0, we kept control over this by requiring admin privilege to add any parameter. But that does require the ability to know the list.

You have designed something that is elegant and flexible. I want to make it so people can’t use that flexibility to cause chaos. In particular, there need to be categories that people cannot just add a parameter to, and we need to be able to generate a list of those parameters (and be responsible for documenting them).

I am suggesting that we do one scan to capture the parameters, and that all future additions of parameters (at least in reserved categories) be added to that list.
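
A minimal Python sketch of this proposal, under stated assumptions: the ParameterRegistry class, its method names, and the rule that a key's category is everything before the last dot are hypothetical, and the sketch keeps the list in memory rather than in a database table. It says nothing about the cost of doing the same check inside the real server.

```python
class ParameterRegistry:
    """Hypothetical in-memory list of known '<category>.<name>' parameter keys."""

    def __init__(self, scanned_keys, reserved_categories):
        self.known = set(scanned_keys)            # result of the one-time scan
        self.reserved = set(reserved_categories)  # categories only admins may extend

    def declare(self, metadata, is_admin=False):
        """Check one file's metadata keys and record any new ones.

        Returns the sorted list of newly added keys. Raises PermissionError,
        without recording anything, if a non-admin introduces a new key in a
        reserved category.
        """
        new_keys = sorted(set(metadata) - self.known)
        for key in new_keys:
            category = key.rpartition(".")[0]
            if category in self.reserved and not is_admin:
                raise PermissionError(f"new parameter {key!r} needs admin approval")
        self.known.update(new_keys)
        return new_keys
```

The per-file check here is a set difference against the known keys, so files that only carry existing parameters add nothing to the list.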

Page 149 of the CDR says (emphasis added):

SAM also allows definition of free-form “parameters” as they are needed within each experiment’s instance. This allows the schema to be modified easily as needs arise. Unfortunately, a major problem is that it is not possible to request a list of the values for a given parameter and there is little protection against typographical errors in parameter names or their values. This has, over time, led to considerable chaos. The new MetaCat extends these concepts to make them easier to search and maintain.

I wrote this, not understanding that the parameter listing functionality had not been carried over.

I hate to be super-insistent but having run major physics groups on multiple experiments, I know what users can and will do and the chaos that results.

Heidi

imandr commented 1 year ago

I think a once-per-month scan can be done by the collaboration as a cron job, with the results published somewhere on the web.
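
A rough Python sketch of what such a periodic job might look like, limited to named datasets rather than the whole database. The command line is the one quoted earlier in this thread; everything else is an assumption: the function names are hypothetical, and in particular the sketch assumes the command prints one key per line, which should be checked against the real output.

```python
import subprocess
from pathlib import Path


def scan_keys(dataset, run=subprocess.run):
    """Return the set of metadata keys reported for one dataset.

    Assumption: `metacat query -s keys ...` prints one key per line.
    """
    cmd = ["metacat", "query", "-s", "keys", "files", "from", dataset]
    result = run(cmd, capture_output=True, text=True, check=True)
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}


def publish(datasets, out_path, run=subprocess.run):
    """Merge the keys of several datasets and write them to a text file."""
    keys = set()
    for dataset in datasets:
        keys |= scan_keys(dataset, run=run)
    Path(out_path).write_text("\n".join(sorted(keys)) + "\n")
    return sorted(keys)
```

A cron entry could then call publish with the list of datasets the collaboration considers authoritative and a path served by a web server.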

hschellman commented 1 year ago

Checking a new file for valid parameters is much easier if you are checking against a reasonably short list….

I would say there should be categories that require admin and checking and other categories that are not checked and not logged.

SAM has no trouble doing this with minimal load.

imandr commented 1 year ago

It is much easier not to check against such a list. Plus, this does not solve the problem of someone renaming a parameter for all the files to something else.

hschellman commented 1 year ago

That sounds like a reasonable solution.

One thing Andrew McNab brought up was that if one did create such a list, one could assign tokens to parameters and then index by token instead of by string.
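
A minimal Python sketch of the token idea, assuming tokens are simply small integers handed out in order of first appearance. The ParameterTokens class and its methods are hypothetical and only illustrate the mapping; how a database would index on such tokens is not shown.

```python
class ParameterTokens:
    """Hypothetical two-way mapping between parameter names and integer tokens."""

    def __init__(self):
        self._token_of = {}  # name -> token
        self._name_of = []   # token -> name (token is the list index)

    def token(self, name):
        """Return the token for a name, assigning the next free one if needed."""
        if name not in self._token_of:
            self._token_of[name] = len(self._name_of)
            self._name_of.append(name)
        return self._token_of[name]

    def encode(self, metadata):
        """Replace string keys with their tokens."""
        return {self.token(name): value for name, value in metadata.items()}

    def decode(self, encoded):
        """Replace tokens with the original string keys."""
        return {self._name_of[tok]: value for tok, value in encoded.items()}
```

Because every name must pass through token() before it can be stored, the mapping doubles as the list of all parameter names in use.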

imandr commented 1 year ago

Anyway, I would like to see a description of a use case in which someone needs to scan the entire database on a regular and frequent basis and requires a fast response. Once we have such a description, I will be able to propose one or more solutions to choose from.

hschellman commented 1 year ago

Generally it’s not about scanning the whole database; it’s about asking for the list of possible parameters, which, if it were stored and updated whenever a new parameter key arose, would not require a scan.

I’m not certain we’re even talking about the same thing. We have about 100 of these things so far. We don’t really want that many more.
