pkiraly / qa-catalogue

QA catalogue – a metadata quality assessment tool for library catalogue records (MARC, PICA)
GNU General Public License v3.0
79 stars 17 forks source link

Problem with analysis specific command line parameters #500

Open pkiraly opened 2 months ago

pkiraly commented 2 months ago

With the current approach of CLI scripts there is a problem. In the specific analysis shell scripts (such as ./validate) we list all the allowed parameters that are list as short and long options. But when we run commands such as "all" we put every parameters into a TYPE_PARAMS variable. When validate parses the parameters with getopt it removes the unknown parameter names, but keeps their parameters.

For example:

./validate --valid_param1 --valid_param2 value2 --unknown_param3 value3 file1 file2  

the parameters (@$) are

--valid_param1 --valid_param2 value2 --unknown_param3 value3 file1 file2  

and after getops parsing they become

--valid_param1 --valid_param2 value2 -- value3 file1 file2

--unknown_param3 is removed, but its parameter value3 remains there!

We have two types of parameters: i) those that have parameter values ii) those that haven't. It is not a problem for tpye ii, but it moves the parameter values of the unknown parameters into the file list, and it cases errors.

How to solve this issue?

I could imagine the following solutions:

  1. creating a filter that uses regex to remove parameters that are not specific to the particular tasks. To generate such regexes is not impossible, but require lots of efforts, and its maintenance might also be problematic. ./common-script does it in a limited fashion.
  2. list all parameters in getopt option lists, but when processing them we simply skip those that are not relevant for the given task. The problem is the same as with 1): too much work, maintenance problems (if we add a parameter for validate, we should create a case for that in every other command line scripts)
  3. introduce analysis specific parameters, such as VALIDATE_PARAMS, COMPLETENESS_PARAMS etc., and in the ./common-script and in ./qa-catalogue we handle those. It is the most elegant solution, but not backward compatible, the current scripts should be revised to check the usage of TYPE_PARAMS (those calls ./common-script directly) and --params (those calls ./qa-catalogue).
  4. Do not use getops, allow every parameters, but handle the unknown parameters on Java side (I am not sure if it is possible).

@nichtich What do you think?

nichtich commented 2 months ago

I think configuration files (JSON or YAML...) should be used instead of options.

pkiraly commented 2 months ago

I think configuration files (JSON or YAML...) should be used instead of options.

Yes, it would be elegant, but it requires to rewrite a number of things - both in the scripts and in Java side (the input for Java would then be a file name, then we should parse the file marshal it to Java objects, manage the irrelevant parameters etc. What about the documentation of the parameters on script levels? Validation on script levels? Some of the questions I think of now.

But either we use options or config file, I think the first step would be the separation of general and specific parameters. So for K10plus it would look like this

general:
 - schemaType: PICA
 - marcFormat: PICA_NORMALIZED
 - emptyLargeCollectors: true
 - groupBy: 001@\$0
 - groupListFile: src/main/resources/k10plus-libraries-by-unique-iln.txt
 - ignorableFields 001@,001E,001L,001U,001U,001X,001X,002V,003C,003G,003Z,008G,017N,020F,027D,031B,037I,039V,042@,046G,046T,101@,101E,101U,102D,201E,201U,202D,1...,2...
 - allowableRecords: '002@.0 !~ "^L" && 002@.0 !~ "^..[iktN]" && (002@.0 !~ "^.v" || 021A.a?)'
 - solrForScoresUrl: http://localhost:8983/solr/k10plus_pica_grouped_validation
index:
 - indexWithTokenizedField: true
 - indexFieldCounts: true

I need to think over this option.

nichtich commented 2 months ago

The current configuration format is a set of parameters passed to script qa-catalogue via environment variables, command line options, a from a .env file. The core problem of this issue is additional configuration passed with -p/--params/TYPE_PARAMS as arbitrary strings, later parsed again as parameters of individual analysis scripts. This double-parsing of configuration values should be eleminated. Some ideas:

As we are free to choose, I'd prefer TOML. Here is an example:

# general configuration
# can be overridden by same name arguments to qa-catalogue script
# can also be read from .env file (this should then be removed when proper configuration files exist)

name = "MyCat"
mask = "*.dat"
schema = "PICA"

# some general configuration is currently set by --params
ignorableFields = ["001@","001E","001L","001U","001U","001X","001X","002V","003C","003G","003Z",
                   "008G","017N","020F","027D","031B","037I","039V","042@","046G","046T",
                   "101@","101E","101U","102D","201E","201U","202D","1...","2..."]

# additional configuration for individual analysis tasks
[validate]
indexWithTokenizedField = true
indexFieldCounts = true

In any case it's important to keep the configuration names used by qa-catalogue, for instance it's schema , not schemaType! See qa-catalogue --help for a list.

pkiraly commented 2 months ago

@nichtich I like this kind of configuration. The only open question I see now is the following: there is a list of TOML parsers here: https://github.com/toml-lang/toml/wiki, and it includes bash parser as well: stoml. So we should include it as a dependency. I do not see linux package that you can easily install, we should fetch it from the Github project page's release section.