pkiraly opened 3 months ago
I think configuration files (JSON or YAML...) should be used instead of options.
Yes, it would be elegant, but it requires rewriting a number of things, both in the scripts and on the Java side (the input for Java would then be a file name; we would have to parse the file, marshal it into Java objects, manage the irrelevant parameters, etc.). What about the documentation of the parameters at the script level? Validation at the script level? These are some of the questions that come to mind.
But whether we use options or a config file, I think the first step would be the separation of general and specific parameters. For K10plus it would look like this:
```yaml
general:
  - schemaType: PICA
  - marcFormat: PICA_NORMALIZED
  - emptyLargeCollectors: true
  - groupBy: 001@\$0
  - groupListFile: src/main/resources/k10plus-libraries-by-unique-iln.txt
  - ignorableFields: 001@,001E,001L,001U,001U,001X,001X,002V,003C,003G,003Z,008G,017N,020F,027D,031B,037I,039V,042@,046G,046T,101@,101E,101U,102D,201E,201U,202D,1...,2...
  - allowableRecords: '002@.0 !~ "^L" && 002@.0 !~ "^..[iktN]" && (002@.0 !~ "^.v" || 021A.a?)'
  - solrForScoresUrl: http://localhost:8983/solr/k10plus_pica_grouped_validation
index:
  - indexWithTokenizedField: true
  - indexFieldCounts: true
```
I need to think over this option.
The current configuration format is a set of parameters passed to the script `qa-catalogue` via environment variables, command line options, and from a `.env` file. The core problem of this issue is additional configuration passed with `-p`/`--params`/`TYPE_PARAMS` as arbitrary strings, later parsed again as parameters of individual analysis scripts. This double-parsing of configuration values should be eliminated. Some ideas:

- Move parsing of additional parameters to `qa-catalogue` and use a calling syntax like `-p key1:value -p key2:value ...`, as in other software, e.g. pandoc metadata options. Disadvantage: such a repeated option `-p` cannot be repeated in a `.env` config file.
- Use another config file format (e.g. YAML or TOML). Disadvantage: we also need to change `qa-catalogue` so we don't end up with two independent config file formats. It must be possible to put all configuration in one file. I think this is doable.
As we are free to choose, I'd prefer TOML. Here is an example:
```toml
# general configuration
# can be overridden by same name arguments to qa-catalogue script
# can also be read from .env file (this should then be removed when proper configuration files exist)
name = "MyCat"
mask = "*.dat"
schema = "PICA"

# some general configuration is currently set by --params
ignorableFields = ["001@","001E","001L","001U","001U","001X","001X","002V","003C","003G","003Z",
                   "008G","017N","020F","027D","031B","037I","039V","042@","046G","046T",
                   "101@","101E","101U","102D","201E","201U","202D","1...","2..."]

# additional configuration for individual analysis tasks
[validate]
indexWithTokenizedField = true
indexFieldCounts = true
```
In any case it's important to keep the configuration names used by `qa-catalogue`; for instance it's `schema`, not `schemaType`! See `qa-catalogue --help` for a list.
@nichtich I like this kind of configuration. The only open question I see now is the following: there is a list of TOML parsers here: https://github.com/toml-lang/toml/wiki, and it includes a Bash parser as well: stoml. So we would have to include it as a dependency. I do not see a Linux package that can be easily installed; we would have to fetch it from the GitHub project page's release section.
@nichtich I found an alternative and still reliable option that does not introduce new technology into the existing stack. It requires three steps:
1. A new CLI that exports the `Options` objects in the `*Parameter` classes to a structured JSON. The JSON looks something like this:
```json
{
  "common": [
    {"short": "n", "long": "name", "hasArg": true},
    {"short": "x", "long": "marcxml", "hasArg": false},
    ... // other parameters
  ],
  "validate": [
    {"short": "R", "long": "format", "hasArg": true},
    {"short": "W", "long": "emptyLargeCollectors", "hasArg": false}
    ... // other parameters
  ],
  "completeness": [
    {"short": "R", "long": "format", "hasArg": true},
    {"short": "V", "long": "advanced", "hasArg": false}
    ... // other parameters
  ],
  ... // other analyses
}
```
2. A PHP script normalizes the incoming parameters. It gets the name of the analysis to run and all the parameters. Based on the parameter definitions of the analyses, it filters out parameters that are not used by the current analysis. For example:

```shell
php parameter-filter.php validate --name KBR --emptyLargeCollectors --advanced
```

returns

```
--name KBR --emptyLargeCollectors
```

because `--advanced` is not a valid parameter for validate.
3. This script is built into the `common-script` to preprocess parameters before calling the actual analysis. It will look something like this:

```shell
PARAMS=$(php parameter-filter.php validate ${TYPE_PARAMS})
./validate ${GENERAL_PARAMS} ${OUTPUT_PARAMS} ${PARAMS} ${MARC_DIR}/$MASK 2> ${LOG_DIR}/validate.log
```
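As a side note, the long option names could also be pulled out of such a JSON export with plain shell tools, without PHP or jq. A crude, hypothetical sketch (the file name `params.json` and its content are illustrative, and the grep/sed pair deliberately ignores the per-analysis grouping that a real consumer would respect):

```shell
# Hypothetical sketch: list the long option names found in the exported JSON.
# params.json is an assumed file name; content is an illustrative fragment.
cat > params.json <<'EOF'
{"validate": [{"short": "R", "long": "format", "hasArg": true},
              {"short": "W", "long": "emptyLargeCollectors", "hasArg": false}]}
EOF

# grep isolates each "long" entry; sed turns it into a --option token
grep -o '"long": *"[^"]*"' params.json \
  | sed 's/.*"\([^"]*\)"$/--\1/'
# prints:
# --format
# --emptyLargeCollectors
```

This is only a stopgap; proper JSON parsing (in Java or PHP) remains the robust option.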
I don't understand why you did not use Java directly instead of Java => JSON => PHP. I'd prefer to entirely replace PHP in the backend (the existing scripts use the language for simple tasks only), as we already have three more languages (95% Java, 2% Bash, 2% R). But in the end it works, so thanks!
@nichtich You are right, I should have created it in Java. But I had something else in mind that I did not write down: the JSON is also intended to be reused in the web UI to interpret the parameters at the bottom of the analysis pages.
With the current approach of CLI scripts there is a problem. In the specific analysis shell scripts (such as `./validate`) we list all the allowed parameters as short and long options. But when we run commands such as `all`, we put every parameter into a `TYPE_PARAMS` variable. When `validate` parses the parameters with `getopt`, it removes the unknown parameter names but keeps their values. For example, if the parameters (`$@`) contain `--unknown_param3 value3`, then after `getopt` parsing `--unknown_param3` is removed, but its value `value3` remains!

We have two types of parameters: i) those that have values, and ii) those that haven't. This is not a problem for type ii, but for type i it moves the values of the unknown parameters into the file list, and that causes errors.
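The behavior can be reproduced with util-linux `getopt` directly. A small demonstration (option names are the illustrative ones from above; `-q` only suppresses the error message for the unknown option):

```shell
# Demonstration (assuming util-linux getopt): an unknown long option is
# dropped, but its value survives among the positional (file) arguments.
PARSED=$(getopt -q -o n: -l name:,emptyLargeCollectors \
  -- --name KBR --unknown_param3 value3 file.mrc)
echo "$PARSED"
# --unknown_param3 is gone, but value3 appears after the -- separator,
# next to file.mrc, as if it were an input file
```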
How to solve this issue?
I could imagine the following solutions:

- `./common-script` does it in a limited fashion.
- In `./common-script` and in `./qa-catalogue` we handle those. It is the most elegant solution, but not backward compatible: the current scripts should be revised to check the usage of `TYPE_PARAMS` (those that call `./common-script` directly) and `--params` (those that call `./qa-catalogue`).
- In `getopt`, allow every parameter, but handle the unknown parameters on the Java side (I am not sure if it is possible).

@nichtich What do you think?
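The filtering idea could be sketched in the shell like this (the function name `filter_params` and the option lists are hypothetical; a real version would read the known options from the exported JSON instead of hard-coding them):

```shell
# Hypothetical sketch: drop unknown long options (and their values) against
# a whitelist before the analysis script parses the rest with getopt.
KNOWN="--name --emptyLargeCollectors"   # options valid for this analysis (illustrative)
WITH_ARG="--name"                       # the subset that takes a value

filter_params() {
  out=""
  while [ $# -gt 0 ]; do
    case " $KNOWN " in
      *" $1 "*)
        out="$out $1"
        # keep the value of a known option that takes one
        case " $WITH_ARG " in *" $1 "*) shift; out="$out $1" ;; esac
        ;;
      *)
        # unknown option: also skip its value if the next token is not an option
        if [ $# -gt 1 ]; then
          case "$2" in -*) : ;; *) shift ;; esac
        fi
        ;;
    esac
    shift
  done
  printf '%s\n' "${out# }"
}

filter_params --name KBR --unknown_param3 value3 --emptyLargeCollectors
# prints: --name KBR --emptyLargeCollectors
```

This would keep the change entirely inside the shell layer, at the cost of maintaining the whitelist per analysis.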
Based on the comment at #issuecomment-2475769005, the (first) implementation follows these steps: