hariszaf opened this issue 2 years ago
@hariszaf, upfront:
IMHO, the issue with the formal parameter-description is this:
- the pema developer decides what are sensible parameters to the process
- it is up to the end-user to leverage these by providing actual parameter files for every run
- the knowledge transfer from dev to user about "what can each available parameter do for me" is now based on natural text / descriptions added as `#` comment-lines in the provided sample parameters.tsv
The LW-IJI workflow project today has a form / UI that allows the user to edit and configure those exact pema parameters.
This form is created by LW-IJI developers, who learned what parameters exist from reading the parameters.tsv.
The challenge though is that there are multiple versions of pema (and more might be coming) and each of those might introduce new parameters.
So the opportunity arises to formally describe all the aspects and properties of the parameters:
for LW-IJI to be able to consume and apply this form-building information, it needs to be provided in a well-established, machine-readable format.
This raises a number of related questions:
Do we stick to providing parameters in tsv format?
Do we keep everything in one file?
My suggestion: keep everything in the `parameter.tsv` file and provide a simple python thingy that converts that into the desired format (json should do) -- e.g. running `pema-describe.py --format «format»` to convert it into an actual machine-consumable `parameter-description.json` next to the existing `parameter.tsv`.

Something like this in the `parameter.tsv` sample file that is part of pema?

# :param: maxInfo
# :maxInfo.title: Maximum Information
# :maxInfo.description: Performs an adaptive quality trim, balancing the benefits of retaining longer reads against the costs of retaining bases with errors.
# :maxInfo.type: bool
# :maxInfo.required: Yes
#
maxInfo Yes
which a conversion run (e.g. `pema-describe.py --format json`) would then turn into:

{ "about": "parameters for version vX.Y.Z",
  "params": [
    { "name": "maxInfo",
      "title": "Maximum Information",
      "description": "Performs an adaptive... errors.",
      "required": true,
      "default": true
    }, ...
  ]
}
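Purely as a sketch (not existing pema code; the script layout and the lack of type coercion are just placeholders for discussion), such a converter could be as small as:

```python
import json
import re
import sys

# matches structured comment lines like "# :maxInfo.title: Maximum Information"
PROP_LINE = re.compile(r"^#\s*:(\w+)\.(\w+):\s*(.*)$")

def describe(tsv_path):
    """Collect the ':name.property: value' comments into one record per parameter."""
    params = {}
    with open(tsv_path) as fh:
        for line in fh:
            match = PROP_LINE.match(line.strip())
            if match:
                name, prop, value = match.groups()
                params.setdefault(name, {"name": name})[prop] = value
    # coercion of "Yes"/"true" style values into booleans is left out of this sketch
    return {"about": "parameters for version vX.Y.Z", "params": list(params.values())}

if __name__ == "__main__":
    # e.g.  python pema-describe.py parameters.tsv > parameter-description.json
    json.dump(describe(sys.argv[1]), sys.stdout, indent=2)
```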
Hello there.
> ro-crate has less of a direct connection to this

I was thinking of it a few steps later, so we can forget about it for now.
> the actual format of the parameters file

Yes, sure! I only mentioned that as we had discussed the bds constraint, so we can forget about that too for the time being.
> The challenge though is that there are multiple versions of pema (and more might be coming) and each of those might introduce new parameters.

The challenge is well described in your comment and pretty clear to me. :rocket:
> Do we stick to providing parameters in tsv format?

It is my belief that if we move from `.tsv` to `.json` now, it will be a problem for the IJI people. I would say we stick to `.tsv` and we ask them if they would be ok with it; `json` is an option for pema, and that's the good news I tried to share in my initial comment. :smile:
> Do we keep everything in one file?

I would say yes, and, as you suggest, go for a python script that captures the structured information inside the `parameters.tsv`.
I will start working on the `parameters.tsv` file following your `maxInfo` example. Once it is ready, I could easily write the `pema-describe.py`. If you have something like a prototype or template for that, it would be great!
What are your thoughts on that?
If you think that we are ok to go, then my main question is this:
Assuming you have a required parameter A and, based on your selection, parameter B might be mandatory or optional: how would you denote the `required` section for parameter B?
Hi @marc-portier, hi @cpavloud! :wave:
Would you like to have a look at the parameters_structured.tsv?
It needs work, but I tried to have a first draft so we can see the challenges.
For example, line 337, where a parameter is required if one of a set of marker genes is the one under study, but not if another one is selected.
@hariszaf, some comments based on your first work:
# :gene.prefix: gene_
The question popping up is: "What do you expect a form-generator (or anybody else) to do with this information?"
To me, it reads as a validation rule --> all values need to start with this prefix. In general, I think we need to give 'validation' some more thought. For this specific case, I would express the check as a regular-expression match: /^gene_.*$/
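Purely as an illustration (not pema code), that rule applied in python:

```python
import re

gene_rule = re.compile(r"^gene_.*$")  # the prefix requirement expressed as a validation regex

print(bool(gene_rule.match("gene_16S")))  # True: has the required prefix
print(bool(gene_rule.match("16S")))       # False: prefix missing, value is invalid
```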
# :targetLength.values:
# - 16S:
# - 12S:
# - 18S:
# - COI:
# - ITS:
#
targetLength 180
Indeed, further down the tsv file I saw you also need additional descriptions for values:
# :pandaseqAlgorithm.values:
# - pear:
# description: uses the formula described in the PEAR paper (Zhang 2013), optionally with the probability of a random base (q) provide
# - simple_bayesian:
# description: uses the formula described in the original paper (Masella 2012), optionally with an error estimation (ε) provided.
# - stich
# - flash
and have even more types of validation:
# :threshold.type: integer
# :threshold.values: 0.0 < threshold < 1.0
threshold 0.6
But worse: these requirements effectively push the structure into a nested model (not a simple flat table any longer), and they simply break the capacities of the simple syntax I started suggesting... This makes me think we should maybe approach this from a different angle altogether? Motivation: I would rather go for something with better support from the get-go than keep on fixing as we go along.
So, what to think about this new idea: embed the structured description as yaml behind a special `#=` comment marker?
Like that we can interlace (similar to my previous suggestion) the yaml with the tsv, but also introduce a more flexible structure.
We then get a simple tsv file with 3 kinds of lines:
- no `#` at the start --> simple data line with actual parameter name + value
- `#` at the start --> simple comment line, hidden from the classic tsv processing
- `#=` at the start --> special extra indicator, indicates part of the yaml

The python code to generate different variants out of this then also becomes a lot easier: select the lines starting with `#=`, remove the marker, but keep the lines behind it; this should give valid yaml.

The result would look something along these lines:
# This is a tsv parameter file for PEMA - containing embedded `#= yaml` lines to formally describe them
#= about: parameters description for pema vX.Y.Z
#= parameters:
# ---------------- gene
#=   gene:
#=     title: marker gene
#=     description: 'Indicates the marker gene of the study. Currently PEMA supports the analysis of 16S, 12S, 18S, ITS and COI marker genes. Add the name of the marker gene after the underscore ("_")'
#=     type: string
#=     required: true   # need to check but I think yaml boolean values must be "(T|t)rue|(F|f)alse"
#=     validation:
#=       - regex: /^gene_.*$/
#=         description: the value must start with the prefix gene_
# ----------------
gene gene_16S
# ---------------- targetLength
#=   targetLength:
#=     title: maximum information
#=     description: 'A Trimmomatic parameter. Specifies the read length which is likely to allow the location of the read within the target sequence to be determined.'
#=     type: integer
#=     required: true
#=     validation:
#=       - value: 16S
#=         description: optional description for this value
#=       - value: 12S
#=       - value: 18S
#=       - value: COI
#=       - value: ITS
# ----------------
targetLength 18S
# --------------- pandaseqAlgorithm
#=   pandaseqAlgorithm:
#=     title: minimum length of the amplicon
#=     description: Merging algorithm of the PANDAseq parameter.
#=     type: string
#=     required: true
#=     default: simple_bayesian
#=     validation:
#=       - value: pear
#=         description: uses the formula described in the PEAR paper (Zhang 2013), optionally with the probability of a random base (q) provided
#=       - value: simple_bayesian
#=         description: uses the formula described in the original paper (Masella 2012), optionally with an error estimation (ε) provided.
#=       - value: stich
#=       - value: flash
# ----------------
pandaseqAlgorithm simple_bayesian
# ---------------- threshold
#=   threshold:
#=     title: threshold
#=     description: 'Sets the score that a sequence must meet to be kept in the output. Any alignments lower than this will be discarded as low quality. Increasing this number will not necessarily prevent uncalled bases (Ns) from appearing in the final sequence. It is also used as the threshold to match primers, if primers are supplied.'
#=     type: double
#=     required: true
#=     default: 1.0
#=     validation:
#=       - range:
#=           minimum: 0.0
#=           maximum: 1.0
# ----------------
threshold 0.6
I think this has the bonus of increasing flexibility and expressiveness, some less typing, and far easier parsing logic (using known/existing libraries, just some string filtering on the `#=` lines), at the mild cost of counting spaces to have proper yaml.
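To make that filtering concrete, a rough sketch (assuming PyYAML is available; the function name is made up):

```python
import yaml  # PyYAML, assumed to be available

def extract_description(tsv_path):
    """Keep only the '#=' lines, strip the marker, and hand the rest to a yaml parser."""
    kept = []
    with open(tsv_path) as fh:
        for line in fh:
            if line.startswith("#="):
                kept.append(line[2:])  # drop the marker, keep the indentation behind it
    return yaml.safe_load("".join(kept))

# description = extract_description("parameters.tsv")
# print(description["about"])
# print(list(description["parameters"]))
```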
Finally, I come to your last question on values of one parameter influencing the properties of another. Again, I see your example as a form of validation, and again I recognize how it pushes for more expressiveness and flexibility.
To answer it I would first want us to take a step back and ask again: "What do you expect a form-generator (or anybody else) to do with this information?"
Its role is to assist the end user to the best of its ability to enter meaningful sets of parameters. By doing some early validity-checks we hope to avoid wasting time on starting a run just to wait for a failure message. (aka fail fast strategy)
But we have to understand that this has its limits: the nature of a pre-validation is doing a light variant of the actual (lengthy) thing. The light variant by definition will never be able to check for everything.
So in this particular case we could end up declaring the parameter as not required and hoping for the best. Possibly adding some hints into the description?
If this is not satisfactory, we can reconsider later.
Still, when we do so, we have to realize that the current notation already suggests quite some elaborate functionality to be added into this form-generator module: the 'validation' section we introduced already covers fixed value lists, regex patterns and ranges.
I think it is acceptable to have these expectations, as html5 forms have provisions for these cases.
In fact --> I think the pema-describe.py could easily include support to produce the html-form format itself!
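For example (a hedged sketch only, not an existing pema-describe.py feature; the helper name is made up, the validation structure is the one from the example above), the regex and range rules map quite directly onto html5 input attributes:

```python
def input_for(name, spec):
    """Render one html5 <input> for a parameter description (regex and range rules only)."""
    attrs = [f'name="{name}"', 'type="text"']
    for rule in spec.get("validation", []):
        if "regex" in rule:
            # html5 pattern attributes take the bare expression, without the /.../ delimiters
            attrs.append(f'pattern="{rule["regex"].strip("/")}"')
        elif "range" in rule:
            attrs = [f'name="{name}"', 'type="number"', 'step="any"',
                     f'min="{rule["range"]["minimum"]}"', f'max="{rule["range"]["maximum"]}"']
    if spec.get("required"):
        attrs.append("required")
    return f'<input {" ".join(attrs)}/>'

# e.g. for the threshold example above:
# input_for("threshold", {"type": "double", "required": True,
#                         "validation": [{"range": {"minimum": 0.0, "maximum": 1.0}}]})
# -> '<input name="threshold" type="number" step="any" min="0.0" max="1.0" required/>'
```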
The case of conditional rules however is quite different, and one I would just not go into, as they would require some kind of programming-language-syntax to express the wide range of possible 'conditionals' one would want to check for. (e.g. something like javascript itself - but then some form-data model would need to be agreed upon as interface, or a declarative schema definition language like https://digital-preservation.github.io/csv-schema/csv-schema-1.2.html - but then somebody would need to write a javascript implementation for it to be used in the generated html form)
So. @hariszaf, after all this. What do you think?
If we agree:
- can we move to the `#=` yaml comment inside tsv?
- move the `prefix` and `values` under `validation`?
- limit ourselves to validation based on values, regex-patterns and ranges?

(Actually, in this new approach it feels generic enough to be handled by its own python module, say `ymlargs`.)
Hi Marc and many thanks for the thorough feedback!
> can we move to the `#=` yaml comment inside tsv?

There is no problem at all. I see your point for yaml and it is a really nice idea.
> move the `prefix` and `values` under `validation`

Ok, here are the things I wanted to discuss with you: "What do you expect a form-generator (or anybody else) to do with this information?" So, my intention was: how can we denote a range of values if there is one? How can we say there is a prefix that always needs to be part of the parameter?
# :targetLength.values:
# - 16S:
# - 12S:
# - 18S:
# - COI:
# - ITS:
#
targetLength 180
This was nonsense and that is why it does not make sense to you. Have a look at this instead:
# :targetLength.values:
# - 16S: 150
# - 12S: 200
# - 18S: 200
# - COI: 250
# - ITS: 180
#
targetLength 180
I was thinking of what Katrina mentioned about suggesting values in the various cases, so maybe it could be something like alternative "by-defaults". If I get this right, the `validation` would include criteria on whether your parameter value will not make pema fail in a deterministic way, right? So probably we should not have these values under `validation` if that is the case.
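To make my intention concrete, a rough sketch (the values are just the ones from the example above, and the function name is made up) of how a form could use such a gene -> suggested-length mapping to prefill the field:

```python
# suggested targetLength per marker gene, taken from the example above
SUGGESTED_TARGET_LENGTH = {"16S": 150, "12S": 200, "18S": 200, "COI": 250, "ITS": 180}

def prefill_target_length(gene_value):
    """Suggest a targetLength based on the chosen gene parameter (e.g. 'gene_16S')."""
    marker = gene_value.removeprefix("gene_")
    return SUGGESTED_TARGET_LENGTH.get(marker)  # None when there is no suggestion

print(prefill_target_length("gene_ITS"))  # 180
```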
> limit ourselves to validation based on values, regex-patterns and ranges?

Definitely yes. This was just a first draft to have something to work with and to somehow explain the challenges to you. Sorry if I messed things up.
If I understand correctly, you would want people to pick a value from a selection list; this selection is done by label (16S, 12S, 18S, COI, ITS), but the actual selected value in the background is the matching number (150, 200, ...), right?
Well, I agree it is a bit of a stretch, but I don't think it is entirely wrong to consider "a limited list of accepted values" as some kind of validation, right? The fact that an html form visualizes this affordance as a selection-list doesn't change the nature of "only accepting these values as valid options" -- note that the range (min..max) rule might very well be visualized through something like a slider?
Anyway, if my assumption holds, I think we could maybe adapt this case to:
# ---------------- targetLength
#= targetLength:
#=   type: integer
#=   validation:
#=     - label: 16S
#=       value: 150
#=     - label: 12S
#=       value: 200
#=     - label: 18S
#=       value: 200   # is it correct that there are two with 200?
#=     - label: COI
#=       value: 250
#=     - label: ITS
#=       value: 180
# ----------------
targetLength 180
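Purely to illustrate (a sketch with a made-up helper, not something pema-describe.py does today), a form generator could turn that label/value list into an html select, showing the label but submitting the value:

```python
def select_for(name, spec):
    """Render the label/value validation entries as an html <select>."""
    options = []
    for rule in spec.get("validation", []):
        if "label" in rule and "value" in rule:
            options.append(f'<option value="{rule["value"]}">{rule["label"]}</option>')
    return f'<select name="{name}">' + "".join(options) + "</select>"

# e.g. for the targetLength example above (first two entries shown):
# select_for("targetLength", {"type": "integer",
#                             "validation": [{"label": "16S", "value": 150},
#                                            {"label": "12S", "value": 200}]})
# -> '<select name="targetLength"><option value="150">16S</option><option value="200">12S</option></select>'
```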
wdyt?
Oh... no, I did not say that right.
So, the parameter value is always just an integer; it could be between 0 and, let's say, 500.
But most of the time the value you will give is related to the gene that your data are coming from.
I would say we leave this for now and we are doing everything else the way we discussed.
Then we see if we want to build on that anything else.
I will probably do the conversion to the `#=` format during the weekend and I'll let you know.
cool, from my end I might get started on that generic py lib that can read and convert that yml-tsv combo file... if so I will probably end up in a repo in our space at https://github.com/vliz-be-opsci
Cool cool! Feel free to fork or init a repo, whatever is easier for you.
Many thanks and I'll get back to this asap.
@marc-portier I have a question.
So, in its current state, pema does not have actual `default` parameter values, meaning that if the user leaves a parameter empty there is no default value that will be used.
So I am thinking of removing all `default` values and having something like `suggested-cases`, or nothing at all.
This way, we could also have the previous `values` integrated in that key and leave the `validation` only for tests such as the ones you suggested.
Wouldn't that make more sense?
I was also thinking (again) about your question "What do you expect a form-generator (or anybody else) to do with this information?"
So, what if we used this `suggested-values` section so that the IJI platform would pop it up as a message to the user, with some hints on what values would make sense for their case?
What do you think?
The proper usage / semantics of this `default` crossed my mind too.
Indeed, the current parameters.tsv just has provision to store the actual value, and not to describe the fall-back-if-not-provided (no value).
So if there is no such behavior in the code, then we should just drop it and not add such a thing. I just started out thinking there was such a thing, and wanted to make sure that filling in any chosen value would not actually mask that information if it was not kept separately.
From that angle I don't think we need `suggested-values` either -- those would be described in the accepted (validated) values (options)?
This issue is about describing the parameter file that is required to run PEMA in a machine-interoperable way: describing the formats and including defaults for all entries in the file, for example. Work has begun on this, but it needs a review and completion.
@marc-portier BigDataScript supports reading `json` files; see here.
The `readParameterFile` function of `pema` reads the `.tsv` parameters file line-by-line to return a bds "dictionary".
We could edit this function to read a `.json` file instead and have this `.json` file in an RO-Crate oriented way. :sunglasses: