hariszaf opened this issue 2 years ago
@hariszaf, upfront:
IMHO, the issue with the formal parameter-description is this:
- the pema developer decides what are sensible parameters to the process
- it is up to the end-user to leverage these by providing actual parameter files for every run
- the knowledge transfer from dev to user about "what can each available parameter do for me" is now based on natural text / descriptions added as `#` comment-lines in the provided sample parameters.tsv
The LW-IJI workflow project today has a form / UI that allows the user to edit and configure those exact pema parameters.
This form is created by LW-IJI developers, who learned what parameters exist from reading the parameters.tsv.
The challenge though is that there are multiple versions of pema (and more might be coming) and each of those might introduce new parameters.
So the opportunity arises to formally describe all the aspects and properties of the parameters:
for LW-IJI to be able to consume and apply this form-building information, it needs to be provided in a well-established, machine-readable format.
This raises a number of related questions:
Do we stick to providing parameters in tsv format?
Do we keep everything in one file?
My suggestion: keep everything in the `parameter.tsv` file and provide a simple python thingy that converts that into the desired format (json should do) -- e.g. running `pema-describe.py --format «format»` to convert it into an actual machine-consumable `parameter-description.json` next to the existing `parameter.tsv`.

Something like this in the `parameter.tsv` sample file that is part of pema?

# :param: maxInfo
# :maxInfo.title: Maximum Information
# :maxInfo.description: Performs an adaptive quality trim, balancing the benefits of retaining longer reads against the costs of retaining bases with errors.
# :maxInfo.type: bool
# :maxInfo.required: Yes
#
maxInfo Yes
which a conversion run (e.g. `pema-describe.py --format json`) would then turn into:

{ "about": "parameters for version vX.Y.Z",
  "params": [
    { "name": "maxInfo",
      "title": "Maximum Information",
      "description": "Performs an adaptive... errors.",
      "required": true,
      "default": true
    }, ...
  ]
}
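Purely as a sketch (not existing pema code; the script layout and the lack of type coercion are just placeholders for discussion), such a converter could be as small as:

```python
import json
import re
import sys

# matches structured comment lines like "# :maxInfo.title: Maximum Information"
PROP_LINE = re.compile(r"^#\s*:(\w+)\.(\w+):\s*(.*)$")

def describe(tsv_path):
    """Collect the ':name.property: value' comments into one record per parameter."""
    params = {}
    with open(tsv_path) as fh:
        for line in fh:
            match = PROP_LINE.match(line.strip())
            if match:
                name, prop, value = match.groups()
                params.setdefault(name, {"name": name})[prop] = value
    # coercion of "Yes"/"true" style values into booleans is left out of this sketch
    return {"about": "parameters for version vX.Y.Z", "params": list(params.values())}

if __name__ == "__main__":
    # e.g.  python pema-describe.py parameters.tsv > parameter-description.json
    json.dump(describe(sys.argv[1]), sys.stdout, indent=2)
```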
Hello there.
> ro-crate has less of a direct connection to this

I was thinking of it a few steps later, so we can forget about it for now.
> the actual format of the parameters file

Yes, sure! I only mentioned that as we had discussed the bds constraint, so we can forget about that too for the time being.
> The challenge though is that there are multiple versions of pema (and more might be coming) and each of those might introduce new parameters.

The challenge is well described in your comment and pretty clear to me. :rocket:
> Do we stick to providing parameters in tsv format?

It is my belief that if we move from `.tsv` to `.json` now, it will be a problem for the IJI people. I would say we stick to `.tsv` and we ask them if they would be ok with it; `json` is an option for pema, and that's the good news I tried to share in my initial comment. :smile:
> Do we keep everything in one file?

I would say yes, and, as you suggest, go for a python script that captures the structured information inside the `parameters.tsv`.
I will start working on the `parameters.tsv` file following your `maxInfo` example. Once it is ready, I could easily write the `pema-describe.py`. If you have something like a prototype or template for that, it would be great!
What are your thoughts on that?
If you think that we are ok to go, then my main question is this:
Assuming you have a required parameter A and, based on your selection, parameter B might be mandatory or optional: how would you denote the `required` section for parameter B?
Hi @marc-portier, hi @cpavloud! :wave:
Would you like to have a look at the parameters_structured.tsv?
It needs work, but I tried to have a first draft so we can see the challenges.
For example, line 337, where a parameter is required if one of a set of marker genes is the one under study, but not if another one is selected.
@hariszaf, some comments based on your first work:
# :gene.prefix: gene_
The question popping up is: "What do you expect a form-generator (or anybody else) to do with this information?"
To me, it reads as a validation rule --> all values need to start with this prefix. In general, I think we need to give 'validation' some more thought. For this specific case, I would express the check as a regular-expression match: /^gene_.*$/
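Purely as an illustration (not pema code), that rule applied in python:

```python
import re

gene_rule = re.compile(r"^gene_.*$")  # the prefix requirement expressed as a validation regex

print(bool(gene_rule.match("gene_16S")))  # True: has the required prefix
print(bool(gene_rule.match("16S")))       # False: prefix missing, value is invalid
```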
# :targetLength.values:
# - 16S:
# - 12S:
# - 18S:
# - COI:
# - ITS:
#
targetLength 180
Indeed, further down the tsv file I saw you also need additional descriptions for values:
# :pandaseqAlgorithm.values:
# - pear:
# description: uses the formula described in the PEAR paper (Zhang 2013), optionally with the probability of a random base (q) provide
# - simple_bayesian:
# description: uses the formula described in the original paper (Masella 2012), optionally with an error estimation (ε) provided.
# - stich
# - flash
and have even more types of validation:
# :threshold.type: integer
# :threshold.values: 0.0 < threshold < 1.0
threshold 0.6
But worse: these requirements effectively push the structure into a nested model (not a simple flat table any longer), and they simply break the capacities of the simple syntax I started suggesting... This makes me think we should maybe approach this from a different angle altogether? Motivation: I would rather go for something with better support from the get-go than keep on fixing as we go along.
So, what to think about this new idea: embed the structured description as yaml behind a special `#=` comment marker?
Like that we can interlace (similar to my previous suggestion) the yaml with the tsv, but also introduce a more flexible structure.
We then get a simple tsv file with 3 kinds of lines:
- no `#` at the start --> simple data line with actual parameter name + value
- `#` at the start --> simple comment line, hidden from the classic tsv processing
- `#=` at the start --> special extra indicator, indicates part of the yaml

The python code to generate different variants out of this then also becomes a lot easier: select the lines starting with `#=`, remove the marker, but keep the lines behind it; this should give valid yaml.

The result would look something along these lines:
# This is a tsv parameter file for PEMA - containing embedded `#= yaml` lines to formally describe them
#= about: parameters description for pema vX.Y.Z
#= parameters:
# ---------------- gene
#=   gene:
#=     title: marker gene
#=     description: 'Indicates the marker gene of the study. Currently PEMA supports the analysis of 16S, 12S, 18S, ITS and COI marker genes. Add the name of the marker gene after the underscore ("_")'
#=     type: string
#=     required: true   # need to check but I think yaml boolean values must be "(T|t)rue|(F|f)alse"
#=     validation:
#=       - regex: /^gene_.*$/
#=         description: the value must start with the prefix gene_
# ----------------
gene gene_16S
# ---------------- targetLength
#=   targetLength:
#=     title: maximum information
#=     description: 'A Trimmomatic parameter. Specifies the read length which is likely to allow the location of the read within the target sequence to be determined.'
#=     type: integer
#=     required: true
#=     validation:
#=       - value: 16S
#=         description: optional description for this value
#=       - value: 12S
#=       - value: 18S
#=       - value: COI
#=       - value: ITS
# ----------------
targetLength 18S
# --------------- pandaseqAlgorithm
#=   pandaseqAlgorithm:
#=     title: minimum length of the amplicon
#=     description: Merging algorithm of the PANDAseq parameter.
#=     type: string
#=     required: true
#=     default: simple_bayesian
#=     validation:
#=       - value: pear
#=         description: uses the formula described in the PEAR paper (Zhang 2013), optionally with the probability of a random base (q) provided
#=       - value: simple_bayesian
#=         description: uses the formula described in the original paper (Masella 2012), optionally with an error estimation (ε) provided.
#=       - value: stich
#=       - value: flash
# ----------------
pandaseqAlgorithm simple_bayesian
# ---------------- threshold
#=   threshold:
#=     title: threshold
#=     description: 'Sets the score that a sequence must meet to be kept in the output. Any alignments lower than this will be discarded as low quality. Increasing this number will not necessarily prevent uncalled bases (Ns) from appearing in the final sequence. It is also used as the threshold to match primers, if primers are supplied.'
#=     type: double
#=     required: true
#=     default: 1.0
#=     validation:
#=       - range:
#=           minimum: 0.0
#=           maximum: 1.0
# ----------------
threshold 0.6
I think this has the bonus of increasing flexibility and expressiveness, some less typing, and far easier parsing logic (using known/existing libraries, just some string filtering on the `#=` lines), at the mild cost of counting spaces to have proper yaml.
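To make that filtering concrete, a rough sketch (assuming PyYAML is available; the function name is made up):

```python
import yaml  # PyYAML, assumed to be available

def extract_description(tsv_path):
    """Keep only the '#=' lines, strip the marker, and hand the rest to a yaml parser."""
    kept = []
    with open(tsv_path) as fh:
        for line in fh:
            if line.startswith("#="):
                kept.append(line[2:])  # drop the marker, keep the indentation behind it
    return yaml.safe_load("".join(kept))

# description = extract_description("parameters.tsv")
# print(description["about"])
# print(list(description["parameters"]))
```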
Finally, I come to your last question on values of one parameter influencing the properties of another. Again, I see your example as a form of validation, and again I recognize how it pushes for more expressiveness and flexibility.
To answer it I would first want us to take a step back and ask again: "What do you expect a form-generator (or anybody else) to do with this information?"
Its role is to assist the end user to the best of its ability to enter meaningful sets of parameters. By doing some early validity-checks we hope to avoid wasting time on starting a run just to wait for a failure message. (aka fail fast strategy)
But we have to understand that this has its limits: the nature of a pre-validation is doing a light variant of the actual (lengthy) thing. The light variant by definition will never be able to check for everything.
So in this particular case we could end up declaring the parameter as not required and hoping for the best. Possibly adding some hints into the description?
If this is not satisfactory, we can reconsider later.
Still, when we do so, we have to realize that the current notation already suggests quite some elaborate functionality to be added into this form-generator module: the 'validation' section we introduced already covers fixed value lists, regex patterns and ranges.
I think it is acceptable to have these expectations, as html5 forms have provisions for these cases.
In fact --> I think the pema-describe.py could easily include support to produce the html-form format itself!
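For example (a hedged sketch only, not an existing pema-describe.py feature; the helper name is made up, the validation structure is the one from the example above), the regex and range rules map quite directly onto html5 input attributes:

```python
def input_for(name, spec):
    """Render one html5 <input> for a parameter description (regex and range rules only)."""
    attrs = [f'name="{name}"', 'type="text"']
    for rule in spec.get("validation", []):
        if "regex" in rule:
            # html5 pattern attributes take the bare expression, without the /.../ delimiters
            attrs.append(f'pattern="{rule["regex"].strip("/")}"')
        elif "range" in rule:
            attrs = [f'name="{name}"', 'type="number"', 'step="any"',
                     f'min="{rule["range"]["minimum"]}"', f'max="{rule["range"]["maximum"]}"']
    if spec.get("required"):
        attrs.append("required")
    return f'<input {" ".join(attrs)}/>'

# e.g. for the threshold example above:
# input_for("threshold", {"type": "double", "required": True,
#                         "validation": [{"range": {"minimum": 0.0, "maximum": 1.0}}]})
# -> '<input name="threshold" type="number" step="any" min="0.0" max="1.0" required/>'
```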
The case of conditional rules however is quite different, and one I would just not go into, as they would require some kind of programming-language-syntax to express the wide range of possible 'conditionals' one would want to check for. (e.g. something like javascript itself - but then some form-data model would need to be agreed upon as interface, or a declarative schema definition language like https://digital-preservation.github.io/csv-schema/csv-schema-1.2.html - but then somebody would need to write a javascript implementation for it to be used in the generated html form)
So. @hariszaf, after all this. What do you think?
If we agree:
- can we move to the `#=` yaml comment inside tsv?
- move the `prefix` and `values` under `validation`?
- limit ourselves to validation based on values, regex-patterns and ranges?

(Actually, in this new approach it feels generic enough to be handled by its own python module, say `ymlargs`.)
Hi Marc and many thanks for the thorough feedback!
> can we move to the `#=` yaml comment inside tsv?

There is no problem at all. I see your point for yaml and it is a really nice idea.
> move the `prefix` and `values` under `validation`

Ok, here are the things I wanted to discuss with you: "What do you expect a form-generator (or anybody else) to do with this information?" So, my intention was: how can we denote a range of values if there is one? How can we say there is a prefix that always needs to be part of the parameter?
# :targetLength.values:
# - 16S:
# - 12S:
# - 18S:
# - COI:
# - ITS:
#
targetLength 180
This was nonsense and that is why it does not make sense to you. Have a look at this instead:
# :targetLength.values:
# - 16S: 150
# - 12S: 200
# - 18S: 200
# - COI: 250
# - ITS: 180
#
targetLength 180
I was thinking of what Katrina mentioned about suggesting values in the various cases, so maybe it could be something like alternative "by-defaults". If I get this right, the `validation` would include criteria on whether your parameter value will not make pema fail in a deterministic way, right? So probably we should not have these values under `validation` if that is the case.
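To make my intention concrete, a rough sketch (the values are just the ones from the example above, and the function name is made up) of how a form could use such a gene -> suggested-length mapping to prefill the field:

```python
# suggested targetLength per marker gene, taken from the example above
SUGGESTED_TARGET_LENGTH = {"16S": 150, "12S": 200, "18S": 200, "COI": 250, "ITS": 180}

def prefill_target_length(gene_value):
    """Suggest a targetLength based on the chosen gene parameter (e.g. 'gene_16S')."""
    marker = gene_value.removeprefix("gene_")
    return SUGGESTED_TARGET_LENGTH.get(marker)  # None when there is no suggestion

print(prefill_target_length("gene_ITS"))  # 180
```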
> limit ourselves to validation based on values, regex-patterns and ranges?

Definitely yes. This was just a first draft to have something to work with and to somehow explain the challenges to you. Sorry if I messed things up.
If I understand correctly, you would want people to pick a value from a selection list; this selection is done by label (16S, 12S, 18S, COI, ITS), but the actual selected value in the background is the matching number (150, 200, ...), right?
Well, I agree it is a bit of a stretch, but I don't think it is entirely wrong to consider "a limited list of accepted values" as some kind of validation, right? The fact that an html form visualizes this affordance as a selection-list doesn't change the nature of "only accepting these values as valid options" -- note that the range (min..max) rule might very well be visualized through something like a slider?
Anyway, if my assumption holds, I think we could maybe adapt this case to:
# ---------------- targetLength
#= targetLength:
#=   type: integer
#=   validation:
#=     - label: 16S
#=       value: 150
#=     - label: 12S
#=       value: 200
#=     - label: 18S
#=       value: 200   # is it correct that there are two with 200?
#=     - label: COI
#=       value: 250
#=     - label: ITS
#=       value: 180
# ----------------
targetLength 180
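Purely to illustrate (a sketch with a made-up helper, not something pema-describe.py does today), a form generator could turn that label/value list into an html select, showing the label but submitting the value:

```python
def select_for(name, spec):
    """Render the label/value validation entries as an html <select>."""
    options = []
    for rule in spec.get("validation", []):
        if "label" in rule and "value" in rule:
            options.append(f'<option value="{rule["value"]}">{rule["label"]}</option>')
    return f'<select name="{name}">' + "".join(options) + "</select>"

# e.g. for the targetLength example above (first two entries shown):
# select_for("targetLength", {"type": "integer",
#                             "validation": [{"label": "16S", "value": 150},
#                                            {"label": "12S", "value": 200}]})
# -> '<select name="targetLength"><option value="150">16S</option><option value="200">12S</option></select>'
```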
wdyt?
Oh... no, I did not say that right.
So, the parameter value is always just an integer; it could be between 0 and, let's say, 500.
But most of the time the value you will give is related to the gene that your data are coming from.
I would say we leave this for now and we are doing everything else the way we discussed.
Then we see if we want to build on that anything else.
I will probably do the conversion to the `#=` format during the weekend and I'll let you know.
cool, from my end I might get started on that generic py lib that can read and convert that yml-tsv combo file... if so I will probably end up in a repo in our space at https://github.com/vliz-be-opsci
Cool cool! Feel free to fork or init a repo, whatever is easier for you.
Many thanks and I'll get back to this asap.
@marc-portier I have a question.
So, in its current state, pema does not have actual `default` parameter values, meaning that if the user leaves a parameter empty there is no default value that will be used.
So I am thinking of removing all `default` values and having something like `suggested-cases`, or nothing at all.
This way, we could also have the previous `values` integrated in that key and leave the `validation` only for tests such as the ones you suggested.
Wouldn't that make more sense?
I was also thinking (again) about your question "What do you expect a form-generator (or anybody else) to do with this information?"
So, what if we used this `suggested-values` section so that the IJI platform would pop it up as a message to the user, with some hints on what values would make sense for their case?
What do you think?
The proper usage / semantics of this `default` crossed my mind too.
Indeed, the current parameters.tsv just has provision to store the actual value, and not to describe the fall-back-if-not-provided (no value).
So if there is no such behavior in the code, then we should just drop it and not add such a thing. I just started out thinking there was such a thing, and wanted to make sure that filling in any chosen value would not actually mask that information if it was not kept separately.
From that angle I don't think we need `suggested-values` either -- those would be described in the accepted (validated) values (options)?
This issue is about describing the parameter file that is required to run PEMA in a machine-interoperable way: describing the formats and including defaults for all entries in the file, for example. Work has begun on this, but it needs a review and completion.
@marc-portier BigDataScript supports reading `json` files; see here.
The `readParameterFile` function of `pema` reads the `.tsv` parameters file line-by-line to return a bds "dictionary".
We could edit this function to read a `.json` file instead and have this `.json` file in an RO-Crate oriented way. :sunglasses: