Open kverstae opened 4 months ago
Hi @kverstae, thanks for opening this issue! Just for the sake of convenience, I wanted to mention that issue #619 is closely related to this one.
Hey @kverstae ! Thanks for creating this issue!
I've been thinking about this for a bit, specifically how to implement such functionality without rewriting everything from scratch and introducing very different behaviours and functionality.
Since values are passed in a delimiter-based way (e g. Delimiter is ;
), it's important to be able to distinguish between empty lists and lists with an "empty" value. As such, I would propose the following:
Integers, booleans and doubles can become nullable. For this to happen, an empty value needs to be passed -- that is ""
. When multiple is true, the same is still possible but an extra delimiter needs to be added. E g. "1;;3;4;
is interpreted as [1, NA, 3, 4]
, and "1;;;"
as [1, NA, NA]
. For the sake of backwards compatibility, if a string ends with a non-empty value without the presence of a separator, it should also work -- e.g. "1;2"
becomes [1, 2]
.
For strings, an empty value is an empty string. I'm inclined to propose to not differentiate between a missing value, because we would then need to introduce a keyword to denote the missing-ness, which woule introduce breaking changes.
We'd have to look into how each value gets translated into each of the different languages. In a follow-up post, we should outline a cross comparison of different examples and how it gets parsed into each of the different languages so we can build a unit test out of it.
Would that satisfy your needs @kverstae ?
Thanks for taking a look at this @rcannood
Your proposal would indeed satisfy my needs.
I personally am not a big fan of the trailing delimiter. If I would just look at 1;;3;4;
, I would interpret this as [1, NA, 3, 4, NA]
, since this is also how 'normal' delimited files (csv/tsv/...) work.
This, however is just a small detail and I wouldn't mind if it was needed.
For strings, you could argue that when multiple is true and you get something like foo;;bar
it would make sense to translate it into ["foo", NA, "bar"]
to be in line with integers/booleans/doubles. This would indeed break backwards compatibility, but could maybe be 'fixed' by defining a new parameter in _viash.yml
to specify how empty string values should be parsed: ""
or NA
/null
/...
Come to think of it, my proposal would cause an issue. Let me set up a 'test bench' to clarify.
There are multiple way of specifying arguments -- either from the CLI, in Groovy, or as a param_list csv / yaml.
Let us take the following list of arguments:
name: mycomp
arguments:
- type: integer
name: --optional_integer
default: 42
required: false
- type: string
name: --multiple_string
required: false
multiple: true
allow_missing: true # new field added
(I could add all possible variants of required arguments, optional arguments, multiple arguments; for types string, integer, double, boolean, but it would be just more of the same).
Let's say we want our channel to contain three events:
[
[ "event0", [optional_integer: 42, multiple_string: ["a", "b", "c"] ],
[ "event1", [optional_integer: 44] ],
[ "event2", [multiple_string: [] ],
[ "event3", [optional_integer: 42, multiple_string: ["b", null]] ]
]
In Nextflow / Groovy, this is already possible. We can just create a channel with exactly these parameters and presto. However, to ensure that the default value of optional_integer
is not used, we need to explicitly pass null
to this argument.
Channel.fromList([
// optional_integer is omitted, since it's the default
[ "event0", [multiple_string: ["a", "b", "c"] ],
// multiple_string is omitted, since the default is already what we want it to be
[ "event1", [optional_integer: 44] ],
// explicitly unset `optional_integer`, and explicitly set `multiple_string` to an empty list
[ "event2", [optional_integer: null, multiple_string: [] ],
// straight forward
[ "event3", [multiple_string: ["b", null]] ]
])
| mycomp.run(
fromState: ["optional_integer", "multiple_string"],
toState: [...]
)
All Viash workflows can also ingest a parameter list (csv, json, or yaml) using the param_list
argument. For instance:
./params.yaml
:
param_list:
- id: event0
multiple_string: ["a", "b", "c"]
- id: event1
optional_integer: 44
- id: event2
optional_integer: null
multiple_string: []
- id: event3
multiple_string: ["b", null]
and then running the component with nextflow run . -main-script target/nextflow/mycomp/main.nf -params-file ./params.yaml
should yield exactly the same result.
However, using a csv based parameter list causes issues here, because we can define the following csv:
id,optional_integer,multiple_string
event0,42,a;b;c
event1,44,NULL
event2,NULL,
event3,42,b;NULL
However, there are three issues with this:
There is currently no way to explicitly unset an argument (Related to #619). Let's add a reserved keyword like NULL, MISSING, NO_VALUE, UNSET.
As-is, it's not possible to distinguish between whether the value for multiple_string
in event2
represents []
or [""]
. To this extend, I would propose the separator notation discussed earlier:
` →
[]`;
→ [""]
foo;
→ ["foo"]
: separator just denotes the end of an elementfoo;bar
→ ["foo", "bar"]
: separator at the end is implicitfoo;;
→ ["foo", ""]
: extra separator denotes an empty string in the second valueWhen it comes to unsetting values:
foo;NULL;
→ ["foo", null]
: don't forget about missing valuesNULL
→ null
: unset the argumentNULL;
→ [null]
: A list containing a single missing valueWhen calling an executable from the CLI, we can apply the same parsing as with the CSV, except the user can also call the same argument multiple times. Specifically, the following should be valid:
param_list:
- id: event0
multiple_string: ["a", "b", "c"]
- id: event1
optional_integer: 44
- id: event2
optional_integer: null
multiple_string: []
- id: event3
multiple_string: ["b", null]
# event 0
target/executable/mycomp/mycomp \
--multiple_string a --multiple_string b --multiple_string c
# event 1
target/executable/mycomp/mycomp \
--optional_integer 44
# event 2
target/executable/mycomp/mycomp \
--optional_integer "NULL" \
--multiple_string ";"
# event 3
target/executable/mycomp/mycomp \
--multiple_string "b" --multiple_string "NULL"
However, the following should also be valid
# event 0
target/executable/mycomp/mycomp \
--multiple_string "a;b;c"
# event 1
target/executable/mycomp/mycomp \
--optional_integer 44
# event 2
target/executable/mycomp/mycomp \
--optional_integer "NULL" \
--multiple_string ";"
# event 3
target/executable/mycomp/mycomp \
--multiple_string "b;NULL"
And the latter should also work using the Nextflow CLI:
# event 1
nextflow run . -main-script target/nextflow/mycomp.main.nf \
--multiple_string "a;b;c"
# etc...
# event0
par <- list(optional_integer = 42L, multiple_string = c("a", "b", "c"))
# event 1
par <- list(optional_integer = 44L, multiple_string = NULL) # breaking change
# event 2
par <- list(optional_integer = NULL, multiple_string = character(0))
# event 3
par <- list(optional_integer = 42L, multiple_string = c("b", NA_character_))
# event0
par = {"optional_integer": 42, multiple_string: ["a", "b", "c"]}
# event 1
par = {"optional_integer" : 44, multiple_string: None} # breaking change
# event 2
par = {"optional_integer": None, multiple_string: []}
# event 3
par = {"optional_integer": 42, multiple_string: ["b", None]}
// todo
// todo
// todo
# event0
par_optional_integer=42
par_multiple_string="a;b;c;" # breaking change?
# event1
par_optional_integer=44
# par_multiple_string is unset
# event2
# par_optional_string is unset
par_multiple_string=""
# event3
par_optional_string=42
par_multiple_string="b;NULL;"
Though I would strongly recommend doing advanced processing with nullable multiple arguments in Bash :sweat_smile:
In conclusion, I propose the following two extensions:
NULL
. It can also be used to set missing values in multiple arguments e.g. b;NULL;
.Use trailing delimiter to to distinguish between an empty list []
and a list with an empty string [""]
. Some specific examples:
` →
[]`;
→ [""]
foo;
→ ["foo"]
: separator just denotes the end of an elementfoo;bar
→ ["foo", "bar"]
: separator at the end is implicitfoo;;
→ ["foo", ""]
: extra separator denotes an empty string in the second valueA multiple argument with no default value results in an unset value instead of an empty list. Examples:
unset par_arg
par <- list(arg = NULL)
par = {"arg": None}
@DriesSchaumont Do you have any comments regarding the usability of this? And the proposed breaking changes? @Grifs Are there any technical issues I'm overlooking?
Just two questions to clarify:
--multiple_string ''
would do in your proposal? Does this equal `` → []
?A general remark regarding making breaking changes: I think its OK on the condition that we introduce a viash config value to override the 'not set' value (as suggested in https://github.com/viash-io/viash/issues/619#issuecomment-2124576166) because this would allow setting this config value to avoid making changes to the scripts. In 0.9.0
we will introduce breaking changes to the config already, but if we can make sure that the breaking changes can be resolved by adjusting configs only I think thats a better situation than needing to adjust both scripts and configs.
- If I understood correctly, in your last proposal there will be no more differences in functionality between the different argument type with respect to the behavior of the 'null' value, correct (as opposed to the first proposal in [FEATURE] allow null in integer format when multiple: true #705
Indeed!
- Could you clarify what the case --multiple_string '' would do in your proposal? Does this equal `` → []?
With the latest proposal:
--multiple_string NULL
→ None
--multiple_string ""
→ []
--multiple_string ";"
→ [""]
--multiple_string "foo;"
→ ["foo"]
: separator just denotes the end of an element--multiple_string "foo;bar"
→ ["foo", "bar"]
: separator at the end is implicit--multiple_string "foo;;"
→ ["foo", ""]
: extra separator denotes an empty string in the second valueCome to think of it, being able to distinguish between []
and [""]
could also be solved by introducing yet another keyword. For instance, NULL
, NONE
, or UNDEFINED
could map to None
, and NIL
, or EMPTY_LIST
could map to []
. In that case, ""
would still map to [""]
and would thus result in fewer breaking change.
Maybe I prefer UNDEFINED
and EMPTY_LIST
better as it is more explicit.
- If I understood correctly, in your last proposal there will be no more differences in functionality between the different argument type with respect to the behavior of the 'null' value, correct (as opposed to the first proposal in [FEATURE] allow null in integer format when multiple: true [FEATURE] allow
null
in integer format whenmultiple: true
#705Indeed!
- Could you clarify what the case --multiple_string '' would do in your proposal? Does this equal `` → []?
With the latest proposal:
* `--multiple_string NULL` → `None` * `--multiple_string ""` → `[]` * `--multiple_string ";"` → `[""]` * `--multiple_string "foo;"` → `["foo"]`: separator just denotes the end of an element * `--multiple_string "foo;bar"` → `["foo", "bar"]`: separator at the end is implicit * `--multiple_string "foo;;"` → `["foo", ""]`: extra separator denotes an empty string in the second value
I concur with the remark from @kverstae that the use of the separator is counterintuitive here. Am I overlooking something when I say that this could also work?
--multiple_string NULL
→ None
--multiple_string ""
→ []
--multiple_string '""'
→ [""]
--multiple_string ";"
→ ["", ""]
--multiple_string "foo"
→ ["foo"]
--multiple_string "foo;"
→ ["foo", ""]
--multiple_string "foo;bar"
→ ["foo", "bar"]
--multiple_string "foo;;"
→ ["foo", "", ""]
Come to think of it, being able to distinguish between
[]
and[""]
could also be solved by introducing yet another keyword. For instance,NULL
,NONE
, orUNDEFINED
could map toNone
, andNIL
, orEMPTY_LIST
could map to[]
. In that case,""
would still map to[""]
and would thus result in fewer breaking change.Maybe I prefer
UNDEFINED
andEMPTY_LIST
better as it is more explicit.
I'm not in favour of adding another keyword. I think we will be introducing breaking changes anyway, and introducing fewer breaking changes does not warrant the complexity overhead of adding another keyword.
I'm not in favour of adding another keyword.
Yes, but by adding ""
→ []
and '""'
→ [""]
, you're effectively adding a keyword, only that the naming isn't very explicit :stuck_out_tongue:
All kidding aside, we could take your quote idea and extend it
--multiple_string UNDEFINED
→ None
--multiple_string ""
→ []
--multiple_string '""'
→ [""]
--multiple_string ";"
→ ["", ""]
--multiple_string ';"";'
→ ["", "", ""]
--multiple_string "foo"
→ ["foo"]
--multiple_string "foo;bar"
→ ["foo", "bar"]
--multiple_string "foo;;"
→ ["foo", "", ""]
This allows users to pass the 'UNDEFINED' string by escaping it:
--multiple_string '"UNDEFINED"'
→ ["UNDEFINED"]
Furthermore, we should then also allow escaping separator, quotes and slashes:
--multiple_string '"escape;sep"'
→ ["escape;sep"]
--multiple_string '"escape\"quote"'
→ ['escape"quote']
--multiple_string '"escape\\slash"'
→ ["escape\slash"]
However, what happens to characters which need to be escaped, when the string is not surrounded by quotes?
--multiple_string "escape\;sep"
→ ["escape;sep"]
--multiple_string '"escape\;sep"'
→ ["escape;sep"]
--multiple_string 'noneedto"quote'
→ ['noneedto"quote']
--multiple_string "escape\\slash"'
→ ["escape\slash"]
OHNOES! I came across two more edge cases
We need to be able to distinguish between None
and [None]
. with the proposal above, we can turn UNDEFINED
into None
and foo;UNDEFINED;bar
into ["foo", None, "bar"]
, but it's not possible to create a list only containing a missing value -- i.e. [None]
.
I think the only thing we can do is define another keyword MISSING
that specifically denotes a missing value in a list. Since UNDEFINED
and MISSING
sound a bit too similar, I propose renaming UNDEFINED
to NOT_SET
.
multiple: false
) also allowed to process MISSING values?In R, it might make sense interpret CLI values as follows:
--optional_arg NOT_SET
→ par <- list(optional_arg = NULL)
--optional_arg MISSING
→ par <- list(optional_arg = NA_character_)
However, in Python both cases would translate into {"optional_arg": None}
. The same is true for JavaScript.
One solution might be to allow MISSING
values only when multiple: true
, and NOT_SET
values only when required: false
(I already assumed the latter was already the case anyways).
I'm going to update my previous post with the latest updates.
There are multiple way of specifying arguments -- either from the CLI, in Groovy, or as a param_list csv / yaml.
Right now, passing arguments directly in Groovy or in a YAML param_list offers the greatest flexibility in distinguishing between things:
None
)[]
)[""]
)[None]
or ["a", "b", None, "c"]
)This is due to the command-line interface having to pass values as strings (e.g. --my_multiple_arg ""
) and the CLI not being able to distinguish between the various edge cases. The following proposal aims to change this by introducing a few keywords, quoting, and the ability to escape certain characters.
Let us take the following list of arguments:
name: mycomp
arguments:
- type: integer
name: --optional_integer
default: 42
required: false
- type: string
name: --multiple_string
required: false
multiple: true
allow_missing: true # new field added
(I could add all possible variants of required arguments, optional arguments, multiple arguments; for types string, integer, double, boolean, but it would be just more of the same).
Let's say we want our channel to contain three events:
[
[ "event0", [optional_integer: 42, multiple_string: ["a", null, "c"] ],
[ "event1", [optional_integer: 44, multiple_string: null] ],
[ "event2", [optional_integer: null, multiple_string: [] ],
[ "event3", [optional_integer: 42, multiple_string: [null]] ]
]
In Nextflow / Groovy, this is already possible. We can just create a channel with exactly these parameters and presto. However, to ensure that the default value of optional_integer
is not used, we need to explicitly pass null
to this argument.
Channel.fromList([
// optional_integer can be omitted, since it's the default
[ "event0", [multiple_string: ["a", null, "c"] ],
// multiple_string can be omitted, since the default is already what we want it to be
[ "event1", [optional_integer: 44] ],
// explicitly unset `optional_integer`, and explicitly set `multiple_string` to an empty list
[ "event2", [optional_integer: null, multiple_string: [] ],
// optional_integer can be omitted
[ "event3", [multiple_string: [null]] ]
])
| mycomp.run(
fromState: ["optional_integer", "multiple_string"],
toState: [...]
)
All Viash workflows can also ingest a parameter list (csv, json, or yaml) using the param_list
argument. For instance:
./params.yaml
:
param_list:
- id: event0
multiple_string: ["a", null, "c"]
- id: event1
optional_integer: 44
- id: event2
optional_integer: null
multiple_string: []
- id: event3
multiple_string: [null]
and then running the component with nextflow run . -main-script target/nextflow/mycomp/main.nf -params-file ./params.yaml
should yield exactly the same result.
However, using a csv based parameter list causes issues here, because we can define the following csv:
id,optional_integer,multiple_string
event0,42,a;MISSING;c
event1,44,NOT_SET
event2,NOT_SET,
event3,42,MISSING
This introduces different notations.
We should be able to distinguish between an empty list ([]
) or a list of an empty string ([""]
). We can do so by allowing the user to quote the empty string:
` →
[]`""
→ [""]
We should be able to unset an argument (e.g. when it is optional but it already has a default). We can add a reserved keyword (NOT_SET
) for it, but the user can also quote it to pass the original string.
NOT_SET
→ None
"NOT_SET"
→ ["NOT_SET"]
We should be able to pass values containing the separator. Proposal:
;
→ ["", ""]
";"
→ [";"]
';"";'
→ ["", "", ""]
We should also be able to pass strings containing quotes. Therefore, we should be able to escape quotes (\"
), and therefore also need to be able to escape the escape character (\\
). While we're at it, we should then also be able to escape separators (\;
). For CSV files, we should also be able to escape the csv separator (\,
)
"quote"
→ ["quote"]
\"noquote\"
→ ['"quote"']
escape\\slash
→ ["escape\slash"]
noneedto"quote
→ ['noneedto"quote']
escape\;sep
→ ["escape;sep"]
escape\,sep
→ ["escape,sep"]
Note that other combinations of escaped characters (e.g. \n
) should remain unchanged.
When calling an executable from the CLI, we can apply the same parsing as with the CSV, except the user can also call the same argument multiple times. Specifically, the following should be valid:
# event 0
target/executable/mycomp/mycomp \
--multiple_string a --multiple_string MISSING --multiple_string c
# event 1
target/executable/mycomp/mycomp \
--optional_integer 44
# event 2
target/executable/mycomp/mycomp \
--optional_integer NOT_SET \
--multiple_string ""
# event 3
target/executable/mycomp/mycomp \
--multiple_string MISSING
However, the following should also be valid
# event 0 alternative
target/executable/mycomp/mycomp \
--multiple_string "a;MISSING;c"
# event 0 yet another alternative
target/executable/mycomp/mycomp \
--multiple_string "a" --multiple_string "MISSING;c"
And the latter should also work using the Nextflow CLI:
# event 0
nextflow run . -main-script target/nextflow/mycomp/main.nf \
--multiple_string "a;MISSING;c"
# event 0
par <- list(optional_integer = 42L, multiple_string = c("a", NA_character_, "c"))
# event 1
par <- list(optional_integer = 44L, multiple_string = NULL) # breaking change
# event 2
par <- list(optional_integer = NULL, multiple_string = character(0))
# event 3
par <- list(optional_integer = 42L, multiple_string = c(NA_character_))
# event 0
par = {"optional_integer": 42, "multiple_string": ["a", None, "c"]}
# event 1
par = {"optional_integer" : 44, "multiple_string": None} # breaking change
# event 2
par = {"optional_integer": None, "multiple_string": []} # breaking change
# event 3
par = {"optional_integer": 42, "multiple_string": [None]}
// event 0
const par = {"optional_integer": 42, "multiple_string": ["a", undefined, "c"]}
// event 1
const par = {"optional_integer" : 44, "multiple_string": undefined} // breaking change
// event 2
const par = {"optional_integer": undefined, "multiple_string": []} // breaking change
// event 3
const par = {"optional_integer": 42, "multiple_string": [undefined]}
// event 0
var par = new { optional_integer = 42, multiple_string = new string[] { "a", null, "c" } };
// event 1
var par = new { optional_integer = 44, multiple_string = (string[])null };
// event 2
var par = new { optional_integer = (int?)null, multiple_string = new string[0] };
// event 3
var par = new { optional_integer = 42, multiple_string = new string[] { null } };
case class Par (
optional_integer: Some[Int],
multiple_string: Some[List[String]] // breaking change
)
// event 0
val par = Par(optional_integer = Some(42), multiple_string = Some(List("a", null, "c")))
// event 1
val par = Par(optional_integer = Some(44), multiple_string = None)
// event 2
val par = Par(optional_integer = None, multiple_string = Some(List()))
// event 3
val par = Par(optional_integer = Some(42), multiple_string = Some(List(null)))
# event0
par_optional_integer=42
par_multiple_string="a;MISSING;c"
# event1
par_optional_integer=44
# par_multiple_string is unset
# event2
# par_optional_integer is unset
par_multiple_string=""
# event3
par_optional_integer=42
par_multiple_string="MISSING"
The following set of changes are proposed.
NOT_SET
and MISSING
are now reserved keywords for arguments when using a component from the CLI. Use quotes to pass the actual value to a component:
--optional_integer 'NOT_SET'
→ None
--optional_string '"NOT_SET"'
→ "NOT_SET"
--optional_integer MISSING
→ None
--multiple_string 'NOT_SET'
→ None
--multiple_string '"NOT_SET"'
→ ["NOT_SET"]
--multiple_string MISSING
→ [None]
When using a Python-based or JavaScript-based component from the CLI, passing an empty string to an argument with multiple: true
now turns into an empty list. Pass a quoted string to get an empty list:
--multiple_string ""
→ []
--multiple_string '""'
→ [""]
When using the CLI, any values starting and ending with a quote will be considered a single string and be stripped of their quotes:
--multiple_string 'foo;bar;baz'
→ ["foo", "bar", "baz"]
--multiple_string '"foo;bar;baz"'
→ ["foo;bar;baz"]
When an argument with multiple: true
is not passed a value, the resulting value is None
and not an empty list. With --multiple NOT_SET
:
par <- list(multiple = character(0))
. Now: par <- list(multiple = NULL)
par = {"multiple": []}
. Now: par = {"multiple": None}
const par = {"multiple": []}
. Now: const par = {"multiple": undefined}
par_multiple=""
. Now: # par_multiple is not set
.val par = Par(multiple = Nil)
. Now: val par = Par(multiple = None)
.Allow distinguishing between an empty list ([]
) or a list of an empty string ([""]
). This can be achieved by quoting an empty string:
--my_multiple_arg ""
→ []
--my_multiple_arg '""'
→ [""]
Allow unsetting an argument (e.g. when it is optional but it already has a default) by introducing a reserved keyword (NOT_SET
) for it. Use quotes to reproduce the original behaviour:
--my_multiple_arg NOT_SET
→ None
--my_multiple_arg '"NOT_SET"'
→ ["NOT_SET"]
Allow passing missing values by introducing a reserved keyword (MISSING
) for it. Use quotes to reproduce the original behaviour:
--my_multiple_arg MISSING
→ [None]
--my_multiple_arg '"MISSING"' →
["MISSING"]`--my_multiple_arg 'a;MISSING;c' →
["a", None, "c"]`When using the CLI, allow passing values containing the separator:
--my_multiple_arg ';'
→ ["", ""]
--my_multiple_arg '";"'
→ [";"]
--my_multiple_arg '";"";'
→ ["", "", ""]
When using the CLI, allow passing strings containing quotes. You can escape quotes (\"
), separators (\;
), and the escape character (\\
).:
--my_multiple_arg '"quote"'
→ ["quote"]
--my_multiple_arg '\"noquote\"'
→ ['"quote"']
--my_multiple_arg 'escape\\slash'
→ ["escape\slash"]
--my_multiple_arg 'noneedto"quote'
→ ['noneedto"quote']
--my_multiple_arg 'escape\;sep'
→ ["escape;sep"]
When using a CSV as a param_list, allow escape the comma separator using quotes (","
) or backslashes (\,
). Example:
arg1,arg2
foo,bar
"fo,o",bar
fo\,o,bar
I would personally prefer --multiple_string ""
→ [""]
rather than []
because intuitively you only provide 1 (empty) argument in the list IMO. I understand though that this causes inconsistencies elsewhere.
A question: when you pass --single_opt 'foo'
as a parameter on the shell, does the shell not automatically strip the '
? Or in other words, are we able to distinguish between, e.g. 'MISSING'
, "MISSING"
and MISSING
?
Why would we use NOT_SET
as a keyword and not NA
which to me is more intuitive?
Am I right in thinking that MISSING
for an entry in a multiple element is the same as NOT_SET
for the element itselft? Both translate to None
or am I wrong? Could this be simplified? I guess not because in the examples above both point to different values.
I'm afraid this will be hard to remember. But then again, we we are considering edge cases I guess?
One suggestion I could make is to use something like NOT_SET
for the main argument and NOT_SET_ITEM
for the a list item [^1].
[^1]: But then use NA
instead of NOT_SET
🤷
I would personally prefer
--multiple_string ""
→[""]
rather than[]
because intuitively you only provide 1 (empty) argument in the list IMO. I understand though that this causes inconsistencies elsewhere.
From the perspective of the shell, I find you example counterintuitive because:
$ echo '""'
""
$ echo ''
$ echo ""
Also, a second issue arises from your poposed syntax: how to specify an empty list?
A question: when you pass
--single_opt 'foo'
as a parameter on the shell, does the shell not automatically strip the'
? Or in other words, are we able to distinguish between, e.g.'MISSING'
,"MISSING"
andMISSING
?
It does (see examples above). As an extra note, you can pass whatever you would like given the proper escaping
$ echo "\"MISSING"\"
"MISSING
Why would we use
NOT_SET
as a keyword and notNA
which to me is more intuitive? Am I right in thinking thatMISSING
for an entry in a multiple element is the same asNOT_SET
for the element itselft? Both translate toNone
or am I wrong? Could this be simplified? I guess not because in the examples above both point to different values. I'm afraid this will be hard to remember. But then again, we we are considering edge cases I guess? One suggestion I could make is to use something likeNOT_SET
for the main argument andNOT_SET_ITEM
for the a list item 1.Footnotes
1. But then use `NA` instead of `NOT_SET` 🤷 [↩](#user-content-fnref-1-fe46bbd51c562bc33419e0e2e876499a)
The reason why we need two values is because of an edge case Robrecht described above: we need to be able to distinguish between None
and [None]
.
Given only one value
--multiple_string SOME_SPECIAL_VALUE
→ None
--multiple_string ""
→ []
--multiple_string '""'
→ [""]
--multiple_string "SOME_SPECIAL_VALUE"
→ ["SOME_SPECIAL_VALUE"]
There is no real way to define [None]
.
Given only one value
--multiple_string SOME_SPECIAL_VALUE
→None
--multiple_string ""
→[]
--multiple_string '""'
→[""]
--multiple_string "SOME_SPECIAL_VALUE"
→["SOME_SPECIAL_VALUE"]
There is no real way to define
[None]
.
Hence my suggestion:
--multiple_string NOT_SET
→ None
--multiple_string NOT_SET_ITEM
→ [None]
For me, the _ITEM
addition makes clear an item in the list is missing rather than the list itself. I find that having that connotation makes it clearer than MISSING
.
I was at first strongly opposed to using the same name (NOT_SET
and NOT_SET_ITEM
) for, what felt like, two very distinct things. However, the more that I thought about how to formulate how I feel that the two concepts are different, the more I realised that this is probably just in my head.
I don't really like NOT_SET_ITEM
because it's quite a mouthful.
I propose we borrowing from terminology used in other programming languages. For example:
NONE
and NONE_ITEM
(or ITEM_NONE
)UNDEFINED
and UNDEFINED_ITEM
(or ITEM_UNDEFINED
)UNSET
and UNSET_ITEM
(or ITEM_UNSET
)NULL
and NULL_ITEM
(or ITEM_NULL
)Taking into account the popularity of each of these languages, I guess NONE
and NULL
are the best options. Whether we make ITEM
(or ELEMENT
?) a prefix or a suffix is another question.
Happy to pick this up during a frodo to discuss!
UNDEFINED
en UNDEFINED_ITEM
!
Final proposal:
I'm going to update my previous post with the latest updates.
There are multiple way of specifying arguments -- either from the CLI, in Groovy, or as a param_list csv / yaml.
Right now, passing arguments directly in Groovy or in a YAML param_list offers the greatest flexibility in distinguishing between things:
None
)[]
)[""]
)[None]
or ["a", "b", None, "c"]
)This is due to the command-line interface having to pass values as strings (e.g. --my_multiple_arg ""
) and the CLI not being able to distinguish between the various edge cases. The following proposal aims to change this by introducing a few keywords, quoting, and the ability to escape certain characters.
Let us take the following list of arguments:
name: mycomp
arguments:
- type: integer
name: --optional_integer
default: 42
required: false
- type: string
name: --multiple_string
required: false
multiple: true
allow_UNDEFINED_ITEM: true # new field added
(I could add all possible variants of required arguments, optional arguments, multiple arguments; for types string, integer, double, boolean, but it would be just more of the same).
Let's say we want our channel to contain three events:
[
[ "event0", [optional_integer: 42, multiple_string: ["a", null, "c"] ],
[ "event1", [optional_integer: 44, multiple_string: null] ],
[ "event2", [optional_integer: null, multiple_string: [] ],
[ "event3", [optional_integer: 42, multiple_string: [null]] ]
]
In Nextflow / Groovy, this is already possible. We can just create a channel with exactly these parameters and presto. However, to ensure that the default value of optional_integer
is not used, we need to explicitly pass null
to this argument.
Channel.fromList([
// optional_integer can be omitted, since it's the default
[ "event0", [multiple_string: ["a", null, "c"] ],
// multiple_string can be omitted, since the default is already what we want it to be
[ "event1", [optional_integer: 44] ],
// explicitly unset `optional_integer`, and explicitly set `multiple_string` to an empty list
[ "event2", [optional_integer: null, multiple_string: [] ],
// optional_integer can be omitted
[ "event3", [multiple_string: [null]] ]
])
| mycomp.run(
fromState: ["optional_integer", "multiple_string"],
toState: [...]
)
All Viash workflows can also ingest a parameter list (csv, json, or yaml) using the param_list
argument. For instance:
./params.yaml
:
param_list:
- id: event0
multiple_string: ["a", null, "c"]
- id: event1
optional_integer: 44
- id: event2
optional_integer: null
multiple_string: []
- id: event3
multiple_string: [null]
and then running the component with nextflow run . -main-script target/nextflow/mycomp/main.nf -params-file ./params.yaml
should yield exactly the same result.
However, using a csv based parameter list causes issues here, because we can define the following csv:
id,optional_integer,multiple_string
event0,42,a;UNDEFINED_ITEM;c
event1,44,UNDEFINED
event2,UNDEFINED,
event3,42,UNDEFINED_ITEM
This introduces different notations.
We should be able to distinguish between an empty list ([]
) or a list of an empty string ([""]
). We can do so by allowing the user to quote the empty string:
` →
[]`""
→ [""]
We should be able to unset an argument (e.g. when it is optional but it already has a default). We can add a reserved keyword (UNDEFINED
) for it, but the user can also quote it to pass the original string.
UNDEFINED
→ None
"UNDEFINED"
→ ["UNDEFINED"]
We should be able to pass values containing the separator. Proposal:
;
→ ["", ""]
";"
→ [";"]
';"";'
→ ["", "", ""]
We should also be able to pass strings containing quotes. Therefore, we should be able to escape quotes (\"
), and therefore also need to be able to escape the escape character (\\
). While we're at it, we should then also be able to escape separators (\;
). For CSV files, we should also be able to escape the csv separator (\,
)
"quote"
→ ["quote"]
\"noquote\"
→ ['"quote"']
escape\\slash
→ ["escape\slash"]
noneedto"quote
→ ['noneedto"quote']
escape\;sep
→ ["escape;sep"]
escape\,sep
→ ["escape,sep"]
Note that other combinations of escaped characters (e.g. \n
) should remain unchanged.
When calling an executable from the CLI, we can apply the same parsing as with the CSV, except the user can also call the same argument multiple times.
Note that the examples below only show target/executable/mycomp/mycomp
, but the same also applies when running nextflow components via the nextflow CLI.
Specifically, the following should be valid:
# event 0
target/executable/mycomp/mycomp \
--multiple_string a --multiple_string UNDEFINED_ITEM --multiple_string c
# event 1
target/executable/mycomp/mycomp \
--optional_integer 44
# event 2
target/executable/mycomp/mycomp \
--optional_integer UNDEFINED \
--multiple_string ""
# event 3
target/executable/mycomp/mycomp \
--multiple_string UNDEFINED_ITEM
However, the following should also be valid
# event 0 alternative
target/executable/mycomp/mycomp \
--multiple_string "a;UNDEFINED_ITEM;c"
# event 0 yet another alternative
target/executable/mycomp/mycomp \
--multiple_string "a" --multiple_string "UNDEFINED_ITEM;c"
And the latter should also work using the Nextflow CLI:
# event 0
nextflow run . -main-script target/nextflow/mycomp/main.nf \
--multiple_string "a;UNDEFINED_ITEM;c"
# event 0
par <- list(optional_integer = 42L, multiple_string = c("a", NA_character_, "c"))
# event 1
par <- list(optional_integer = 44L, multiple_string = NULL) # breaking change
# event 2
par <- list(optional_integer = NULL, multiple_string = character(0))
# event 3
par <- list(optional_integer = 42L, multiple_string = c(NA_character_))
# event 0
par = {"optional_integer": 42, "multiple_string": ["a", None, "c"]}
# event 1
par = {"optional_integer" : 44, "multiple_string": None} # breaking change
# event 2
par = {"optional_integer": None, "multiple_string": []} # breaking change
# event 3
par = {"optional_integer": 42, "multiple_string": [None]}
// event 0
const par = {"optional_integer": 42, "multiple_string": ["a", undefined, "c"]}
// event 1
const par = {"optional_integer" : 44, "multiple_string": undefined} // breaking change
// event 2
const par = {"optional_integer": undefined, "multiple_string": []} // breaking change
// event 3
const par = {"optional_integer": 42, "multiple_string": [undefined]}
// event 0
var par = new { optional_integer = 42, multiple_string = new string[] { "a", null, "c" } };
// event 1
var par = new { optional_integer = 44, multiple_string = (string[])null };
// event 2
var par = new { optional_integer = (int?)null, multiple_string = new string[0] };
// event 3
var par = new { optional_integer = 42, multiple_string = new string[] { null } };
case class Par (
optional_integer: Some[Int],
multiple_string: Some[List[String]] // breaking change
)
// event 0
val par = Par(optional_integer = Some(42), multiple_string = Some(List("a", null, "c")))
// event 1
val par = Par(optional_integer = Some(44), multiple_string = None)
// event 2
val par = Par(optional_integer = None, multiple_string = Some(List()))
// event 3
val par = Par(optional_integer = Some(42), multiple_string = Some(List(null)))
# event0
par_optional_integer=42
par_multiple_string="a;UNDEFINED_ITEM;c"
# event1
par_optional_integer=44
# par_multiple_string is unset
# event2
# par_optional_integer is unset
par_multiple_string=""
# event3
par_optional_integer=42
par_multiple_string="UNDEFINED_ITEM"
The following set of changes are proposed.
UNDEFINED
and UNDEFINED_ITEM
are now reserved keywords for arguments when using a component from the CLI. Use quotes to pass the actual value to a component:
--optional_integer 'UNDEFINED'
→ None
--optional_string '"UNDEFINED"'
→ "UNDEFINED"
--optional_integer UNDEFINED_ITEM
→ None
--multiple_string 'UNDEFINED'
→ None
--multiple_string '"UNDEFINED"'
→ ["UNDEFINED"]
--multiple_string UNDEFINED_ITEM
→ [None]
When using a Python-based or JavaScript-based component from the CLI, passing an empty string to an argument with multiple: true
now turns into an empty list. Pass a quoted string to get an empty list:
--multiple_string ""
→ []
--multiple_string '""'
→ [""]
When using the CLI, any values starting and ending with a quote will be considered a single string and be stripped of their quotes:
--multiple_string 'foo;bar;baz'
→ ["foo", "bar", "baz"]
--multiple_string '"foo;bar;baz"'
→ ["foo;bar;baz"]
When an argument with multiple: true
is not passed a value, the resulting value is None
and not an empty list. With --multiple UNDEFINED
:
par <- list(multiple = character(0))
. Now: par <- list(multiple = NULL)
par = {"multiple": []}
. Now: par = {"multiple": None}
const par = {"multiple": []}
. Now: const par = {"multiple": undefined}
par_multiple=""
. Now: # par_multiple is not set
.val par = Par(multiple = Nil)
. Now: val par = Par(multiple = None)
.Allow distinguishing between an empty list ([]
) or a list of an empty string ([""]
). This can be achieved by quoting an empty string:
--my_multiple_arg ""
→ []
--my_multiple_arg '""'
→ [""]
Allow unsetting an argument (e.g. when it is optional but it already has a default) by introducing a reserved keyword (UNDEFINED
) for it. Use quotes to reproduce the original behaviour:
--my_multiple_arg UNDEFINED
→ None
--my_multiple_arg '"UNDEFINED"'
→ ["UNDEFINED"]
Allow passing UNDEFINED_ITEM values by introducing a reserved keyword (UNDEFINED_ITEM
) for it. Use quotes to reproduce the original behaviour:
--my_multiple_arg UNDEFINED_ITEM
→ [None]
--my_multiple_arg '"UNDEFINED_ITEM"'
→ ["UNDEFINED_ITEM"]
--my_multiple_arg 'a;UNDEFINED_ITEM;c'
→ ["a", None, "c"]
When using the CLI, allow passing values containing the separator:
--my_multiple_arg ';'
→ ["", ""]
--my_multiple_arg '";"'
→ [";"]
--my_multiple_arg '";"";'
→ ["", "", ""]
When using the CLI, allow passing strings containing quotes. You can escape quotes (\"
), separators (\;
), and the escape character (\\
).:
--my_multiple_arg '"quote"'
→ ["quote"]
--my_multiple_arg '\"noquote\"'
→ ['"quote"']
--my_multiple_arg 'escape\\slash'
→ ["escape\slash"]
--my_multiple_arg 'noneedto"quote'
→ ['noneedto"quote']
--my_multiple_arg 'escape\;sep'
→ ["escape;sep"]
When using a CSV as a param_list, allow escape the comma separator using quotes (","
) or backslashes (\,
). Example:
arg1,arg2
foo,bar
"fo,o",bar
fo\,o,bar
Feature summary
Not sure if this is a bug or a feature, since I don't know what your expected behaviour is for this :stuck_out_tongue:
Passing
PARAMETER=1:2:null:4
to an argument that accepts multiple integer values should not fail parameter validationFeature description
Passing a
null
value as part of an argument that accepts multiple integer values, in the format with the multiple separator (e.g1:2:null:4
), the validation for the parameters currently fails with following error message:Error in module '<MODULE>' id '<ID>': input argument '<ARGUMENT>' has the wrong type. Expected type: List[Integer]. Found type: List[java.lang.String]
This causes issues if you have 2 parameters that both accept multiple values that rely on eachother. I will try to explain myself better with an example.
Example
Imagine following parameters:
If you have 3 features called A,B and C and you want to provide cutoff values for all of them, but for some features you would like to skip a certain cutoff (can be done by passing
null
), it should be possible to pass the arguments as such:... --features=A:B:C --cutoff_low=4:null:3 --cutoff_high=100:400:null
The order and amount of values for each argument is important, since they are linked to each other. Skipping the
null
values is therefor not possible.My example shows this in a CLI context, but my usecase is more in a Nextflow workflow. The principles remain the same tho...
Why is this feature beneficial?
I don't know if this feature would be widely usable, or if it is just for my usecase...
However, I still think that this should be valid since an optional argument of type integer (or any type for that matter) does accept
null
. Making that argument accept multiple (integer) values, shouldn't change that behaviour IMO.Alternatives considered
I could try to change my script to accept the parameters in a different format, but currently don't have any ideas to work around it. Suggestions are welcome :smile:
Possible solution
The fix seems rather simple to me. If the value is
"null"
(null as string), convert to an actualnull
and run the normal parameter validation. I however am not sure if this could/would cause issues somewhere else...Confirmation