viash-io / viash

script + metadata = standalone component
https://viash.io
GNU General Public License v3.0
39 stars 2 forks source link

[FEATURE] allow `null` in integer format when `multiple: true` #705

Open kverstae opened 4 months ago

kverstae commented 4 months ago

Feature summary

Not sure if this is a bug or a feature, since I don't know what your expected behaviour is for this :stuck_out_tongue:

Passing PARAMETER=1:2:null:4 to an argument that accepts multiple integer values should not fail parameter validation

Feature description

Passing a null value as part of an argument that accepts multiple integer values, in the format with the multiple separator (e.g 1:2:null:4), the validation for the parameters currently fails with following error message:

Error in module '<MODULE>' id '<ID>': input argument '<ARGUMENT>' has the wrong type. Expected type: List[Integer]. Found type: List[java.lang.String]

This causes issues if you have 2 parameters that both accept multiple values that rely on eachother. I will try to explain myself better with an example.

Example

Imagine following parameters:

- name: "--features"
  type: string
  multiple: true
- name: "--cutoff_low"
  type: integer
  multiple: true
- name: "--cutoff_high"
   type: integer
   multiple: true

If you have 3 features called A,B and C and you want to provide cutoff values for all of them, but for some features you would like to skip a certain cutoff (can be done by passing null), it should be possible to pass the arguments as such:

... --features=A:B:C --cutoff_low=4:null:3 --cutoff_high=100:400:null

The order and amount of values for each argument is important, since they are linked to each other. Skipping the null values is therefor not possible.

My example shows this in a CLI context, but my usecase is more in a Nextflow workflow. The principles remain the same tho...

Why is this feature beneficial?

I don't know if this feature would be widely usable, or if it is just for my usecase...

However, I still think that this should be valid since an optional argument of type integer (or any type for that matter) does accept null. Making that argument accept multiple (integer) values, shouldn't change that behaviour IMO.

Alternatives considered

I could try to change my script to accept the parameters in a different format, but currently don't have any ideas to work around it. Suggestions are welcome :smile:

Possible solution

The fix seems rather simple to me. If the value is "null" (null as string), convert to an actual null and run the normal parameter validation. I however am not sure if this could/would cause issues somewhere else...

Confirmation

DriesSchaumont commented 4 months ago

Hi @kverstae, thanks for opening this issue! Just for the sake of convenience, I wanted to mention that issue #619 is closely related to this one.

rcannood commented 3 months ago

Hey @kverstae ! Thanks for creating this issue!

I've been thinking about this for a bit, specifically how to implement such functionality without rewriting everything from scratch and introducing very different behaviours and functionality.

Since values are passed in a delimiter-based way (e g. Delimiter is ;), it's important to be able to distinguish between empty lists and lists with an "empty" value. As such, I would propose the following:

We'd have to look into how each value gets translated into each of the different languages. In a follow-up post, we should outline a cross comparison of different examples and how it gets parsed into each of the different languages so we can build a unit test out of it.

Would that satisfy your needs @kverstae ?

kverstae commented 3 months ago

Thanks for taking a look at this @rcannood

Your proposal would indeed satisfy my needs.

I personally am not a big fan of the trailing delimiter. If I would just look at 1;;3;4;, I would interpret this as [1, NA, 3, 4, NA], since this is also how 'normal' delimited files (csv/tsv/...) work. This, however is just a small detail and I wouldn't mind if it was needed.

For strings, you could argue that when multiple is true and you get something like foo;;bar it would make sense to translate it into ["foo", NA, "bar"] to be in line with integers/booleans/doubles. This would indeed break backwards compatibility, but could maybe be 'fixed' by defining a new parameter in _viash.yml to specify how empty string values should be parsed: "" or NA/null/...

rcannood commented 3 months ago

Come to think of it, my proposal would cause an issue. Let me set up a 'test bench' to clarify.


Example

There are multiple way of specifying arguments -- either from the CLI, in Groovy, or as a param_list csv / yaml.

Arguments

Let us take the following list of arguments:

name: mycomp
arguments:
  - type: integer
    name: --optional_integer
    default: 42
    required: false
  - type: string
    name: --multiple_string
    required: false
    multiple: true
    allow_missing: true # new field added

(I could add all possible variants of required arguments, optional arguments, multiple arguments; for types string, integer, double, boolean, but it would be just more of the same).

Desired parameter sets

Let's say we want our channel to contain three events:

[
  [ "event0", [optional_integer: 42, multiple_string: ["a", "b", "c"] ],
  [ "event1", [optional_integer: 44] ],
  [ "event2", [multiple_string: [] ],
  [ "event3", [optional_integer: 42, multiple_string: ["b", null]] ]
]

Parameter ingestion variations

In Nextflow / Groovy

In Nextflow / Groovy, this is already possible. We can just create a channel with exactly these parameters and presto. However, to ensure that the default value of optional_integer is not used, we need to explicitly pass null to this argument.

Channel.fromList([
  // optional_integer is omitted, since it's the default
  [ "event0", [multiple_string: ["a", "b", "c"] ],
  // multiple_string is omitted, since the default is already what we want it to be
  [ "event1", [optional_integer: 44] ],
  // explicitly unset `optional_integer`, and explicitly set `multiple_string` to an empty list
  [ "event2", [optional_integer: null, multiple_string: [] ],
  // straight forward
  [ "event3", [multiple_string: ["b", null]] ]
])
  | mycomp.run(
    fromState: ["optional_integer", "multiple_string"],
    toState: [...]
  )

As a yaml param list

All Viash workflows can also ingest a parameter list (csv, json, or yaml) using the param_list argument. For instance:

./params.yaml:

param_list:
  - id: event0
    multiple_string: ["a", "b", "c"]
  - id: event1
    optional_integer: 44
  - id: event2
    optional_integer: null
    multiple_string: []
  - id: event3
    multiple_string: ["b", null]

and then running the component with nextflow run . -main-script target/nextflow/mycomp/main.nf -params-file ./params.yaml should yield exactly the same result.

As a csv param list

However, using a csv based parameter list causes issues here, because we can define the following csv:

id,optional_integer,multiple_string
event0,42,a;b;c
event1,44,NULL
event2,NULL,
event3,42,b;NULL

However, there are three issues with this:

  1. There is currently no way to explicitly unset an argument (Related to #619). Let's add a reserved keyword like NULL, MISSING, NO_VALUE, UNSET.

  2. As-is, it's not possible to distinguish between whether the value for multiple_string in event2 represents [] or [""]. To this extend, I would propose the separator notation discussed earlier:

    • ` →[]`
    • ;[""]
    • foo;["foo"]: separator just denotes the end of an element
    • foo;bar["foo", "bar"]: separator at the end is implicit
    • foo;;["foo", ""]: extra separator denotes an empty string in the second value

When it comes to unsetting values:

As a CLI

When calling an executable from the CLI, we can apply the same parsing as with the CSV, except the user can also call the same argument multiple times. Specifically, the following should be valid:

param_list:
  - id: event0
    multiple_string: ["a", "b", "c"]
  - id: event1
    optional_integer: 44
  - id: event2
    optional_integer: null
    multiple_string: []
  - id: event3
    multiple_string: ["b", null]
# event 0
target/executable/mycomp/mycomp \
  --multiple_string a --multiple_string b --multiple_string c

# event 1
target/executable/mycomp/mycomp \
  --optional_integer 44

# event 2
target/executable/mycomp/mycomp \
  --optional_integer "NULL" \
  --multiple_string ";"

# event 3
target/executable/mycomp/mycomp \
  --multiple_string "b" --multiple_string "NULL"

However, the following should also be valid

# event 0
target/executable/mycomp/mycomp \
  --multiple_string "a;b;c"

# event 1
target/executable/mycomp/mycomp \
  --optional_integer 44

# event 2
target/executable/mycomp/mycomp \
  --optional_integer "NULL" \
  --multiple_string ";"

# event 3
target/executable/mycomp/mycomp \
  --multiple_string "b;NULL"

And the latter should also work using the Nextflow CLI:

# event 1
nextflow run . -main-script target/nextflow/mycomp.main.nf \
  --multiple_string "a;b;c"

# etc...

How it's interpreted in the scripts

R

# event0
par <- list(optional_integer = 42L, multiple_string = c("a", "b", "c"))

# event 1
par <- list(optional_integer = 44L, multiple_string = NULL) # breaking change

# event 2
par <- list(optional_integer = NULL, multiple_string = character(0))

# event 3
par <- list(optional_integer = 42L, multiple_string = c("b", NA_character_))

Python

# event0
par = {"optional_integer": 42, multiple_string: ["a", "b", "c"]}

# event 1
par = {"optional_integer" : 44, multiple_string: None} # breaking change

# event 2
par = {"optional_integer": None, multiple_string: []}

# event 3
par = {"optional_integer": 42, multiple_string: ["b", None]}

JavaScript

// todo

C

// todo

Scala

// todo

Bash

# event0
par_optional_integer=42
par_multiple_string="a;b;c;"  # breaking change?

# event1
par_optional_integer=44
# par_multiple_string is unset

# event2
# par_optional_string is unset
par_multiple_string=""

# event3
par_optional_string=42
par_multiple_string="b;NULL;"

Though I would strongly recommend doing advanced processing with nullable multiple arguments in Bash :sweat_smile:


Conclusion

In conclusion, I propose the following two extensions:

New functionality

Breaking changes

@DriesSchaumont Do you have any comments regarding the usability of this? And the proposed breaking changes? @Grifs Are there any technical issues I'm overlooking?

DriesSchaumont commented 3 months ago

Just two questions to clarify:

  1. If I understood correctly, in your last proposal there will be no more differences in functionality between the different argument type with respect to the behavior of the 'null' value, correct (as opposed to the first proposal in https://github.com/viash-io/viash/issues/705#issuecomment-2165716362)?
  2. Could you clarify what the case --multiple_string '' would do in your proposal? Does this equal `` → []?

A general remark regarding making breaking changes: I think its OK on the condition that we introduce a viash config value to override the 'not set' value (as suggested in https://github.com/viash-io/viash/issues/619#issuecomment-2124576166) because this would allow setting this config value to avoid making changes to the scripts. In 0.9.0 we will introduce breaking changes to the config already, but if we can make sure that the breaking changes can be resolved by adjusting configs only I think thats a better situation than needing to adjust both scripts and configs.

rcannood commented 3 months ago
  1. If I understood correctly, in your last proposal there will be no more differences in functionality between the different argument type with respect to the behavior of the 'null' value, correct (as opposed to the first proposal in [FEATURE] allow null in integer format when multiple: true #705

Indeed!


  1. Could you clarify what the case --multiple_string '' would do in your proposal? Does this equal `` → []?

With the latest proposal:


Come to think of it, being able to distinguish between [] and [""] could also be solved by introducing yet another keyword. For instance, NULL, NONE, or UNDEFINED could map to None, and NIL, or EMPTY_LIST could map to []. In that case, "" would still map to [""] and would thus result in fewer breaking change.

Maybe I prefer UNDEFINED and EMPTY_LIST better as it is more explicit.

DriesSchaumont commented 3 months ago
  1. If I understood correctly, in your last proposal there will be no more differences in functionality between the different argument type with respect to the behavior of the 'null' value, correct (as opposed to the first proposal in [FEATURE] allow null in integer format when multiple: true [FEATURE] allow null in integer format when multiple: true #705

Indeed!

  1. Could you clarify what the case --multiple_string '' would do in your proposal? Does this equal `` → []?

With the latest proposal:

* `--multiple_string NULL` → `None`

* `--multiple_string ""` →  `[]`

* `--multiple_string ";"` → `[""]`

* `--multiple_string "foo;"` → `["foo"]`: separator just denotes the end of an element

* `--multiple_string "foo;bar"` → `["foo", "bar"]`: separator at the end is implicit

* `--multiple_string "foo;;"` → `["foo", ""]`: extra separator denotes an empty string in the second value

I concur with the remark from @kverstae that the use of the separator is counterintuitive here. Am I overlooking something when I say that this could also work?

Come to think of it, being able to distinguish between [] and [""] could also be solved by introducing yet another keyword. For instance, NULL, NONE, or UNDEFINED could map to None, and NIL, or EMPTY_LIST could map to []. In that case, "" would still map to [""] and would thus result in fewer breaking change.

Maybe I prefer UNDEFINED and EMPTY_LIST better as it is more explicit.

I'm not in favour of adding another keyword. I think we will be introducing breaking changes anyway, and introducing fewer breaking changes does not warrant the complexity overhead of adding another keyword.

rcannood commented 3 months ago

I'm not in favour of adding another keyword.

Yes, but by adding ""[] and '""'[""], you're effectively adding a keyword, only that the naming isn't very explicit :stuck_out_tongue:

All kidding aside, we could take your quote idea and extend it

This allows users to pass the 'UNDEFINED' string by escaping it:

Furthermore, we should then also allow escaping separator, quotes and slashes:

However, what happens to characters which need to be escaped, when the string is not surrounded by quotes?

rcannood commented 3 months ago

OHNOES! I came across two more edge cases

Edge case: List of missing value

We need to be able to distinguish between None and [None]. with the proposal above, we can turn UNDEFINED into None and foo;UNDEFINED;bar into ["foo", None, "bar"], but it's not possible to create a list only containing a missing value -- i.e. [None].

I think the only thing we can do is define another keyword MISSING that specifically denotes a missing value in a list. Since UNDEFINED and MISSING sound a bit too similar, I propose renaming UNDEFINED to NOT_SET.

Edge case: are optional arguments (with multiple: false) also allowed to process MISSING values?

In R, it might make sense interpret CLI values as follows:

However, in Python both cases would translate into {"optional_arg": None}. The same is true for JavaScript.

One solution might be to allow MISSING values only when multiple: true, and NOT_SET values only when required: false (I already assumed the latter was already the case anyways).

rcannood commented 3 months ago

I'm going to update my previous post with the latest updates.


Problem statement

There are multiple way of specifying arguments -- either from the CLI, in Groovy, or as a param_list csv / yaml.

Right now, passing arguments directly in Groovy or in a YAML param_list offers the greatest flexibility in distinguishing between things:

This is due to the command-line interface having to pass values as strings (e.g. --my_multiple_arg "") and the CLI not being able to distinguish between the various edge cases. The following proposal aims to change this by introducing a few keywords, quoting, and the ability to escape certain characters.


Example

Let us take the following list of arguments:

name: mycomp
arguments:
  - type: integer
    name: --optional_integer
    default: 42
    required: false
  - type: string
    name: --multiple_string
    required: false
    multiple: true
    allow_missing: true # new field added

(I could add all possible variants of required arguments, optional arguments, multiple arguments; for types string, integer, double, boolean, but it would be just more of the same).

Desired parameter sets

Let's say we want our channel to contain three events:

[
  [ "event0", [optional_integer: 42, multiple_string: ["a", null, "c"] ],
  [ "event1", [optional_integer: 44, multiple_string: null] ],
  [ "event2", [optional_integer: null, multiple_string: [] ],
  [ "event3", [optional_integer: 42, multiple_string: [null]] ]
]

Parameter ingestion variations

In Nextflow / Groovy

In Nextflow / Groovy, this is already possible. We can just create a channel with exactly these parameters and presto. However, to ensure that the default value of optional_integer is not used, we need to explicitly pass null to this argument.

Channel.fromList([
  // optional_integer can be omitted, since it's the default
  [ "event0", [multiple_string: ["a", null, "c"] ],
  // multiple_string can be omitted, since the default is already what we want it to be
  [ "event1", [optional_integer: 44] ],
  // explicitly unset `optional_integer`, and explicitly set `multiple_string` to an empty list
  [ "event2", [optional_integer: null, multiple_string: [] ],
  // optional_integer can be omitted
  [ "event3", [multiple_string: [null]] ]
])
  | mycomp.run(
    fromState: ["optional_integer", "multiple_string"],
    toState: [...]
  )

As a yaml param list

All Viash workflows can also ingest a parameter list (csv, json, or yaml) using the param_list argument. For instance:

./params.yaml:

param_list:
  - id: event0
    multiple_string: ["a", null, "c"]
  - id: event1
    optional_integer: 44
  - id: event2
    optional_integer: null
    multiple_string: []
  - id: event3
    multiple_string: [null]

and then running the component with nextflow run . -main-script target/nextflow/mycomp/main.nf -params-file ./params.yaml should yield exactly the same result.

As a csv param list

However, using a csv based parameter list causes issues here, because we can define the following csv:

id,optional_integer,multiple_string
event0,42,a;MISSING;c
event1,44,NOT_SET
event2,NOT_SET,
event3,42,MISSING

This introduces different notations.

  1. We should be able to distinguish between an empty list ([]) or a list of an empty string ([""]). We can do so by allowing the user to quote the empty string:

    • ` →[]`
    • ""[""]
  2. We should be able to unset an argument (e.g. when it is optional but it already has a default). We can add a reserved keyword (NOT_SET) for it, but the user can also quote it to pass the original string.

    • NOT_SETNone
    • "NOT_SET"["NOT_SET"]
  3. We should be able to pass values containing the separator. Proposal:

    • ;["", ""]
    • ";"[";"]
    • ';"";'["", "", ""]
  4. We should also be able to pass strings containing quotes. Therefore, we should be able to escape quotes (\"), and therefore also need to be able to escape the escape character (\\). While we're at it, we should then also be able to escape separators (\;). For CSV files, we should also be able to escape the csv separator (\,)

    • "quote"["quote"]
    • \"noquote\"['"quote"']
    • escape\\slash["escape\slash"]
    • noneedto"quote['noneedto"quote']
    • escape\;sep["escape;sep"]
    • escape\,sep["escape,sep"]

Note that other combinations of escaped characters (e.g. \n) should remain unchanged.

From the CLI

When calling an executable from the CLI, we can apply the same parsing as with the CSV, except the user can also call the same argument multiple times. Specifically, the following should be valid:

# event 0
target/executable/mycomp/mycomp \
  --multiple_string a --multiple_string MISSING --multiple_string c

# event 1
target/executable/mycomp/mycomp \
  --optional_integer 44

# event 2
target/executable/mycomp/mycomp \
  --optional_integer NOT_SET \
  --multiple_string ""

# event 3
target/executable/mycomp/mycomp \
  --multiple_string MISSING

However, the following should also be valid

# event 0 alternative
target/executable/mycomp/mycomp \
  --multiple_string "a;MISSING;c"

# event 0 yet another alternative
target/executable/mycomp/mycomp \
  --multiple_string "a" --multiple_string "MISSING;c"

And the latter should also work using the Nextflow CLI:

# event 0
nextflow run . -main-script target/nextflow/mycomp/main.nf \
  --multiple_string "a;MISSING;c"

How it's interpreted in the scripts

R

# event 0
par <- list(optional_integer = 42L, multiple_string = c("a", NA_character_, "c"))

# event 1
par <- list(optional_integer = 44L, multiple_string = NULL) # breaking change

# event 2
par <- list(optional_integer = NULL, multiple_string = character(0))

# event 3
par <- list(optional_integer = 42L, multiple_string = c(NA_character_))

Python

# event 0
par = {"optional_integer": 42, "multiple_string": ["a", None, "c"]}

# event 1
par = {"optional_integer" : 44, "multiple_string": None} # breaking change

# event 2
par = {"optional_integer": None, "multiple_string": []} # breaking change

# event 3
par = {"optional_integer": 42, "multiple_string": [None]}

JavaScript

// event 0
const par = {"optional_integer": 42, "multiple_string": ["a", undefined, "c"]}

// event 1
const par = {"optional_integer" : 44, "multiple_string": undefined} // breaking change

// event 2
const par = {"optional_integer": undefined, "multiple_string": []} // breaking change

// event 3
const par = {"optional_integer": 42, "multiple_string": [undefined]}

C

// event 0
var par = new { optional_integer = 42, multiple_string = new string[] { "a", null, "c" } };

// event 1
var par = new { optional_integer = 44, multiple_string = (string[])null };

// event 2
var par = new { optional_integer = (int?)null, multiple_string = new string[0] };

// event 3
var par = new { optional_integer = 42, multiple_string = new string[] { null } };

Scala

case class Par (
  optional_integer: Some[Int],
  multiple_string: Some[List[String]] // breaking change
)

// event 0
val par = Par(optional_integer = Some(42), multiple_string = Some(List("a", null, "c")))

// event 1
val par = Par(optional_integer = Some(44), multiple_string = None)

// event 2
val par = Par(optional_integer = None, multiple_string = Some(List()))

// event 3
val par = Par(optional_integer = Some(42), multiple_string = Some(List(null)))

Bash

# event0
par_optional_integer=42
par_multiple_string="a;MISSING;c"

# event1
par_optional_integer=44
# par_multiple_string is unset

# event2
# par_optional_integer is unset
par_multiple_string=""

# event3
par_optional_integer=42
par_multiple_string="MISSING"

Conclusion

The following set of changes are proposed.

Breaking changes

New functionality

tverbeiren commented 3 months ago

I would personally prefer --multiple_string ""[""] rather than [] because intuitively you only provide 1 (empty) argument in the list IMO. I understand though that this causes inconsistencies elsewhere.

A question: when you pass --single_opt 'foo' as a parameter on the shell, does the shell not automatically strip the '? Or in other words, are we able to distinguish between, e.g. 'MISSING', "MISSING" and MISSING?

Why would we use NOT_SET as a keyword and not NA which to me is more intuitive?

Am I right in thinking that MISSING for an entry in a multiple element is the same as NOT_SET for the element itselft? Both translate to None or am I wrong? Could this be simplified? I guess not because in the examples above both point to different values. I'm afraid this will be hard to remember. But then again, we we are considering edge cases I guess? One suggestion I could make is to use something like NOT_SET for the main argument and NOT_SET_ITEM for the a list item [^1].

[^1]: But then use NA instead of NOT_SET 🤷

DriesSchaumont commented 3 months ago

I would personally prefer --multiple_string ""[""] rather than [] because intuitively you only provide 1 (empty) argument in the list IMO. I understand though that this causes inconsistencies elsewhere.

From the perspective of the shell, I find you example counterintuitive because:

$ echo '""'
""
$ echo ''

$ echo ""

Also, a second issue arises from your poposed syntax: how to specify an empty list?

A question: when you pass --single_opt 'foo' as a parameter on the shell, does the shell not automatically strip the '? Or in other words, are we able to distinguish between, e.g. 'MISSING', "MISSING" and MISSING?

It does (see examples above). As an extra note, you can pass whatever you would like given the proper escaping

$ echo "\"MISSING"\"
"MISSING

Why would we use NOT_SET as a keyword and not NA which to me is more intuitive? Am I right in thinking that MISSING for an entry in a multiple element is the same as NOT_SET for the element itselft? Both translate to None or am I wrong? Could this be simplified? I guess not because in the examples above both point to different values. I'm afraid this will be hard to remember. But then again, we we are considering edge cases I guess? One suggestion I could make is to use something like NOT_SET for the main argument and NOT_SET_ITEM for the a list item 1.

Footnotes

1. But then use `NA` instead of `NOT_SET` 🤷 [↩](#user-content-fnref-1-fe46bbd51c562bc33419e0e2e876499a)

The reason why we need two values is because of an edge case Robrecht described above: we need to be able to distinguish between None and [None].

Given only one value

There is no real way to define [None].

tverbeiren commented 3 months ago

Given only one value

  • --multiple_string SOME_SPECIAL_VALUENone
  • --multiple_string ""[]
  • --multiple_string '""'[""]
  • --multiple_string "SOME_SPECIAL_VALUE"["SOME_SPECIAL_VALUE"]

There is no real way to define [None].

Hence my suggestion:

For me, the _ITEM addition makes clear an item in the list is missing rather than the list itself. I find that having that connotation makes it clearer than MISSING.

rcannood commented 3 months ago

I was at first strongly opposed to using the same name (NOT_SET and NOT_SET_ITEM) for, what felt like, two very distinct things. However, the more that I thought about how to formulate how I feel that the two concepts are different, the more I realised that this is probably just in my head.

I don't really like NOT_SET_ITEM because it's quite a mouthful.

I propose we borrowing from terminology used in other programming languages. For example:

Taking into account the popularity of each of these languages, I guess NONE and NULL are the best options. Whether we make ITEM (or ELEMENT?) a prefix or a suffix is another question.

Happy to pick this up during a frodo to discuss!

rcannood commented 3 months ago

UNDEFINED en UNDEFINED_ITEM!

rcannood commented 3 months ago

Final proposal:

I'm going to update my previous post with the latest updates.


Problem statement

There are multiple way of specifying arguments -- either from the CLI, in Groovy, or as a param_list csv / yaml.

Right now, passing arguments directly in Groovy or in a YAML param_list offers the greatest flexibility in distinguishing between things:

This is due to the command-line interface having to pass values as strings (e.g. --my_multiple_arg "") and the CLI not being able to distinguish between the various edge cases. The following proposal aims to change this by introducing a few keywords, quoting, and the ability to escape certain characters.


Example

Let us take the following list of arguments:

name: mycomp
arguments:
  - type: integer
    name: --optional_integer
    default: 42
    required: false
  - type: string
    name: --multiple_string
    required: false
    multiple: true
    allow_UNDEFINED_ITEM: true # new field added

(I could add all possible variants of required arguments, optional arguments, multiple arguments; for types string, integer, double, boolean, but it would be just more of the same).

Desired parameter sets

Let's say we want our channel to contain three events:

[
  [ "event0", [optional_integer: 42, multiple_string: ["a", null, "c"] ],
  [ "event1", [optional_integer: 44, multiple_string: null] ],
  [ "event2", [optional_integer: null, multiple_string: [] ],
  [ "event3", [optional_integer: 42, multiple_string: [null]] ]
]

Parameter ingestion variations

In Nextflow / Groovy

In Nextflow / Groovy, this is already possible. We can just create a channel with exactly these parameters and presto. However, to ensure that the default value of optional_integer is not used, we need to explicitly pass null to this argument.

Channel.fromList([
  // optional_integer can be omitted, since it's the default
  [ "event0", [multiple_string: ["a", null, "c"] ],
  // multiple_string can be omitted, since the default is already what we want it to be
  [ "event1", [optional_integer: 44] ],
  // explicitly unset `optional_integer`, and explicitly set `multiple_string` to an empty list
  [ "event2", [optional_integer: null, multiple_string: [] ],
  // optional_integer can be omitted
  [ "event3", [multiple_string: [null]] ]
])
  | mycomp.run(
    fromState: ["optional_integer", "multiple_string"],
    toState: [...]
  )

As a yaml param list

All Viash workflows can also ingest a parameter list (csv, json, or yaml) using the param_list argument. For instance:

./params.yaml:

param_list:
  - id: event0
    multiple_string: ["a", null, "c"]
  - id: event1
    optional_integer: 44
  - id: event2
    optional_integer: null
    multiple_string: []
  - id: event3
    multiple_string: [null]

and then running the component with nextflow run . -main-script target/nextflow/mycomp/main.nf -params-file ./params.yaml should yield exactly the same result.

As a csv param list

However, using a csv based parameter list causes issues here, because we can define the following csv:

id,optional_integer,multiple_string
event0,42,a;UNDEFINED_ITEM;c
event1,44,UNDEFINED
event2,UNDEFINED,
event3,42,UNDEFINED_ITEM

This introduces different notations.

  1. We should be able to distinguish between an empty list ([]) or a list of an empty string ([""]). We can do so by allowing the user to quote the empty string:

    • ` →[]`
    • ""[""]
  2. We should be able to unset an argument (e.g. when it is optional but it already has a default). We can add a reserved keyword (UNDEFINED) for it, but the user can also quote it to pass the original string.

    • UNDEFINEDNone
    • "UNDEFINED"["UNDEFINED"]
  3. We should be able to pass values containing the separator. Proposal:

    • ;["", ""]
    • ";"[";"]
    • ';"";'["", "", ""]
  4. We should also be able to pass strings containing quotes. Therefore, we should be able to escape quotes (\"), and therefore also need to be able to escape the escape character (\\). While we're at it, we should then also be able to escape separators (\;). For CSV files, we should also be able to escape the csv separator (\,)

    • "quote"["quote"]
    • \"noquote\"['"quote"']
    • escape\\slash["escape\slash"]
    • noneedto"quote['noneedto"quote']
    • escape\;sep["escape;sep"]
    • escape\,sep["escape,sep"]

Note that other combinations of escaped characters (e.g. \n) should remain unchanged.

From the CLI

When calling an executable from the CLI, we can apply the same parsing as with the CSV, except the user can also call the same argument multiple times.

Note that the examples below only show target/executable/mycomp/mycomp, but the same also applies when running nextflow components via the nextflow CLI.

Specifically, the following should be valid:

# event 0
target/executable/mycomp/mycomp \
  --multiple_string a --multiple_string UNDEFINED_ITEM --multiple_string c

# event 1
target/executable/mycomp/mycomp \
  --optional_integer 44

# event 2
target/executable/mycomp/mycomp \
  --optional_integer UNDEFINED \
  --multiple_string ""

# event 3
target/executable/mycomp/mycomp \
  --multiple_string UNDEFINED_ITEM

However, the following should also be valid

# event 0 alternative
target/executable/mycomp/mycomp \
  --multiple_string "a;UNDEFINED_ITEM;c"

# event 0 yet another alternative
target/executable/mycomp/mycomp \
  --multiple_string "a" --multiple_string "UNDEFINED_ITEM;c"

And the latter should also work using the Nextflow CLI:

# event 0
nextflow run . -main-script target/nextflow/mycomp/main.nf \
  --multiple_string "a;UNDEFINED_ITEM;c"

How it's interpreted in the scripts

R

# event 0
par <- list(optional_integer = 42L, multiple_string = c("a", NA_character_, "c"))

# event 1
par <- list(optional_integer = 44L, multiple_string = NULL) # breaking change

# event 2
par <- list(optional_integer = NULL, multiple_string = character(0))

# event 3
par <- list(optional_integer = 42L, multiple_string = c(NA_character_))

Python

# event 0
par = {"optional_integer": 42, "multiple_string": ["a", None, "c"]}

# event 1
par = {"optional_integer" : 44, "multiple_string": None} # breaking change

# event 2
par = {"optional_integer": None, "multiple_string": []} # breaking change

# event 3
par = {"optional_integer": 42, "multiple_string": [None]}

JavaScript

// event 0
const par = {"optional_integer": 42, "multiple_string": ["a", undefined, "c"]}

// event 1
const par = {"optional_integer" : 44, "multiple_string": undefined} // breaking change

// event 2
const par = {"optional_integer": undefined, "multiple_string": []} // breaking change

// event 3
const par = {"optional_integer": 42, "multiple_string": [undefined]}

C

// event 0
var par = new { optional_integer = 42, multiple_string = new string[] { "a", null, "c" } };

// event 1
var par = new { optional_integer = 44, multiple_string = (string[])null };

// event 2
var par = new { optional_integer = (int?)null, multiple_string = new string[0] };

// event 3
var par = new { optional_integer = 42, multiple_string = new string[] { null } };

Scala

case class Par (
  optional_integer: Some[Int],
  multiple_string: Some[List[String]] // breaking change
)

// event 0
val par = Par(optional_integer = Some(42), multiple_string = Some(List("a", null, "c")))

// event 1
val par = Par(optional_integer = Some(44), multiple_string = None)

// event 2
val par = Par(optional_integer = None, multiple_string = Some(List()))

// event 3
val par = Par(optional_integer = Some(42), multiple_string = Some(List(null)))

Bash

# event0
par_optional_integer=42
par_multiple_string="a;UNDEFINED_ITEM;c"

# event1
par_optional_integer=44
# par_multiple_string is unset

# event2
# par_optional_integer is unset
par_multiple_string=""

# event3
par_optional_integer=42
par_multiple_string="UNDEFINED_ITEM"

Conclusion

The following set of changes are proposed.

Breaking changes

New functionality