Open stxue1 opened 4 months ago
To address your first question: paths are relative to the execution environment. So if a container is used, then that means relative to the working directory inside the container. But that means it's generally a bad idea to use relative paths as default values for inputs because the execution engine is free to make the working directory be whatever it wants. For example, it could mount a volume on the host machine to /foo
in the container and then make the working directory in the container be /foo/bar
. Obviously "test.txt" won't exist in this directory. The spec should probably be updated to make this caveat clear and caution against using relative paths.
You are correct that missing optional workflow outputs should also be None
.
For the last part - I'm not sure I fully understand. Could you create a separate issue with a minimal reproducible test case (including any necessary input files)? Thanks
For the last part - I'm not sure I fully understand. Could you create a separate issue with a minimal reproducible test case (including any necessary input files)? Thanks
Sure, I think my main concern in the original included WDL workflow is how the len
outputs (len_in_body_out
, len_in_input_out
, and len_in_output
) are all different, with 2, 2, and 0 respectively, even though the declarations they compute should be functionally equivalent.
For example (Much of the WDL evaluation is in tasks instead of the workflow due to a limitation of miniwdl):
version 1.1
workflow testWorkflow_output {
input {
}
call testTask
output {
Array[File?] array_of_files = testTask.array_of_files
Int len = testTask.len
}
}
task testTask {
input {
}
command <<<>>>
output {
Array[File?] array_of_files = ["example1.txt", "example2.txt"]
Int len = length(select_all(array_of_files))
}
}
In the task, the files example1.txt
and example2.txt
do not exist, so I'd expect that array_of_files
is equal to [null, null]
. Because they do not exist, I'd expect select_all(array_of_files)
to equal []
, as all elements are null. Thus, len
should be 0.
For this workflow, Miniwdl agrees:
{
"dir": "/home/heaucques/Documents/wdl-conformance-tests/20240724_125731_testWorkflow_output",
"outputs": {
"testWorkflow_output.array_of_files": [
null,
null
],
"testWorkflow_output.len": 0
}
}
However, given a workflow like this:
version 1.1
workflow testWorkflow_body {
input {
}
call testTask
output {
Array[File?] array_of_files = testTask.array_of_files
Int len = testTask.len
}
}
task testTask {
input {
}
command <<<>>>
Array[File?] array_of_files_0 = ["example1.txt", "example2.txt"]
Int len_0 = length(select_all(array_in_body))
output {
Array[File?] array_of_files = array_of_files_0
Int len = len_0
}
}
I've effectively moved these lines from the output of the task to the body:
Array[File?] array_of_files = ["example1.txt", "example2.txt"]
Int len = length(select_all(array_of_files))
Thus, the second workflow will compute array_of_files
in the body section (computed in array_of_files_0
and copied into array_of_files
), while the first workflow computes array_of_files
in the output section.
Assuming the files do not exist on my machine, my "intuition" would think that the output of both workflows are the same:
For the first workflow testWorkflow_output
: the files don't exist in the container, so the array array_of_files
should be empty and the length of the array after a select_all
function should be 0
.
For the second workflow testWorkflow_body
: the files don't exist on my machine, so the array array_of_files_0
(and in extension array_of_files
) should be empty as well, and the length should also be 0
.
However, the spec specifically says that the coercion of an optional file will become null only in task outputs: https://github.com/openwdl/wdl/blob/caff59db192636d9f93f3f5659eb5939f51ff877/SPEC.md?plain=1#L3880
All file outputs are required to exist, otherwise the task will fail... the value will be undefined if the file does not exist
As a result, for the second workflow testWorkflow_body
, the nonexistent files in array_of_files_0
are not coerced, so select_all(array_of_files_0)
will see the array ["example1.txt", "example2.txt"]
instead of [null, null]
. Only when the output section is reached, or where array_of_files
is declared, will the file elements be coerced to null
.
Miniwdl seems to follow this concept, as the output for testWorkflow_body
is:
{
"dir": "/home/heaucques/Documents/wdl-conformance-tests/20240724_131039_testWorkflow_body",
"outputs": {
"testWorkflow_body.array_of_files": [
null,
null
],
"testWorkflow_body.len": 2
}
}
Even though array_of_files
is [null, null]
, the length (or the number of nonnull elements in the array) is technically reported as 2.
Rather than applying the coercion of the nonexistent files ["example1.txt", "example2.txt"
to become [null, null]
when array_of_files
is declared in the output section (with the line array_of_files = array_of_files_0
), should the coercion also happen in the body? Or since this is defined in the spec, is this behavior expected?
I originally encountered these issues at https://github.com/chanzuckerberg/miniwdl/issues/696.
One thing the WDL spec is vague about is how a task should coerce string to file. The spec says that all non-output declarations must run prior to the command section. My implicit understanding is that the output declaration will be running under a different directory than the rest of the task. It sounds like the output declarations are running in the current directory under the host machine, while the output section is running in the current directory inside the container.
For example:
Assuming all files exist, it's implicitly assumed that
f_input
andf_body
will point to some file on the host machine, butf_output
will point to the file inside the container. Maybe this should be clarified in the SPEC, as it is not immediately obvious.Another issue that arose when testing around with miniwdl is that there can be inconsistent behavior with coerced optional files.
Given the WDL workflow:
The spec says that optional file types at task outputs will be coerced to null.
For one, is there a reason why this scope is limited to just task outputs and not workflow outputs?
Additionally, because the spec says this null coercion is applied at the output step, given that the files
example1.txt
andexample2.txt
don't exist, the assumed correct output for the WDL workflow above is:Because the null coercion happens at the task output, the
select_all
function calls all will return different values depending on what part of the section it is called in; the body will return["example1.txt", "example2.txt"]
, giving a length of 2. However, for the task output declaration, the functionselect_all
will return[null, null]
, giving a length of 0. Since this can be counterintuitive as one may expect that a nonexistent file will always not be counted in aselect_all
call, is this the expected behavior, or what should the expected behavior be?