Reading again through this: this is a question specific to when a run executes a workflow, right? When the run is self-contained, the run.cwl is too?
That depends. I interpreted the ARC specification to mean that every computational step should be described as either a tool or a workflow description and saved in the workflows folder. That wouldn't allow for self-contained runs, unless the run requires no computational steps.
This was simply due to a mistake: run.cwl is meant to be run.yml. The idea is that under workflows you find the more reusable part, and a run is made concrete by its specific run parameters, especially the concrete input/output!
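To make the split a bit more tangible, a run.yml for one specific run could look roughly like the sketch below. The parameter names (inputFile, threshold) and the path are purely hypothetical; the real keys have to match whatever inputs the referenced workflow.cwl declares. Note that a CWL job file only carries input values, so the concrete outputs follow from the workflow description rather than from this file.

```yaml
# run.yml: job file for one specific run
# All names and paths below are hypothetical placeholders.

# Assumes workflow.cwl declares a File input called inputFile.
inputFile:
  class: File
  path: ../../assays/MyAssay/dataset/measurement.csv

# Assumes workflow.cwl declares a numeric input called threshold.
threshold: 0.05
```

With a file like this in the run folder, the reusable workflow.cwl stays untouched and only the job file changes from run to run.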
To add to this issue, after a discussion we had: we have no way of telling how a run is intended to be executed unless it is actually executed and a run report is generated in some way. So we need a way to declare the intention, i.e. which combination of cwl and yml file should be executed for the specific run.
Originally, there was the arc.cwl in the root, which was supposed to execute the whole ARC when run. As I understood it, this was dropped for ease of use and to not overcomplicate things. This would be one possibility to get the connection between workflow/tool file and job file for a run. The other possibility would be the example I posted above:
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [cwltool, ../../workflows/MyWorkflow/workflow.cwl, run.yml]
# a CommandLineTool must declare inputs, even if there are none
inputs: []
outputs:
  myOutput:
    type: Directory
    outputBinding:
      # this returns the whole output directory
      glob: $(runtime.outdir)/myDir
One of those two possibilities, or a third one that handles it, should be implemented to get that connection info. It would also be useful to get input from other people working with ARCs on what they prefer for ease of use. What do you think about this issue, @Brilator and @floWetzels?
Do I understand the question correctly: how do we document which "run.yml" + "workflow.cwl" combination yields which output?
The way I currently do it is similar to the above: having a readme in the respective runs folder with something like
cwltool ../../workflows/MyWorkflow/workflow.cwl run.yml
Plus I was planning to collect the overall ARC analysis / workflows with one arc.cwl in the root (currently more for visualization of the in-and-outs).
Yes, that's the question here. The arc.cwl in the root you mention would be the first case, with the arc.cwl that executes the whole run. A readme in the runs folder also solves the question, at least for the user reading the ARC. The problem there would be how we ensure that it follows a specific format and is also machine-readable, so we can include it in the ARC data model.
Yes, I meant to confirm that my non-machine-readable solution was aiming in the same direction.
Not sure about your outputs bound to a directory. Or is this just one example and one would have to adapt it for other workflows?
This output would vary between runs. Each run.cwl would have the directory where the run is stored written there.
I would for now add Version 2 to the ARC specification. This way we have a way to accurately identify the intention of run execution and the run execution itself. If in the future a better solution comes up, this could be subject to change again.
The ARC specification states under Workflow description that tools and workflows that are used during computational analysis must be described in the workflows folder as .cwl files. In the run description it is stated that each run needs a corresponding run.cwl that describes how that exact run result is composed.
Due to the nature of CWL, this run.cwl may be unnecessary overhead or could be simplified. All necessary information about the run execution can be derived from the combination of the executed .cwl file and the run.yml. The run.yml is already located in the corresponding runs folder. So only the information of the CWL file that was executed remains. If one were to create the run.cwl as stated in the specification, I have two possibilities in mind:
1. Wrap the executed tool or workflow in another workflow:
This comes with the disadvantage that it is quite a large overhead. All required inputs must be specified in the workflow again and mapped to the inputs required by the tool/workflow. The outputs then must be collected as usual in a workflow.
In the worst case, the run.cwl file is almost like a copy of the referenced workflow.cwl.
2. Create a tool CWL that executes the CWL runner with the given cwl and yml files:
Example:
This way, it's just the executing command wrapped in a command line tool CWL. It returns the entire output directory, so as long as the executed workflow is well described, it should return everything as intended. This could only become difficult if expression tools are used at the end of a workflow to sort files. This is only a small overhead and contains all required information.
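For comparison, possibility 1 could look roughly like the sketch below. It assumes the wrapped workflow.cwl has a single File input called inputFile and a Directory output called myOutput; both names are made up for illustration. A real workflow would need every input and output repeated here, which is exactly the overhead described for possibility 1.

```yaml
cwlVersion: v1.2
class: Workflow

# Only needed if workflow.cwl is itself a Workflow (not a single tool).
requirements:
  SubworkflowFeatureRequirement: {}

# Hypothetical input; it has to mirror an input of the wrapped workflow.
inputs:
  inputFile: File

steps:
  runMyWorkflow:
    # Path relative to this run.cwl, as in the example from the thread.
    run: ../../workflows/MyWorkflow/workflow.cwl
    in:
      # Assumes workflow.cwl declares an input called inputFile.
      inputFile: inputFile
    # Assumes workflow.cwl declares an output called myOutput.
    out: [myOutput]

outputs:
  myOutput:
    type: Directory
    outputSource: runMyWorkflow/myOutput
```

Such a wrapper would then be started from the run folder with cwltool run.cwl run.yml, so the connection between workflow and job file is visible in the run.cwl itself.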
Since the information we require is only which workflow/tool is executed, can we maybe find a better way to represent that information? Or do we want to stick with the run.cwl and recommend the example I posted? Or do we want to recommend wrapping everything in one workflow again?
Edit: links, format, small adjustments