lanl / BEE


Investigate alternative CWL parsing solutions #145

Closed mcpherson closed 3 years ago

mcpherson commented 4 years ago

Maybe it's just because it's my focus, but I think CWL parsing is going to be a nightmare to write ourselves (even if it is for a perfect subset of CWL). I'm going to spend a little time looking at alternatives:

1) Try to connect with the CWL people and attend some of their phone conferences. I would like to ask them if they have any suggestions.
2) A recent cwltool issue has piqued my interest. Would it be possible to hack cwltool to parse and execute a CWL file where the execute step would load steps into the database instead of actually running them? I need to explore this and other potential hacks, since cwltool acts as a CWL verification tool (if it runs in cwltool, it's good).

mcpherson commented 4 years ago

Another thing that prompted this issue is the notion of a public release of BEE. If we release anything, some people are going to expect to run their CWL workflows on it. Our CWL support is seriously scoped down right now. We can explain this restriction, but people usually ignore such explanations and complain anyway. If we expect people to use this, we're going to have to support a more robust subset of CWL. That's going to be hard. Hence the search for alternative parsing solutions, so we can concentrate on our value-added features. It's worth a look.

mcpherson commented 4 years ago

I posted a comment on the cwltool PR mentioned above.

trandles-lanl commented 4 years ago

The reply to your comment and the meeting minutes where they discuss MPI are very helpful. We should track that discussion closely.

mcpherson commented 4 years ago

Sent email to Peter Amstutz (cwltool developer).

mcpherson commented 4 years ago

As we progress to more complicated workflows, I think we might find that the workflow's DAG is not static (as produced by parsing the CWL file). Take globbing, for example. Imagine one step that produces an unknown number of output files (say *.dat in a given directory) that feed an analysis step. In other words, the analysis step depends on *.dat, where the names and number of the files are undefined until runtime. In this case, the task manager will need to return a list of the files generated (it'll know where to look because the output directories are specified in the CWL). We might have a temporary, synthetic dependency (e.g. TASK_ID.*.dat) that must be expanded to the real dependencies (e.g. 0000.dat, 0001.dat, etc.) that are fed to the subsequent analysis task.
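A rough sketch of that expansion step, just to make the idea concrete. Everything here is hypothetical: the (task_id, pattern) tuple standing in for the synthetic dependency, the function names, and the dict of files reported by the task manager are illustration only, not anything BEE defines today.

```python
from fnmatch import fnmatch


def expand_placeholder(pattern, reported_files):
    """Return the reported file names that match the glob pattern, sorted."""
    return sorted(name for name in reported_files if fnmatch(name, pattern))


def resolve_inputs(inputs, reported_by_task):
    """Replace synthetic dependencies with the real files known at runtime.

    inputs: list whose items are either a plain file name (str) or a
        hypothetical (task_id, pattern) tuple acting as a synthetic dependency.
    reported_by_task: dict mapping a task id to the list of output file
        names the task manager reported for that task (an assumed interface).
    """
    resolved = []
    for item in inputs:
        if isinstance(item, tuple):
            task_id, pattern = item
            files = reported_by_task.get(task_id, [])
            resolved.extend(expand_placeholder(pattern, files))
        else:
            resolved.append(item)
    return resolved


if __name__ == "__main__":
    # The producing task finished and the task manager reported its outputs.
    reported = {"task42": ["0001.dat", "0000.dat", "run.log"]}
    # The analysis task was parsed with one fixed input and one placeholder.
    analysis_inputs = ["config.yml", ("task42", "*.dat")]
    print(resolve_inputs(analysis_inputs, reported))
```

The point is only that the DAG edge stays symbolic until the producing task completes, and is rewritten into concrete file dependencies at that moment.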

It just keeps getting more fun.

guanxyz commented 4 years ago

This example is very interesting. There are many uncertainties in some examples. However, I guess these may also exist in "regular manual scripts" when people want to automate their workflow. The exception I can imagine is a workflow in which some interactive operation requires input from the programmers (e.g., the step produces 000.dat and 0001.dat, but only 000.dat includes the data needed for the next analysis step). It would be good to treat this as a special case and ask for customization. Thoughts?

Boogie3D commented 3 years ago

@mcpherson and I are looking at CWL-Airflow as an example of how to implement cwltool parsing into a larger project. An example of how cwltool is used to parse CWL files is found here.

Boogie3D commented 3 years ago

There are potentially some issues with parsing using cwltool:

Further research is needed to see if these errors are fatal or if fixes/workarounds exist.

Boogie3D commented 3 years ago

This is the output I get when trying to load this file with cwltool:

Could not load extension schema https://schema.org/version/latest/schema.rdf: Error fetching https://schema.org/version/latest/schema.rdf: 404 Client Error: Not Found for url: https://schema.org/version/latest/schema.rdf
Could not load extension schema https://schema.org/version/latest/schema.rdf: Error fetching https://schema.org/version/latest/schema.rdf: 404 Client Error: Not Found for url: https://schema.org/version/latest/schema.rdf
Could not load extension schema https://schema.org/version/latest/schema.rdf: Error fetching https://schema.org/version/latest/schema.rdf: 404 Client Error: Not Found for url: https://schema.org/version/latest/schema.rdf
Could not load extension schema https://schema.org/version/latest/schema.rdf: Error fetching https://schema.org/version/latest/schema.rdf: 404 Client Error: Not Found for url: https://schema.org/version/latest/schema.rdf
Warning: Field `$schemas` contains undefined reference to `https://schema.org/version/latest/schema.rdf`
Warning: Field `$schemas` contains undefined reference to `https://schema.org/version/latest/schema.rdf`
Warning: Field `$schemas` contains undefined reference to `https://schema.org/version/latest/schema.rdf`
Warning: Field `$schemas` contains undefined reference to `https://schema.org/version/latest/schema.rdf`
bam-bedgraph-bigwig-single.cwl:297:17: object id `http://orcid.org/0000-0002-6486-3898` previously defined
Warning: Field `$schemas` contains undefined reference to `https://schema.org/version/latest/schema.rdf`
bam-bedgraph-bigwig-single.cwl:297:17: object id `http://orcid.org/0000-0002-6486-3898` previously defined
Warning: Field `$schemas` contains undefined reference to `https://schema.org/version/latest/schema.rdf`
bam-bedgraph-bigwig-single.cwl:297:17: object id `http://orcid.org/0000-0002-6486-3898` previously defined
Warning: Field `$schemas` contains undefined reference to `https://schema.org/version/latest/schema.rdf`
bam-bedgraph-bigwig-single.cwl:297:17: object id `http://orcid.org/0000-0002-6486-3898` previously defined
bam-bedgraph-bigwig-single.cwl:297:17: object id `http://orcid.org/0000-0002-6486-3898` previously defined
test:1:1: JSHINT:     return inputs.output_filename ? inputs.output_filename : default_output_filename();
test:1:1: JSHINT:                                                              ^
test:1:1: JSHINT: W117: 'default_output_filename' is not defined.
bam-bedgraph-bigwig-single.cwl:249:7: JSHINT:     return inputs.output_filename ? inputs.output_filename : default_output_filename();
bam-bedgraph-bigwig-single.cwl:249:7: JSHINT:                                                              ^
bam-bedgraph-bigwig-single.cwl:249:7: JSHINT: W117: 'default_output_filename' is not defined.
test:1:1: JSHINT: (function(){return ((get_output_filename()));})()
test:1:1: JSHINT:                      ^
test:1:1: JSHINT: W117: 'get_output_filename' is not defined.
bam-bedgraph-bigwig-single.cwl:481:7: JSHINT: (function(){return ((get_output_filename()));})()
bam-bedgraph-bigwig-single.cwl:481:7: JSHINT:                      ^
bam-bedgraph-bigwig-single.cwl:481:7: JSHINT: W117: 'get_output_filename' is not defined.
bam-bedgraph-bigwig-single.cwl:611:13: JSHINT:       return default_output_filename();
bam-bedgraph-bigwig-single.cwl:611:13: JSHINT:              ^
bam-bedgraph-bigwig-single.cwl:611:13: JSHINT: W117: 'default_output_filename' is not defined.
bam-bedgraph-bigwig-single.cwl:630:13: JSHINT:       return default_output_filename();
bam-bedgraph-bigwig-single.cwl:630:13: JSHINT:              ^
bam-bedgraph-bigwig-single.cwl:630:13: JSHINT: W117: 'default_output_filename' is not defined.

Upon further inspection of the CWL file in question, these errors might actually be specific to it. The function default_output_filename() is defined by an embedded JavaScript block in the CWL, but cwltool might not be parsing that block correctly. Additionally, the schema link that fails to load is provided by the file itself.

I'll test more CWL files to see if this is a recurring issue.

Boogie3D commented 3 years ago

cwltool.load_tool, given a path to a CWL file, returns a Workflow object, which contains a tool attribute that holds pretty much all of the workflow information we will probably need to construct a DAG. The data structure used by tool is pretty messy, essentially many embedded ordereddict objects and lists thereof, so I will try to investigate more convenient data structures (i.e. WorkflowJob/WorkflowJobStep).
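To illustrate what "constructing a DAG" from that nested structure could look like, here is a minimal sketch that does not call cwltool at all. It assumes a plain dict shaped like a CWL Workflow document in list form (steps with `id`, `in`, `out`, and `source` fields) and short `step/output` references. The real `tool` attribute uses full URIs for ids, so the id handling below is a simplification, and `build_dag` and `EXAMPLE_WORKFLOW` are made-up names.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_dag(workflow):
    """Map each step id to the set of step ids whose outputs it consumes."""
    step_ids = {step["id"] for step in workflow["steps"]}
    dag = {}
    for step in workflow["steps"]:
        parents = set()
        for step_input in step.get("in", []):
            sources = step_input.get("source", [])
            if isinstance(sources, str):
                # A single source may be given as a bare string.
                sources = [sources]
            for source in sources:
                # "step/output" refers to another step's output; a bare
                # name refers to a workflow-level input and adds no edge.
                producer, sep, _ = source.partition("/")
                if sep and producer in step_ids:
                    parents.add(producer)
        dag[step["id"]] = parents
    return dag


# Hand-written stand-in for a parsed workflow (not real cwltool output).
EXAMPLE_WORKFLOW = {
    "class": "Workflow",
    "steps": [
        {"id": "plot",
         "in": [{"id": "table", "source": ["analyze/summary", "simulate/data"]}],
         "out": ["figure"]},
        {"id": "simulate",
         "in": [{"id": "size", "source": "grid_size"}],
         "out": ["data"]},
        {"id": "analyze",
         "in": [{"id": "raw", "source": "simulate/data"}],
         "out": ["summary"]},
    ],
}

if __name__ == "__main__":
    dag = build_dag(EXAMPLE_WORKFLOW)
    # static_order() yields each step after all of its predecessors.
    print(list(TopologicalSorter(dag).static_order()))
```

A flat step-to-parents mapping like this is much easier to load into a graph database or scheduler than the nested dicts, which is the main reason to look for a more convenient intermediate structure.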

mcpherson commented 3 years ago

Largely complete. Investigated alternatives. Decided to support a subset of CWL and develop that parser ourselves as part of the complex workflow upgrade. Recommend closing with a more detailed closure comment.

mcpherson commented 3 years ago

Supporting subset of CWL with a parser we develop. Closing.