pepkit / looper

A job submitter for Portable Encapsulated Projects
http://looper.databio.org
BSD 2-Clause "Simplified" License
20 stars 7 forks source link

Introducing CWL compatibility #283

Closed nsheff closed 3 years ago

nsheff commented 4 years ago

Here I outlined how looper can be used as a tabular scatterer for CWL workflows: https://github.com/pepkit/pep-cwl. At first I did this with a small hack to looper, but I want to figure out the best way to do it long term.

Right now, when peppy converts a sample object to yaml, it just creates a vary simple key-value mapping:

my_attr: value-of-attr
my_file: value/of/file/attribute.fasta.gz
my_dir: value/of/dir/attr/

This actually works as inputs to a CWL run --- except the issue is that CWL specifies two data types that treated specially: File and Directory. For File and Directory, it expects the YAML look like:

my_attr: value-of-attr
my_file: 
  class: File
  path: value/of/file/attribute.fasta.gz
my_dir: 
  class: Directory
  location: value/of/dir/attr/

So for peppy to produce a CWL-comptabile yaml, we have to slightly tweak the yaml output for anything that the CWL tool expects as a File or Directory. I just hacked this into looper for the attribute I knew was a file in the example to prove I could do it. How can we do this universally?

Well, looper knows which attributes are files from the PEP schema. iI the PEP schema could also specify a Directory type, then... we could change the peppy.Sample.to_yaml function to output any of those defined attributes in this format. I would suggest this be an option to to_yaml called cwl_compatible=True -- because some tools want the raw values (and to me, it actually makes more sense to keep the schema separate from the input data anyway). But this way we can handle both approaches.

One potential issue is that to_yaml is a peppy function, while eido and looper care about schemas -- so how do we keep peppy from requiring schemas but still allow to_yaml to produce cwl-compatible output? One answer is that the to_yaml function could take a list of File attributes and a list of Directory attributes and make the change, and then it's the reponsibility of whoever calls it (in this case it's looper) to provide the information from the schema.

nsheff commented 4 years ago

Would appreciate thoughts from @stolarczyk and @afrendeiro

nsheff commented 4 years ago

In process of working on this, I've found a few related issues. This is the sample yaml produced by looper for the pep in https://github.com/pepkit/pep-cwl:

sample_name: frog_1
library: anySampleType
file: data/frog1_data.txt
pipeline_interfaces: cwl_interface.yaml
prj:
  pep_version: 2.0.0
  sample_table: /home/nsheff/code/incubator/learn_cwl/cwl-pep/file_list.csv
  sample_modifiers:
    append:
      pipeline_interfaces: cwl_interface.yaml
  looper:
    output_dir: pipeline_results
input_file_size: 2.60770320892334e-08
all_inputs:
- data/frog1_data.txt
required_inputs:
- data/frog1_data.txt
files:
- file
required_files:
- file
yaml_file: pipeline_results/submission/frog_1.yaml

Notice the schema and sample attributes are attached in parallel. This is problem because they could overwrite each other. For example, if the sample had an attribute called files or required_files or yaml_file or prj or all_inputs, what would happen?

I would suggest this yaml writer should instead use a sample or sample_attributes subsection for the direct sample attributes. This would require changing any downstream pipelines that relied on the current format (which is I think mostly @afrendeiro's pipelines?). Unfortunately this current approach is not really a good model.

Then I'd propose the CWL approach would be to output an additional file that follows the cwl outline, which would only include attributes under the new sample section (so, no prj or schema-associated stuff), but would include the CWL handling for File and Directory attributes.

stolarczyk commented 4 years ago

yeah, sourcing this information from schema is a natural choice. I don't think there's any other possibility.

As for the implementation of that, we also have little room for manoeuvre. That said, I'd probably make it more generic and instead of cwl_compatible=True + list of Directories and Files I'd provide some kind of "schema_format_mapping" that would be optional and looked like this:

{
    "Directory": {"sample_attr_value_key": "location", "sample_attrs": ["dir1", "dir2", "dir3"]},
    "File": {"sample_attr_value_key": "path", "sample_attrs": ["file6", "file4", "file5"]}
}

if this argument is provided, Sample attributes specified in sample_attrs are not reported as simple key-value pairs but with a specified class and value keyed by string specified in sample_attr_value_key

nsheff commented 4 years ago

Not sure we need to go to that complexity since this will likely only ever be used for CWL.

nsheff commented 4 years ago

Ok this seems to work, with the caveats of #284 which are independent of this issue. I need to make it optional though.

afrendeiro commented 4 years ago

Hi! I see no issues with moving sample attributes to sample or sample_attributes in the yaml. It's a small change downstream.

One comment though: I'm not too familiar with CWL but it kind of seems like they're abusing YAML in order to have types as fields :/ Why aren't there types for every attribute (e.g. why isn't my_attr in the example of class String or something?). If that's what CWL is, fine we aren't going to change that, but it seems like we're mixing quite some things here. Why not simply have looper implement a to_cwl(sample: peppy.Sample) method that saves a .cwl file in addition to the yaml if the user so wants?

nsheff commented 4 years ago

Why not simply have looper implement a to_cwl(sample: peppy.Sample) method that saves a .cwl file in addition to the yaml if the user so wants?

Yes, this is exactly what I suggested above! But actually I made it even better with #285.

t kind of seems like they're abusing YAML in order to have types as fields :/ Why aren't there types for every attribute (e.g. why isn't my_attr in the example of class String or something?). If that's what CWL is, fine we aren't going to change that, but it seems like we're mixing quite some things here.

Well, that's definitely what CWL is, and I agree that it's a bit strange. I do not quite follow the rationale for doing it this way but I'm sure there is one.

afrendeiro commented 4 years ago

Oh I see. But just to be clear, the cwl_yaml atrribute would be a path to a file with .cwl ending right? Even though it is also valid YAML? So #284 is independent of CWL, its sole purpose is to avoid clash between user-defined attributes and others that are added by tools like looper, correct?

nsheff commented 4 years ago

But just to be clear, the cwl_yaml atrribute would be a path to a file with .cwl ending right?

No, it's not a cwl file, it's a yaml file that's used as a job description... the cwl files refer to tools or workflows, not to yaml inputs. The inputs are actually .yml files. cwl + yml = workflow run. it's kind of similar to pypiper with sample yaml inputs actually.

So #284 is independent of CWL, its sole purpose is to avoid clash between user-defined attributes and others that are added by tools like looper, correct?

Yes, exactly! -- and also by peppy/eido when it manages the PEP schemas (right now for example I think it's peppy or eido that adds the files attribute)

afrendeiro commented 4 years ago

Sounds good.

No, it's not a cwl file, it's a yaml file that's used as a job description... the cwl files refer to tools or workflows, not to yaml inputs. The inputs are actually .yml files.

That's strange, but okay. Maybe we can suffix it with "cwl.yaml" or some way to make it clear is the cwl format?