Closed nsheff closed 3 years ago
Would appreciate thoughts from @stolarczyk and @afrendeiro
In process of working on this, I've found a few related issues. This is the sample yaml produced by looper for the pep in https://github.com/pepkit/pep-cwl:
sample_name: frog_1
library: anySampleType
file: data/frog1_data.txt
pipeline_interfaces: cwl_interface.yaml
prj:
pep_version: 2.0.0
sample_table: /home/nsheff/code/incubator/learn_cwl/cwl-pep/file_list.csv
sample_modifiers:
append:
pipeline_interfaces: cwl_interface.yaml
looper:
output_dir: pipeline_results
input_file_size: 2.60770320892334e-08
all_inputs:
- data/frog1_data.txt
required_inputs:
- data/frog1_data.txt
files:
- file
required_files:
- file
yaml_file: pipeline_results/submission/frog_1.yaml
Notice the schema and sample attributes are attached in parallel. This is problem because they could overwrite each other. For example, if the sample had an attribute called files
or required_files
or yaml_file
or prj
or all_inputs
, what would happen?
I would suggest this yaml writer should instead use a sample
or sample_attributes
subsection for the direct sample attributes. This would require changing any downstream pipelines that relied on the current format (which is I think mostly @afrendeiro's pipelines?). Unfortunately this current approach is not really a good model.
Then I'd propose the CWL approach would be to output an additional file that follows the cwl outline, which would only include attributes under the new sample
section (so, no prj
or schema-associated stuff), but would include the CWL handling for File
and Directory
attributes.
yeah, sourcing this information from schema is a natural choice. I don't think there's any other possibility.
As for the implementation of that, we also have little room for manoeuvre. That said, I'd probably make it more generic and instead of cwl_compatible=True
+ list of Directories
and Files
I'd provide some kind of "schema_format_mapping" that would be optional and looked like this:
{
"Directory": {"sample_attr_value_key": "location", "sample_attrs": ["dir1", "dir2", "dir3"]},
"File": {"sample_attr_value_key": "path", "sample_attrs": ["file6", "file4", "file5"]}
}
if this argument is provided, Sample
attributes specified in sample_attrs
are not reported as simple key-value pairs but with a specified class
and value keyed by string specified in sample_attr_value_key
Not sure we need to go to that complexity since this will likely only ever be used for CWL.
Ok this seems to work, with the caveats of #284 which are independent of this issue. I need to make it optional though.
Hi! I see no issues with moving sample attributes to sample
or sample_attributes
in the yaml. It's a small change downstream.
One comment though: I'm not too familiar with CWL but it kind of seems like they're abusing YAML in order to have types as fields :/ Why aren't there types for every attribute (e.g. why isn't my_attr
in the example of class String
or something?). If that's what CWL is, fine we aren't going to change that, but it seems like we're mixing quite some things here.
Why not simply have looper implement a to_cwl(sample: peppy.Sample)
method that saves a .cwl
file in addition to the yaml if the user so wants?
Why not simply have looper implement a to_cwl(sample: peppy.Sample) method that saves a .cwl file in addition to the yaml if the user so wants?
Yes, this is exactly what I suggested above! But actually I made it even better with #285.
t kind of seems like they're abusing YAML in order to have types as fields :/ Why aren't there types for every attribute (e.g. why isn't my_attr in the example of class String or something?). If that's what CWL is, fine we aren't going to change that, but it seems like we're mixing quite some things here.
Well, that's definitely what CWL is, and I agree that it's a bit strange. I do not quite follow the rationale for doing it this way but I'm sure there is one.
Oh I see.
But just to be clear, the cwl_yaml
atrribute would be a path to a file with .cwl
ending right? Even though it is also valid YAML?
So #284 is independent of CWL, its sole purpose is to avoid clash between user-defined attributes and others that are added by tools like looper, correct?
But just to be clear, the cwl_yaml atrribute would be a path to a file with .cwl ending right?
No, it's not a cwl file, it's a yaml file that's used as a job description... the cwl
files refer to tools or workflows, not to yaml inputs. The inputs are actually .yml
files. cwl
+ yml
= workflow run. it's kind of similar to pypiper with sample yaml inputs actually.
So #284 is independent of CWL, its sole purpose is to avoid clash between user-defined attributes and others that are added by tools like looper, correct?
Yes, exactly! -- and also by peppy/eido when it manages the PEP schemas (right now for example I think it's peppy or eido that adds the files
attribute)
Sounds good.
No, it's not a cwl file, it's a yaml file that's used as a job description... the cwl files refer to tools or workflows, not to yaml inputs. The inputs are actually .yml files.
That's strange, but okay. Maybe we can suffix it with "cwl.yaml" or some way to make it clear is the cwl
format?
Here I outlined how looper can be used as a tabular scatterer for CWL workflows: https://github.com/pepkit/pep-cwl. At first I did this with a small hack to looper, but I want to figure out the best way to do it long term.
Right now, when peppy converts a sample object to yaml, it just creates a vary simple key-value mapping:
This actually works as inputs to a CWL run --- except the issue is that CWL specifies two data types that treated specially:
File
andDirectory
. For File and Directory, it expects the YAML look like:So for peppy to produce a CWL-comptabile yaml, we have to slightly tweak the yaml output for anything that the CWL tool expects as a
File
orDirectory
. I just hacked this into looper for the attribute I knew was a file in the example to prove I could do it. How can we do this universally?Well, looper knows which attributes are
files
from the PEP schema. iI the PEP schema could also specify aDirectory
type, then... we could change thepeppy.Sample.to_yaml
function to output any of those defined attributes in this format. I would suggest this be an option toto_yaml
calledcwl_compatible=True
-- because some tools want the raw values (and to me, it actually makes more sense to keep the schema separate from the input data anyway). But this way we can handle both approaches.One potential issue is that
to_yaml
is a peppy function, while eido and looper care about schemas -- so how do we keep peppy from requiring schemas but still allowto_yaml
to produce cwl-compatible output? One answer is that theto_yaml
function could take a list ofFile
attributes and a list ofDirectory
attributes and make the change, and then it's the reponsibility of whoever calls it (in this case it's looper) to provide the information from the schema.