pepkit / looper

A job submitter for Portable Encapsulated Projects
http://looper.databio.org
BSD 2-Clause "Simplified" License

Proposal: introduce structure to sample yaml output #284

Closed nsheff closed 3 years ago

nsheff commented 4 years ago

Originally noticed by @nsheff in https://github.com/pepkit/looper/issues/283#issuecomment-680057864

This is the sample yaml produced by looper for the pep in https://github.com/pepkit/pep-cwl:

sample_name: frog_1
library: anySampleType
file: data/frog1_data.txt
pipeline_interfaces: cwl_interface.yaml
prj:
  pep_version: 2.0.0
  sample_table: /home/nsheff/code/incubator/learn_cwl/cwl-pep/file_list.csv
  sample_modifiers:
    append:
      pipeline_interfaces: cwl_interface.yaml
  looper:
    output_dir: pipeline_results
input_file_size: 2.60770320892334e-08
all_inputs:
- data/frog1_data.txt
required_inputs:
- data/frog1_data.txt
files:
- file
required_files:
- file
yaml_file: pipeline_results/submission/frog_1.yaml

Notice that the schema attributes and the sample attributes are attached in parallel, at the same top level. This is a problem because they could overwrite each other. For example, if the sample had an attribute called files, required_files, yaml_file, prj, or all_inputs, what would happen?
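A minimal sketch of the collision concern, using plain dictionaries. This is not looper's actual code: the `find_collisions` helper, the user-defined `files` attribute, and the merge order (derived keys written last) are all assumptions made for illustration.

```python
# Hypothetical data modelled on the frog_1 example above; not looper's real code.
sample_attrs = {
    "sample_name": "frog_1",
    "file": "data/frog1_data.txt",
    "files": "user-defined value",  # a user attribute that shares a name with a derived key
}
derived_attrs = {
    "files": ["file"],
    "required_files": ["file"],
    "yaml_file": "pipeline_results/submission/frog_1.yaml",
}


def find_collisions(sample, derived):
    """Return the key names present in both mappings, sorted."""
    return sorted(set(sample) & set(derived))


# Assumed merge order: derived keys are written last, so they silently win.
flat = {**sample_attrs, **derived_attrs}

collisions = find_collisions(sample_attrs, derived_attrs)
print(collisions)     # key names that exist on both sides
print(flat["files"])  # the user's value is gone after the flat merge
```

Whichever side is written last wins in a flat layout, so one of the two values is always lost without any warning.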

I suggest that the yaml writer instead use a sample or sample_attributes subsection for the direct sample attributes. This would require changing any downstream pipelines that rely on the current format (which I think are mostly @afrendeiro's pipelines?). Unfortunately, the current approach is not really a good model.

nsheff commented 3 years ago

@stolarczyk do you have any ideas for how to integrate this into the new plugin system? Would this actually be a peppy issue?

For CWL, we will have to continue to write the sample attributes to the top level -- but maybe we just skip the schema attributes?

The thing is, this change involves the to_yaml command.

I guess the question is: do we need to have the schema attributes appended at the top level? Why are the schema attributes appended at all?

nsheff commented 3 years ago

Here's an idea: what if the default to_yaml just output all the looper variable namespaces? The sample yaml would then be something like:

sample:
  sample_name: blahblah
  ...
project: 
  ...
pipeline: 
  ...
looper:
  ...
compute:
  ...

If the schema is important here, then it would be a separate namespace and maybe that would mean the schema should be a separate namespace available for the command templates as well.
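As a rough sketch of the namespaced layout (not looper's implementation), the record could be assembled as a nested dictionary before serialization. The function name `build_namespaced_record` and the sample attribute named `looper` are hypothetical, chosen only to show that a sample attribute can no longer clash with a top-level section.

```python
def build_namespaced_record(sample, project, pipeline, looper, compute):
    """Hypothetical helper: place each namespace under its own top-level key."""
    return {
        "sample": dict(sample),
        "project": dict(project),
        "pipeline": dict(pipeline),
        "looper": dict(looper),
        "compute": dict(compute),
    }


record = build_namespaced_record(
    # A sample attribute that happens to be called "looper" (illustrative only).
    sample={"sample_name": "frog_1", "looper": "user-defined value"},
    project={"pep_version": "2.0.0"},
    pipeline={},
    looper={"output_dir": "pipeline_results"},
    compute={},
)

# Both values survive because they live in different namespaces.
print(record["sample"]["looper"])
print(record["looper"]["output_dir"])
```

Passing `record` to a YAML serializer such as PyYAML's `yaml.safe_dump` would then give a file with the sections listed above.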

nsheff commented 3 years ago

In this approach, we'd adjust the Sample.to_yaml method and maybe change its name/location. It would provide a yaml-ish thing that has sample and project subcomponents; looper would add its own looper components to these and have its own to_yaml function.
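The two-step layering could look roughly like the sketch below. Both function names (`sample_record`, `add_looper_namespaces`) are hypothetical, and the refusal to overwrite an existing namespace is an added assumption rather than something decided in this thread.

```python
def sample_record(sample_attrs, project_config):
    """Hypothetical sample-side step: return only the sample and project namespaces."""
    return {"sample": dict(sample_attrs), "project": dict(project_config)}


def add_looper_namespaces(record, **namespaces):
    """Hypothetical looper-side step: add namespaces, never overwriting existing ones."""
    clashes = sorted(set(record) & set(namespaces))
    if clashes:
        raise KeyError(f"namespace(s) already present: {clashes}")
    # Build a new dict so the sample-side record is left untouched.
    return {**record, **namespaces}


base = sample_record({"sample_name": "frog_1"}, {"pep_version": "2.0.0"})
full = add_looper_namespaces(
    base,
    looper={"output_dir": "pipeline_results"},
    pipeline={},
    compute={},
)
print(list(full))
```

Keeping the two steps separate means the sample-side record stays usable on its own, while looper decides which extra namespaces to attach before writing the file.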

nsheff commented 3 years ago

Ok, here's what we determined to do: