Proposal: introduce structure to sample yaml output

nsheff commented 4 years ago

Originally noticed by @nsheff in https://github.com/pepkit/looper/issues/283#issuecomment-680057864

This is the sample yaml produced by looper for the pep in https://github.com/pepkit/pep-cwl:

sample_name: frog_1
library: anySampleType
file: data/frog1_data.txt
pipeline_interfaces: cwl_interface.yaml
prj:
  pep_version: 2.0.0
  sample_table: /home/nsheff/code/incubator/learn_cwl/cwl-pep/file_list.csv
  sample_modifiers:
    append:
      pipeline_interfaces: cwl_interface.yaml
  looper:
    output_dir: pipeline_results
input_file_size: 2.60770320892334e-08
all_inputs:
- data/frog1_data.txt
required_inputs:
- data/frog1_data.txt
files:
- file
required_files:
- file
yaml_file: pipeline_results/submission/frog_1.yaml

Notice that the schema-derived attributes and the sample attributes are attached in parallel, at the same top level. This is a problem because they could overwrite each other. For example, if the sample had an attribute called files or required_files or yaml_file or prj or all_inputs, what would happen?
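To make the concern concrete, here is a minimal sketch of a flat merge. It is not looper's actual writer; the dictionaries and the user-defined `files` column are made up for illustration.

```python
# Hypothetical sample attributes; assume the user's sample table has a "files" column.
sample_attrs = {"sample_name": "frog_1", "files": "data/frog1_data.txt"}

# Hypothetical schema-derived attributes attached by the writer.
derived_attrs = {"files": ["file"], "required_files": ["file"]}

# A flat merge keeps only one value per key: the later dict wins silently.
flat = {**sample_attrs, **derived_attrs}

# Keys that exist on both sides are the ones at risk.
collisions = sorted(set(sample_attrs) & set(derived_attrs))

print(flat["files"])  # the user's value is gone
print(collisions)
```

Nothing in a flat layout tells a downstream pipeline which of the two values it is reading.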

I would suggest this yaml writer should instead use a sample or sample_attributes subsection for the direct sample attributes. This would require changing any downstream pipelines that rely on the current format (which I think is mostly @afrendeiro's pipelines?). Unfortunately, the current approach is not really a good model.
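A rough sketch of what the subsection idea could look like, assuming plain dictionaries. The function name `to_structured` and the default key `sample` are hypothetical, not existing peppy or looper API.

```python
def to_structured(sample_attrs, derived_attrs, key="sample"):
    """Nest direct sample attributes under `key` so they cannot clash
    with attributes the writer adds at the top level (hypothetical helper)."""
    if key in derived_attrs:
        raise KeyError(f"'{key}' is already used by a derived attribute")
    structured = dict(derived_attrs)      # writer-added attributes stay on top
    structured[key] = dict(sample_attrs)  # user attributes get their own section
    return structured


out = to_structured(
    {"sample_name": "frog_1", "files": "data/frog1_data.txt"},
    {"files": ["file"], "required_files": ["file"]},
)
print(out["files"])            # derived value
print(out["sample"]["files"])  # user value, preserved
```

Downstream pipelines would then read `sample.<attribute>` instead of the top-level key, which is the breaking change mentioned above.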

nsheff commented 3 years ago

@stolarczyk do you have any ideas for how to integrate this into the new plugin system? Would this actually be a peppy issue?

For CWL, we will have to continue to write the sample attributes to the top level -- but maybe we just skip the schema attributes?

The thing is, this change involves the to_yaml command.

I guess the question is: do we need to have the schema attributes appended at the top level? Why are the schema attributes appended at all?

nsheff commented 3 years ago

Here's an idea: what if the default to_yaml just output all the looper variable namespaces? The sample yaml would then be something like:

sample:
  sample_name: blahblah
  ...
project: 
  ...
pipeline: 
  ...
looper:
  ...
compute:
  ...

If the schema is important here, then it would get its own namespace, which may mean the schema should also be available as a separate namespace for the command templates.
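As a sketch of the namespaced layout above, assuming each namespace is a plain dictionary: the helper `build_namespaced_doc` is hypothetical, and serializing the result to YAML is left to whatever YAML library is in use.

```python
# Namespace names taken from the layout above, in the order they would be written.
NAMESPACES = ("sample", "project", "pipeline", "looper", "compute")


def build_namespaced_doc(**namespaces):
    """Return one top-level section per namespace (hypothetical helper).
    Namespaces that are not supplied are written as empty sections."""
    unknown = sorted(set(namespaces) - set(NAMESPACES))
    if unknown:
        raise ValueError(f"unknown namespaces: {unknown}")
    return {name: dict(namespaces.get(name, {})) for name in NAMESPACES}


doc = build_namespaced_doc(
    sample={"sample_name": "frog_1", "files": "data/frog1_data.txt"},
    looper={"output_dir": "pipeline_results"},
)
print(list(doc))
print(doc["sample"]["files"])
```

Because every attribute lives under exactly one namespace, a sample attribute named `files` or `prj` can no longer shadow anything else. A `schema` namespace, if needed, would just be one more entry in `NAMESPACES`.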

nsheff commented 3 years ago

In this approach, we'd adjust the Sample.to_yaml method and maybe change its name/location. It would provide a yaml-ish thing that has sample and project subcomponents; looper would add the looper components to these and have its own to_yaml function.

nsheff commented 3 years ago

Ok, here's what we decided to do:

the sample yaml should not include input schema stuff. it should really be a sample yaml: https://github.com/pepkit/peppy/issues/356
peppy needs the ability to write yaml either with or without the project embedded: https://github.com/pepkit/peppy/issues/355
two looper functions wrap each of the peppy sample to_yaml function versions (one with prj embedded and one without); these functions can serve as plugin functions. https://github.com/pepkit/looper/issues/299
looper will no longer directly call the 'to yaml' function on peppy; it will call it via these plugin functions https://github.com/pepkit/looper/issues/299
looper plugin functions will have to define how they want their output file location specified, just as the submission object one does. https://github.com/pepkit/looper/issues/299
we should add a page documenting all these builtin looper plugin functions and how to parameterize them https://github.com/pepkit/looper/issues/299
a new plugin for the whole shebang. https://github.com/pepkit/looper/issues/298

pepkit / looper

Proposal: introduce structure to sample yaml output #284