pepkit / looper

A job submitter for Portable Encapsulated Projects
http://looper.databio.org
BSD 2-Clause "Simplified" License

A plugin system for pre-submission hooks to modify samples and change yaml representations #285

Closed: nsheff closed this issue 3 years ago

nsheff commented 4 years ago

Related to #284 and #283.

We originally wrote a sample yaml file. Now I think that format should change (#284). And, we're wanting to write a slightly tweaked version for CWL inputs (#283). What if there's other stuff we want to do with the sample? Well -- what if we instead implemented a plugin system?

looper, before job submission, would call a pre_submit() hook function, where we currently run the to_yaml process. Users could write Python packages providing functions to be called by pre_submit. Plugin functions must:

  1. take as input exactly a Sample object.
  2. return as output exactly a (potentially modified) Sample object.
  3. optionally, do whatever else they want (like produce a file).

For example, a cwl plugin would be a package called looper_cwl, which provides a function called add_cwl_yaml(). Then, in the pipeline interface, you'd specify which plugin functions to run with pre_submit_plugin, like this:

pipeline_name: count_lines
pipeline_type: sample
input_schema: input_schema.yaml
path: wc-tool.cwl  # relative to this pipeline_interface.yaml file
pre_submit_plugin: looper_cwl.add_cwl_yaml
command_template: >
  cwl-runner {pipeline.path} {sample.cwl_yaml}

In this case, the cwl plugin would provide a new attribute, called cwl_yaml, so you can use {sample.cwl_yaml} in the command template as shown. This function would also write the cwl-version yaml to the file that would be specified by the .cwl_yaml attribute.
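
A minimal sketch of what such a plugin function could look like, following the contract above (the path logic and the exact tweaks for CWL inputs are hypothetical; sample.to_yaml is the existing peppy method):

import os

def add_cwl_yaml(sample):
    """Hypothetical plugin: attach and write a CWL-input yaml for the sample."""
    # derive an output path and expose it so {sample.cwl_yaml} works in templates
    sample.cwl_yaml = os.path.join("submission", sample.sample_name + "_cwl.yaml")
    # a real plugin would tweak the yaml representation for CWL before writing
    sample.to_yaml(sample.cwl_yaml)
    return sample  # must return the (potentially modified) Sample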

We could start by implementing:

nsheff commented 4 years ago

Ok, I've implemented this. I really like this idea. I have only one complaint, which I raised in #288.

With this approach, you'd use this kind of a pipeline interface for a CWL pipeline:

pipeline_name: count_lines
pipeline_type: sample
input_schema: input_schema.yaml
path: wc-tool.cwl  # relative to this pipeline_interface.yaml file
pre_submit:
  - looper.add_cwl_yaml
command_template: >
  cwl-runner {pipeline.path} {sample.cwl_yaml}

and you'd use this kind of approach for an @afrendeiro sample yaml pipeline:

pipeline_name: count_lines
pipeline_type: sample
input_schema: input_schema.yaml
path: wc-tool.cwl  # relative to this pipeline_interface.yaml file
pre_submit:
  - looper.add_basic_yaml
command_template: >
  cwl-runner {pipeline.path} {sample.yaml_file}

The only question is: should the basic yaml approach be built-in, or should we require you to specify this pre_submit? An argument for requiring it to be explicit is that, for pipelines that don't need the sample yaml written, it costs disk space and time to write all those files.

An argument for doing it anyway: maybe it's nice to have those yaml files as a record of what was run.

At the moment I'm leaning toward following the 'explicit is better than implicit' rule-of-thumb, but I could be persuaded otherwise.

Thoughts?

nsheff commented 4 years ago

@afrendeiro I'd like to know if you have any objections to requiring the pipeline interface to include pre_submit: looper.add_basic_yaml for any pipelines that want the sample yaml.

Or, should we simply make that run all the time?

afrendeiro commented 4 years ago

No objections, but I would prefer it done automatically. Could it be inferred from the use of "{sample.yaml_file}" in the command_template value? Or would it be better to add the sample_yaml_path key as described here? http://looper.databio.org/en/latest/pipeline-interface-specification/#sample_yaml_path

nsheff commented 4 years ago

Yeah, that's better for backwards compatibility, but in that case, would there be a way to turn it off? Or would it always produce that file, and then it could just produce others as well?

Could it be inferred by the use of "{sample.yaml_file}" in the command_template value?

Hmm...possibly, but there's nothing tying those together, really. I don't really like that idea.

Or probably better if adding the sample_yaml_path key

Well, I had proposed a revision that would generalize that concept here: https://github.com/pepkit/looper/issues/288#issuecomment-681060642. What do you think of that?

I guess the issue is that this particular key, if made generic, would just be a value to be used.

I'm just trying to make the thing generic and modular, so it's easier to accommodate this functionality as "just another way" rather than as a special case, I guess.

What about just making it so that if there is no entry in pre_submit, it assumes you want looper.add_basic_yaml, but if there is one, you'd have to add it in to get it?

afrendeiro commented 4 years ago

You're right, it should be a general thing. I warmed up to the idea of pre_submit. I'm fine with writing the yaml only with a pre_submit section.

Are there other things that would go under it? I guess the CWL, right? Like it could be a list with functions to be called, kind of like a before_script section in a .travis.yml?

Although the "basic" part is a bit weird. Maybe something like "write_sample_yaml". In the end, could it maybe be something like this?

pre_submit:
  - looper.write_sample_yaml
  - looper.write_sample_cwl

nsheff commented 4 years ago

What you wrote is almost exactly how it is now! You just changed the names of the built-in functions. I just meant "basic" was the built-in yaml; the cwl one is also a yaml. I think you're confused by the terminology there; it's a yaml file that's used as input to parameterize a CWL pipeline -- it's not a cwl file. So I might do write_sample_cwl_yaml. But anyway, yes.

pre_submit:
  - looper.write_sample_yaml
  - looper.write_sample_cwl

as built-in options makes sense to me. Then, users can write their own stuff and add it like:

pre_submit:
  - my_package.my_function
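
Resolving a dotted name like that at runtime could look roughly like this (a sketch, not necessarily how looper implements it):

import importlib

def resolve_plugin(dotted_name):
    """Turn a dotted name like 'my_package.my_function' into the function it names."""
    module_name, func_name = dotted_name.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), func_name)

# e.g. apply a hook to a sample, keeping the returned Sample
# hook = resolve_plugin("looper.write_sample_yaml")
# sample = hook(sample)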

The downside here is that existing tools that require the sample yamls will have to update the pipeline interface to be compatible with the new looper version. So, we break backwards compatibility.

The remaining questions are:

afrendeiro commented 4 years ago

Ah sorry I didn't see the example above again... Sorry I'm distracted :/

The downside here is that existing tools that require the sample yamls will have to update the pipeline interface to be compatible with the new looper version. So, we break backwards compatibility.

Is this just open_pipelines or do you know of something else using it?

should we keep looper.write_sample_yaml as a default (if no pre_submit section specified) for a few cycles to maintain backwards compatibility,

Yeah, that makes sense to me. But if it is really just open_pipelines, we can coordinate and drop it right away, since that is undergoing some maintenance now anyway.

I'm okay with pre_submit as a key name.

nsheff commented 4 years ago

Is this just open_pipelines or do you know of something else using it?

I think I'm using it in a few cases. Can't remember exactly though...

nsheff commented 4 years ago

Should it be possible to put a script in there? Maybe instead of a list of python functions, we make it operate the same way dynamic_variables_command_template works? We had originally thought about implementing that as python functions, like this. The difference is just that we envisioned the inputs to those functions as being very simple, like a number or a string, whereas the inputs to these pre_submit hooks are the entire sample object -- and maybe even also the other variable namespace objects. Still, the similarity between these two things is interesting, to the point that I'm wondering if there's something even more general possible here.

For example, the dynamic_variables_command_template essentially takes some sample attributes as input and modifies the compute namespace before submitting.

The write_sample_yaml and write_sample_cwl functions take sample attributes and write an output file before submitting.

If we made the pre_submit idea more general than just taking the sample as input, and allowed it to actually modify the variables, then the dynamic_variables_command_template job could also be accomplished with the pre_submit machinery.

afrendeiro commented 4 years ago

Yeah, I guess compute variables could be populated this way, if there are functions that take the sample as input and add those attributes. I'm assuming the called functions are allowed to modify the sample object, right? (I don't even think there's a mechanism to prevent that in Python, since everything is mutable by default.)

nsheff commented 4 years ago

Well, if it's happening as a script rather than a function, then it can't. This is already how it works with the dynamic compute variables: the script prints out variables, which are then read in by looper and used to modify the compute variables.
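
For example, the script side of that contract can just print JSON to stdout, and the looper side runs the rendered template and parses the output (a sketch of the mechanism; the function name is hypothetical):

import json
import subprocess

def run_pre_submit_script(rendered_command):
    """Run a rendered pre_submit command and parse its stdout as JSON."""
    out = subprocess.check_output(rendered_command, shell=True, text=True)
    return json.loads(out)  # e.g. {"mem": "6000"}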

afrendeiro commented 4 years ago

Oh I see, I wasn't familiar with that. Well, that script could return a YAML/JSON version of the sample too, I guess. Probably overkill though.

nsheff commented 4 years ago

Two ideas for how to implement this that check all these boxes:

Option 1

The pre_submit section has a subsection for each looper variable namespace

pre_submit:
  project:
    python_functions:
      - ...
    command_template: null
  sample:
    python_functions:
      - ...
    command_template:
      - ...
  pipeline:
    python_functions:
      - ...
    command_template:
      - ...
  looper:
    python_functions:
      - ...
    command_template:
      - ...
  compute:
    python_functions:
      - ...
    command_template:
      - ...

So, for example:

pipeline_name: count_lines
pipeline_type: sample
input_schema: input_schema.yaml
path: wc-tool.cwl  # relative to this pipeline_interface.yaml file
pre_submit:
  sample:
    python_function:
      - looper.write_sample_yaml
    command_template:
      - dir/build-sample-params.py -a {sample.attr_a} -b {sample.attr_b}
  compute:
    command_template:
      - build-compute-params.py -a {sample.asset}
command_template: >
  cwl-runner {pipeline.path} {sample.yaml_file}

The compute.command_template would supplant the dynamic variables command template functionality.

It returns:

{
    "mem": "6000"
}

You can use these sections to modify any of the looper namespaces.
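
Dispatching Option 1 could look roughly like this (a sketch, assuming each subsection lists python_functions to run against its own namespace; command_template handling is omitted):

import importlib

def apply_pre_submit_option1(pre_submit, namespaces):
    """Run each namespace subsection's Python hooks against that namespace."""
    for ns_name in ("project", "sample", "pipeline", "looper", "compute"):
        section = pre_submit.get(ns_name) or {}
        for dotted in section.get("python_functions") or []:
            mod, _, func = dotted.rpartition(".")
            hook = getattr(importlib.import_module(mod), func)
            # each hook takes and returns only the namespace it is scoped to
            namespaces[ns_name] = hook(namespaces[ns_name])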

Option 2

There are no sub-sections for namespaces, but all namespaces are included in a dict.

pre_submit:
  python_functions:
    - ...
  command_template:
    - ...

example:

pipeline_name: count_lines
pipeline_type: sample
input_schema: input_schema.yaml
path: wc-tool.cwl  # relative to this pipeline_interface.yaml file
pre_submit:
    python_function:
      - looper.write_sample_yaml
    command_template:
      - dir/build-sample-params.py -a {sample.attr_a} -b {sample.attr_b}
      - build-compute-params.py -a {sample.asset}
command_template: >
  cwl-runner {pipeline.path} {sample.yaml_file}

In this case the external script would return:

{
    "sample": {},
    "compute": {
        "mem": "6000"
    }
}

The function would have to return a dict with a Sample object as one item, etc. This second option is more universal; it allows you to use information from all namespaces to modify any of them, rather than dividing it out. But the objects passed to and from are a bit heavier. On the other hand, they're more explicit.

nsheff commented 4 years ago

Currently the python_functions are:

def write_sample_yaml(sample, subcon=None):
    """
    Produce a complete, basic yaml representation of the sample

    :param peppy.Sample sample: A sample object
    :param subcon: the submission conductor, used to derive the output path
    """
    _LOGGER.info("Calling write_sample_yaml plugin.")
    sample.to_yaml(subcon._get_sample_yaml_path(sample))
    return sample

For the more universal version we'd do something like:

def write_sample_yaml(sample, compute, looper, pipeline):
    _LOGGER.info("Calling write_sample_yaml plugin.")
    # with this signature the output path would be derived from the passed-in
    # namespaces, since no submission conductor is available here
    sample.to_yaml(get_sample_yaml_path(sample, looper))  # hypothetical helper

    return {
        "sample": sample,
        "compute": compute,
        "looper": looper,
        "pipeline": pipeline,
    }

Or maybe:

def write_sample_yaml(namespaces):
    """
    Produce a complete, basic yaml representation of the sample

    :param dict namespaces: A dict with namespaces: sample, pipeline, looper, compute
    """
    sample = namespaces["sample"]
    compute = namespaces["compute"]
    looper = namespaces["looper"]
    pipeline = namespaces["pipeline"]

    _LOGGER.info("Calling write_sample_yaml plugin.")
    # again, the path would be built from the namespaces rather than a conductor
    sample.to_yaml(get_sample_yaml_path(sample, looper))  # hypothetical helper

    return {
        "sample": sample,
        "compute": compute,
        "looper": looper,
        "pipeline": pipeline,
    }

stolarczyk commented 4 years ago

I like the idea of a single pre_submit section taking care of all the namespace management!

As for implementation/format selection:

From this comment I'd prefer "Option 2", for the reasons you named there. And I don't think the script from command_template would need to return empty dicts for namespaces that are not to be modified; did you put "sample": {} there on purpose?

From this comment I don't have a preference, but again, I like that all the namespace handling could happen in a single hook function.

So, if we implemented that, would the problem we talked about in #288 go away? The target YAML path for the sample can be created from any namespace attributes in a hook function.

nsheff commented 4 years ago

I don't think the script from command_template would need to return empty dicts for namespaces that are not to be modified,

I agree -- I didn't mean it would return an empty dict; it was just a placeholder. Both the function and the script would only return whatever they wanted to update. Null/missing entries would simply not be updated.

I like that all the namespace handling could happen in a single hook function.

Yeah. For these functions, the advantage of the signature being (sample, compute, pipeline, looper) is that you have everything directly right there. The disadvantage is that if we change it in the future, those external functions would lose compatibility. The function(namespaces_dict) signature is more universal, I guess.

So, if we implemented that, the problem we talked about in #288 goes away?

I think it changes but doesn't get solved completely; I still think we'd want to write a built-in function that accepted some sample attributes and gave a decent path, or something. With this new approach, I think it becomes easier because those functions will now have access to all the namespace variables directly; that's what you mean, right?

stolarczyk commented 4 years ago

I'm trying to implement this and found a potential problem: there's no easy way to make the scripts used in piface.pre_submit.command_template portable. For piface.compute.dynamic_variables_command_template we solved it by specifying an additional key, piface.compute.dynamic_variables_script_path, that is made relative to the pipeline interface and can be used in the command template. Since in this case there can be multiple templates, we can't follow the same scheme. Any ideas on how to solve it?

Should we allow only one command? Or do this:

pre_submit:
    my_path1: dir/build-sample-params.py
    path2: build-compute-params.py
    python_function:
      - looper.write_sample_yaml
    command_template:
      - "{pipeline.pre_submit.my_path1} -a {sample.attr_a} -b {sample.attr_b}"
      - "{pipeline.pre_submit.path2} -a {sample.asset}"

We would need to assume that every str attribute in pre_submit is a path that should be made relative to the piface, or add another attribute that would list such attributes:

pre_submit:
    script_paths: [my_path1, path2]
    my_path1: dir/build-sample-params.py
    path2: build-compute-params.py
    python_function:
      - looper.write_sample_yaml
    command_template:
      - "{pipeline.pre_submit.my_path1} -a {sample.attr_a} -b {sample.attr_b}"
      - "{pipeline.pre_submit.path2} -a {sample.asset}"

That gets a bit complex, and it feels like the "one command only" option makes the most sense... IDK.

nsheff commented 4 years ago

Should we add a new relative_paths section for general use? Because this is essentially the same thing as the top-level path key we allow now, right? Should we gather all of those into one place, which would be available via the pipeline namespace? This is close to what you're doing already, but they can currently be defined in multiple places...

So I guess instead of this:

path: path/to/main/script.py
command_template: "{pipeline.path} ..."
pre_submit:
    script_paths: [my_path1, path2]
    my_path1: dir/build-sample-params.py
    path2: build-compute-params.py
    python_function:
      - looper.write_sample_yaml
    command_template:
      - "{pipeline.pre_submit.my_path1} -a {sample.attr_a} -b {sample.attr_b}"
      - "{pipeline.pre_submit.path2} -a {sample.asset}"

I'm proposing this:

script_paths:
  main_path: path/to/main/script.py
  my_path1: dir/build-sample-params.py
  path2: build-compute-params.py
command_template: "{pipeline.main_path} ..."
pre_submit:
    python_function:
      - looper.write_sample_yaml
    command_template:
      - "{pipeline.my_path1} -a {sample.attr_a} -b {sample.attr_b}"
      - "{pipeline.path2} -a {sample.asset}"

stolarczyk commented 4 years ago

Good thought. Maybe we can make it even more versatile and name this section "paths" -- this might be useful beyond scripts. For example, some configuration or sample input files could be shipped with pipelines and embedded in command templates.

nsheff commented 4 years ago

Yeah, I wasn't sold on script_paths. I also thought of relative_paths. But paths is fine. So in other words, our current path keyword just becomes paths and now accepts a list instead of a single path.

I guess it would be more straightforward than my original proposal if you referred to them as pipeline.paths.path_key rather than just pipeline.path_key -- the latter would allow for clashes with other top-level variables. What do you think?

stolarczyk commented 4 years ago

So in other word, our current path keyword just becomes paths and now accepts a list instead of a single path.

Actually, as of now it's not a YAML list but a mapping:

paths:
  key: x

And thanks to that, we can refer to these paths with {pipeline.paths.<key>}. Therefore, all pifaces that we currently have will need to update the pipeline script path source, unless we support the top-level path key as well.

It's already implemented and seems to work, you can give it a try.

nsheff commented 4 years ago

All pifaces that we currently have will need to update the pipeline script path source, unless we support the top-level path key as well.

Well, we could deprecate it; so, provide a warning but allow it for now? I think the same path should be followed for dynamic_variables_command_template, which will be superseded.

It's already implemented and seems to work, you can give it a try.

Wow, great, thanks! Are you referring to just the paths component, or also to the pre_submit proposal with functions and templates? I think my use case for the former is really going to be in a use of the latter...

stolarczyk commented 4 years ago

Well, we could deprecate it; so, provide a warning but allow it for now? I think the same path should be followed for the dynamic_variables_command_template, which will be superseded.

ok, agreed

are you referring to just the paths component, or also to the pre_submit proposal

Everything this issue talks about, so you can do:

pipeline_name: PIPELINE1
pipeline_type: sample
paths:
  main: pipelines/pipeline1.py
  hook1: pipelines/hook1.py
input_schema: https://schema.databio.org/pep/2.0.0.yaml
pre_submit:
    python_function:
      - looper.write_sample_yaml_basic
    command_template:
      - "{pipeline.paths.hook1} --sample-name {sample.sample_name} --attr 10"

command_template: >
  {pipeline.paths.main} --sample-name {sample.sample_name} --genome {sample.genome}

nsheff commented 4 years ago

Ok great! The PEP+CWL example is the best use case for the python function pre-submit, so I'll give it a try there.

And I'm using the dynamic variables on plantref, so I'll try to convert that one to this new form. Luckily, I don't think there are many other examples of using the dynamic variables -- maybe @jpsmith5 used them in pepatac/peppro?

jpsmith5 commented 4 years ago

No, I don't use them currently, so it should be okay to adjust without bumping up against anything I've done.

nsheff commented 4 years ago

@stolarczyk I noticed: it seems that you're requiring the return value to be the whole set of namespaces, like this:

    return_value = {
        "sample": sample,
        "compute": compute,
        "looper": looper,
        "pipeline": pipeline,
        "project": project
    }
    return return_value

I'd rather just only have to return ones I'm updating. For example, in this example I'm only updating sample so I should be fine with:

    return {"sample" : sample }

But when you change to that, you get an error:

## [2 of 2] sample: frog_2; pipeline: count_lines
Calling pre-submit function: looper.write_sample_yaml_basic
Missing sample, project or pipeline attributes required by command template: 'cwl-runner {pipeline.paths.main} {sample.yaml_file}
{%- if sample.command_extra is defined %} {sample.command_extra}   {% endif -%}'
> Not submitted: 'pipeline' is undefined

Almost like it's only populating the sample namespace if it's run through the function.

nsheff commented 4 years ago

I just pushed an update with an example. You can try this:

pipeline_name: count_lines
pipeline_type: sample
input_schema: input_schema.yaml
paths:
  main: wc-tool.cwl  # relative to this pipeline_interface.yaml file
pre_submit:
  python_function:
  - looper.write_cwl_yaml
command_template: >
  cwl-runner {pipeline.paths.main} {sample.cwl_yaml}

I changed write_sample_yaml_basic to just return {"sample" : sample }, which I think should work, but doesn't. Also, I extracted some code into a new function: get_sample_yaml_path -- where do you think this should live? We should export it so that others can use it, I guess. Maybe with a param for filename or something.

stolarczyk commented 4 years ago

You're right, this is how it works, and that's what I wrote above:

python_functions take a dict with all namespaces as an input and need to return an updated one

I thought it wouldn't be a pain to return the entire namespaces dict, since it's the input to the hook function. But I don't feel strongly about it.

nsheff commented 4 years ago

Well, I think users will commonly not be aware of all these namespaces, so it will be simpler if they can return only the things that they updated. This is what I meant when I wrote:

Both the function and the script would only return whatever they wanted to update. Null/missing entries would simply not be updated.

That seems clearer to me, if it's not harder to implement. Do you see any disadvantages?

stolarczyk commented 4 years ago

That seems clearer to me, if it's not harder to implement

It's a bit more complex, since otherwise the namespaces just become whatever is returned from the hook function. But I implemented it the other way for pre_submit.command_template anyway, so we will just need to abstract that away and reuse it.

do you see any disadvantages?

I don't see any disadvantages. Conversely, I see an advantage in the same-output-type requirement.

nsheff commented 4 years ago

Ok, sounds good. I was thinking it might be as simple as calling the update() functions instead of just replacing the objects. Or, you could only replace the objects if the entry exists, I guess.
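
That partial-update behavior could be a small merge step (a sketch of the idea; it assumes the namespace objects expose a dict-like update()):

def merge_hook_result(namespaces, returned):
    """Apply only the entries a hook actually returned; leave the rest as-is."""
    for name, value in (returned or {}).items():
        if value is not None:
            # update in place, per the suggestion above; alternatively,
            # replace the object outright: namespaces[name] = value
            namespaces[name].update(value)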

nsheff commented 4 years ago

Just realized in writing the docs... the new paths section is nice for recording multiple possibilities for script paths, but how are we going to handle the custom sample_yaml_path key? Right now this is used to specify where the sample yaml is put, but since we've made this flexible, it has to have some different mechanism... this represents additional information that is passed to the plugin functions.

It's not the same thing as paths... we have two different kinds of keys here:

- paths
- sample_yaml_path

Do we need to generalize sample_yaml_path the same way we generalized path? So you'd have something like output_paths:

pipeline_name: PIPELINE1
pipeline_type: sample
paths:
  main: pipelines/pipeline1.py
  hook1: pipelines/hook1.py
output_paths:
  sample_yaml_path: "{looper.output_dir}/{sample.sample_name}.yaml"
  some_other_path: "{sample.sample_name}.txt"
  sample_yaml_cwl: "{sample.sample_name}_cwl.yaml"
input_schema: https://schema.databio.org/pep/2.0.0.yaml
pre_submit:
    python_function:
      - looper.write_sample_yaml_basic
    command_template:
      - "{pipeline.paths.hook1} --sample-name {sample.sample_name} --attr 10"

command_template: >
  {pipeline.paths.main} --sample-name {sample.sample_name} --genome {sample.genome}

Then you can refer to those variables from within your plugin function with pipeline.output_paths.sample_yaml_cwl.

Then a follow-up question: should paths itself be split into input and output subsections, like this?

paths:
  input:
    main: pipelines/pipeline1.py
  output:
    sample_yaml_path: "{sample.sample_name}.yaml"

An alternative idea: should we eliminate the implied relative interpretation and simply force you to state the relative folder if you want to use it? They are already available in some namespace variables: looper.output_dir for the output paths, and I guess we could add something like looper.piface_dir or something. The advantages of this are 1) things are explicit instead of implicit; and 2) there's no distinction between "input paths" and "output paths", so we'd be back again to just the paths variable.

Yet a third idea is to just keep it how it is, where paths are relative to the pipeline interface unless specified otherwise...and then just include a folder variable for any output paths, like looper.output_dir/blahblahblah.

thoughts?

stolarczyk commented 4 years ago

I thought that the point of the plugin functions is, among other things, to build the path to the sample yaml and write it, since all the namespaces are available as input. What is the advantage of creating the path in pipeline.paths.output and using it in the plugin function over just creating it in the plugin function?

nsheff commented 4 years ago

That it gives control to the pipeline runner, I suppose, rather than the pipeline author.

In other words, if I'm just using a pipeline, I may want things to be put in a different output space. I shouldn't have to write code to do that; it should be configurable. That's why we originally made the sample output path configurable, right?

Right now that's no longer configurable, right?

stolarczyk commented 4 years ago

Oh I see... I considered the plugin creation the new "configuration".

I think this might be a source of confusion and potential problems. As a user, I still need to go over the plugin code and make sure it uses the path I'm expecting it to (pipeline.paths.output.sample_yaml_path) and not some other path, or even a custom one created in the plugin function. So I guess what I'm trying to say is that the information you put in the piface could be modified without the user's knowledge.

nsheff commented 4 years ago

Yeah, that makes some sense. I guess I see the plugins as being an intermediate step. They're not quite "built-in" (although the current ones are in fact built-in), but not quite as configurable as just tweaking a piface... so I don't expect there to be tons of plugins, and I figure they should be reused.

The plugin author would have to say "to use this plugin, you must/may provide the following path or output_path variables: ...". Or, "this plugin can be configured in the following way: ...".

Essentially, we make the plugins configurable.

nsheff commented 4 years ago

Actually, this is basically already how it is... the plugin has access to pipeline.attr, so the plugin author can say "you have to set pipeline.attr to use this plugin". So, it's already like that.

The question here is just: do we need a special category for output file attributes that are assumed relative? Or do we need to provide a special class of attributes that are populated from templates by the namespace variables? This is what we currently lack.

stolarczyk commented 4 years ago

I think I like this option the most:

An alternative idea: should we eliminate the implied relative interpretation and simply force you to state the relative folder if you want to use it? They are already available in some namespace variables: looper.output_dir for the output paths, and I guess we could add something like looper.piface_dir or something. The advantages of this are 1) things are explicit instead of implicit; and 2) there's no distinction between "input paths" and "output paths", so we'd be back again to just the paths variable.

It's both the least confusing to use and seems easiest to implement; we'd just need to allow the values under paths to use namespaces, right?
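
Rendering those single-brace values could reuse the same templating machinery as the command templates; with jinja2 that is just a matter of custom delimiters (a sketch, not looper's actual configuration):

from jinja2 import Environment

env = Environment(variable_start_string="{", variable_end_string="}")
tpl = env.from_string("{looper.piface_dir}/pipelines/pipeline1.py")

# any attribute-accessible object works as a namespace here
class NS(dict):
    __getattr__ = dict.get

print(tpl.render(looper=NS(piface_dir="/home/user/project")))
# -> /home/user/project/pipelines/pipeline1.py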

stolarczyk commented 3 years ago

I've just implemented the approach to solve this from the comment above, and now I was able to do:

pipeline_name: PIPELINE1
pipeline_type: sample
paths:
  main: "{looper.piface_dir}/pipelines/pipeline1.py"
  hook1: "{looper.piface_dir}/pipelines/hook1.py"
input_schema: https://schema.databio.org/pep/2.0.0.yaml
pre_submit:
    python_function:
      - looper.write_sample_yaml
    command_template:
      - "{pipeline.paths.hook1} --sample-name {sample.sample_name} --attr 10"

command_template: >
  {pipeline.paths.main} --sample-name {sample.sample_name} --genome {sample.genome}

nsheff commented 3 years ago

Nice! Just curious, is it required to have those template values in quotes?

stolarczyk commented 3 years ago

Yes, when the value starts with curly braces. Otherwise it would be parsed as a mapping.
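
A quick way to see that with pyyaml (illustrative):

import yaml

# quoted: parsed as a plain string, as intended
yaml.safe_load('main: "{looper.piface_dir}/pipelines/pipeline1.py"')
# -> {'main': '{looper.piface_dir}/pipelines/pipeline1.py'}

# unquoted: the leading brace opens a YAML flow mapping, so parsing fails
yaml.safe_load('main: {looper.piface_dir}/pipelines/pipeline1.py')
# -> raises a yaml parser error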

nsheff commented 3 years ago

Ok, so this is what we have.

old way:

pipeline_name: refgenie
pipeline_type: sample
input_schema: build_inschema.yaml
compute:
  dynamic_variables_command_template: pipeline_interfaces/build-compute-params.py -a {sample.asset} -g {sample.genome}
command_template: >
  refgenie build { sample.genome }/{ sample.asset }
  {% if sample.genome_description is defined %}--genome-description '{ sample.genome_description }'{% endif %}
  {%- if sample.input_value is defined -%}
    {%- for it in sample.input_type -%}
      {%- if it == 'files' -%}
        {' --' + it + ' ' + sample.input_id[loop.index0] + '=' + sample.local_input + sample.input_id[loop.index0]}
      {%- else -%}
        {' --' + it + ' ' + sample.input_id[loop.index0] + '=' + sample.input_value[loop.index0]}
      {%- endif -%}
    {%- endfor -%}
  {%- endif -%}

new way:

pipeline_name: refgenie
pipeline_type: sample
input_schema: build_inschema.yaml
paths:
  refgenie: "refgenie"
  compute_vars: "{looper.piface_dir}/build-compute-params.py"
pre_submit:
  command_template: >
    {pipeline.paths.compute_vars} -a {sample.asset} -g {sample.genome}
command_template: >
  {pipeline.paths.refgenie} build { sample.genome }/{ sample.asset }
  {% if sample.genome_description is defined %}--genome-description '{ sample.genome_description }'{% endif %}
  {%- if sample.input_value is defined -%}
    {%- for it in sample.input_type -%}
      {%- if it == 'files' -%}
        {' --' + it + ' ' + sample.input_id[loop.index0] + '=' + sample.local_input + sample.input_id[loop.index0]}
      {%- else -%}
        {' --' + it + ' ' + sample.input_id[loop.index0] + '=' + sample.input_value[loop.index0]}
      {%- endif -%}
    {%- endfor -%}
  {%- endif -%}

nsheff commented 3 years ago

Ok, one issue is that the command_template under pre_submit is a list, while the other one is a singleton... So the above doesn't work, because:

Ignoring invalid pipeline interface source: /sfs/qumulo/qproject/shefflab/deploy/plantref/plantref/pipeline_interfaces/build_piface.yaml. Caught exception: '{pipeline.paths.compute_vars} -a {sample.asset} -g {sample.genome}\n' is not of type 'array'

You have to do:

pre_submit:
  command_template: 
    - "{pipeline.paths.compute_vars} -a {sample.asset} -g {sample.genome}"

How can we reconcile the need to accommodate multiple hooks with the need to make the interface to command_template uniform? Make the 2nd one also require a list?

nsheff commented 3 years ago

Syntax should become:

pre_submit:
  python_functions:
  command_templates:

Functionality doesn't need to change.

nsheff commented 3 years ago

@afrendeiro we've finalized the design and implemented this now. So, this will introduce some breaking changes for some of your pipelines, which will now be required to specify that they need that plugin in the new way. It should be as easy as adding:

pre_submit:
  python_functions:
    - looper.write_sample_yaml
var_templates:
  sample_yaml_path: "{looper.output_dir}/submission/{sample.sample_name}.yaml"

The output file path can be customized using var_templates.sample_yaml_path. If this parameter is not provided, the file will be saved as {looper.output_dir}/submission/{sample.sample_name}_sample.yaml.
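
Under the finalized design, the built-in plugin's path logic amounts to something like the following (a sketch of the behavior described above, with the sample as a peppy.Sample and the other namespaces treated as plain dicts; not the verbatim implementation):

import os

def write_sample_yaml(namespaces):
    """Write the sample yaml, honoring var_templates.sample_yaml_path if set."""
    sample = namespaces["sample"]
    looper = namespaces["looper"]
    pipeline = namespaces["pipeline"]
    # use the configured (already-rendered) template if provided...
    path = pipeline.get("var_templates", {}).get("sample_yaml_path")
    if not path:
        # ...otherwise fall back to the documented default
        path = os.path.join(looper["output_dir"], "submission",
                            sample.sample_name + "_sample.yaml")
    sample.to_yaml(path)
    return {"sample": sample}  # return only what was updated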