pepkit / looper

A job submitter for Portable Encapsulated Projects
http://looper.databio.org
BSD 2-Clause "Simplified" License

Decouple the looper pipeline settings from the PEP sample metadata #342

Closed nsheff closed 12 months ago

nsheff commented 1 year ago

The question

To run a project, we need 1) a pipeline, and 2) some samples to run the pipeline on. They're independent, since I could run a different pipeline on those samples, or run that pipeline on different samples. So, it seems nice to specify them independently.

But right now, looper sticks its fingers all over the PEP, which is supposed to specify the sample metadata. For example:

  1. Looper accepts a PEP, which points to a pipeline interface. So, the PEP (sample table) is directly connected to the pipeline settings.
  2. You specify command-line parameters to the pipelines by using sample_modifiers. So, you're configuring the pipeline run using the PEP.

Because we are modifying the PEP to define and configure the pipeline, the PEP becomes coupled to that particular pipeline. But wait -- a great thing about PEP is that these things happen inside the yaml config file, not in the sample table, so the sample table itself stays portable. That's great, but wouldn't it be even better if the entire PEP were portable? Maybe... but on the other hand, in some sense the whole point of the PEP was to move the non-portable stuff into the config file.

Possible solutions

  1. One possibility is to have the looper config specify the PEP and the pipeline settings (interface/parameters) independently. The looper config then points to two places instead of one, and the pipeline settings are removed from the PEP.

  2. Alternatively, this could be done using the PEP import project modifier. To make the config file portable as well, you could have two config files, one of which imports the other. The "outer" config, the one you pass to looper, would import the "inner" one. All pipeline/analysis-specific settings would live in the outer config, and the inner config (the portable one) would hold only information pertaining to the samples.
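
To make the two options easier to compare, here is a rough Python sketch that models each layout with plain dictionaries. It does not use the looper or peppy APIs. Every key name (`pep_config`, `pipeline_interfaces`, `pipeline_args`, `imports`), every file name, and the `resolve_imports` helper are illustrative assumptions, not the real schema, and the "files" are looked up in an in-memory registry rather than read from disk.

```python
from copy import deepcopy

# Portable "inner" config: only information about the samples.
# (File and key names here are hypothetical.)
inner_pep = {
    "sample_table": "samples.csv",
}

# Option 1: a separate looper config points to two places independently,
# the PEP and the pipeline settings. The PEP itself stays untouched.
looper_config = {
    "pep_config": "inner_pep.yaml",
    "pipeline_interfaces": ["pipeline_interface.yaml"],
    "pipeline_args": {"--genome": "hg38"},
}

# Option 2: an "outer" PEP config imports the inner one and carries all
# pipeline/analysis-specific settings itself.
outer_pep = {
    "imports": ["inner_pep.yaml"],
    "pipeline_interfaces": ["pipeline_interface.yaml"],
    "pipeline_args": {"--genome": "hg38"},
}


def resolve_imports(config, registry):
    """Merge imported configs first, then overlay the importing config.

    `registry` maps a config name to its already-parsed dictionary.
    The merge is shallow: a key in the importing config replaces the
    same key from an imported config.
    """
    merged = {}
    for name in config.get("imports", []):
        merged.update(resolve_imports(registry[name], registry))
    merged.update({k: deepcopy(v) for k, v in config.items() if k != "imports"})
    return merged


registry = {"inner_pep.yaml": inner_pep}
resolved = resolve_imports(outer_pep, registry)
```

In both layouts the inner config contains no pipeline settings, so it can be reused with a different pipeline. The difference is where the pointer lives: in option 1 looper reads a config of its own, while in option 2 looper still receives a PEP, just one that wraps the portable PEP.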

Advantages and disadvantages

nsheff commented 1 year ago

I'm a bit conflicted here and would like to hear if anyone who has lots of experience with PEP/looper has any ideas... Opinions, @stolarczyk @vreuter @jpsmith5 @afrendeiro ?

jpsmith5 commented 1 year ago

I think I lean toward option 1 actually. I feel like it may be easier conceptually to have a separate config file outside of the PEP file.