pepkit / looper

A job submitter for Portable Encapsulated Projects
http://looper.databio.org
BSD 2-Clause "Simplified" License

Project.get_outputs() functionality #165

Closed. nsheff closed this issue 4 years ago

nsheff commented 5 years ago

Caravel would like a list of outputs produced by a pipeline.

This information is stored in the pipeline_interface, which looper.Project contains. The pipeline_interface could encode outputs using syntax like this:

pipelines:
  pepatac.py:
    name: PEPATAC
    path: pipelines/pepatac.py
    looper_args: True
    arguments:
      "--sample-name": sample_name
    optional_arguments:
      "--input2": read2
    outputs:
      smooth_bw: "aligned_{sample.genome}/{sample.name}_smooth.bw"
      pre_smooth_bw: "aligned_{project.prealignments}/{sample.name}_smooth.bw"
    compute:
      singularity_image: ${SIMAGES}pepatac
    summarizers:
      - tools/PEPATAC_summarizer.R
    summary_results:
      - alignment_percent_file:
        caption: "Alignment percent file"
        description: "Plots percent of total alignment to all pre-alignments and primary genome."
        thumbnail_path: "summary/{name}_alignmentPercent.png"
        path: "summary/{name}_alignmentPercent.pdf"
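The `outputs` entries above look like Python format-string templates whose fields refer to attributes of a sample or of the project. As a minimal sketch (the helper name `template_fields` is hypothetical and not part of looper), the standard library can list which attributes a template depends on:

```python
from string import Formatter


def template_fields(template):
    """Return the replacement-field names used in a format-string template."""
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for a trailing literal segment, so skip those.
    return [field for _, field, _, _ in Formatter().parse(template) if field]


# Templates copied from the `outputs` section of the example interface.
outputs = {
    "smooth_bw": "aligned_{sample.genome}/{sample.name}_smooth.bw",
    "pre_smooth_bw": "aligned_{project.prealignments}/{sample.name}_smooth.bw",
}

fields_by_output = {name: template_fields(path) for name, path in outputs.items()}
```

Here `fields_by_output["smooth_bw"]` lists `sample.genome` and `sample.name`, which shows that a path can only be fully resolved once both a sample and the project are known.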

The get_outputs function should return a nested dict:

{
  pipeline: {
    output_name: {
      path: output_path,
      samples: [sample_key1, sample_key2, ...]
    }
  }
}

For example:

{
  PEPATAC: {
    smooth_bw: {
      path: "aligned_{sample.genome}/{sample.name}_smooth.bw",
      samples: [sample_key1, sample_key2, ...]
    }
  }
}

This best preserves the structure of the outputs. Because outputs are nested under their pipeline, output names need not be unique across pipelines.

The Project object will need to look at each PipelineInterface it holds, see if it provides any outputs, and then identify any samples that would run that pipeline.
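The lookup described above could be sketched roughly as follows. This is not looper's implementation: plain dicts stand in for the PipelineInterface objects, and the `samples` mapping (sample key to pipeline script) is a hypothetical substitute for however the Project matches samples to pipelines.

```python
def get_outputs(interfaces, samples):
    """Collect declared outputs per pipeline, with the samples that would run it.

    interfaces: list of pipeline-interface dicts shaped like the YAML example
                (hypothetical stand-in for PipelineInterface objects).
    samples:    dict of sample key -> pipeline script name
                (hypothetical stand-in for sample-to-pipeline matching).
    """
    result = {}
    for interface in interfaces:
        for script, pipeline in interface.get("pipelines", {}).items():
            declared = pipeline.get("outputs")
            if not declared:
                continue  # this pipeline declares no outputs
            # Samples that would be submitted to this pipeline.
            matching = [key for key, target in samples.items() if target == script]
            name = pipeline.get("name", script)
            result[name] = {
                output_name: {"path": path, "samples": list(matching)}
                for output_name, path in declared.items()
            }
    return result


interfaces = [{
    "pipelines": {
        "pepatac.py": {
            "name": "PEPATAC",
            "outputs": {
                "smooth_bw": "aligned_{sample.genome}/{sample.name}_smooth.bw",
            },
        },
        "other.py": {"name": "OTHER"},  # no outputs declared, so it is skipped
    }
}]
samples = {
    "sample_key1": "pepatac.py",
    "sample_key2": "pepatac.py",
    "sample_key3": "other.py",
}

project_outputs = get_outputs(interfaces, samples)
```

Keying the result by pipeline name first keeps two pipelines that both declare, say, `smooth_bw` from colliding.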

nsheff commented 5 years ago

Related to other issues dealing with the pipeline_interface structure:

#61

#32

#5

nsheff commented 5 years ago

I wrote a function that takes this output and populates the actual paths... maybe this should just belong on the project object as well?

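    populated_outputs = {}
    # Populate the path variables for every sample of every pipeline output.
    for pipeline_name, pipeline_outputs in project_outputs.items():
        populated_outputs[pipeline_name] = {}
        for output_name, output_info in pipeline_outputs.items():
            populated_outputs[pipeline_name][output_name] = {}
            for sample in output_info.samples:
                # str.join takes a single iterable; "/" is assumed as the
                # separator between the URL prefix and the output template.
                populated_output = "/".join([
                    "{base_url}data/{project.metadata.results_subdir}/{sample.name}",
                    output_info.path]).format(
                        sample=globs.p.get_sample(sample),
                        base_url=request.url_root,
                        project=globs.p)
                # Assumed continuation: keep the populated path per sample key.
                populated_outputs[pipeline_name][output_name][sample] = populated_output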
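A self-contained variant of the same idea is sketched below so it can be run outside a web application. The function name `populate_outputs` is hypothetical, `SimpleNamespace` objects stand in for the real project and sample objects, the base URL is made up, and joining the prefix and the template with "/" is an assumption about the intended URL layout.

```python
from types import SimpleNamespace


def populate_outputs(project_outputs, project, samples, base_url):
    """Fill each output's path template for every sample listed under it."""
    prefix = "{base_url}data/{project.metadata.results_subdir}/{sample.name}"
    populated = {}
    for pipeline_name, pipeline_outputs in project_outputs.items():
        populated[pipeline_name] = {}
        for output_name, output_info in pipeline_outputs.items():
            # Assumed layout: URL prefix, then the output's relative path.
            template = "/".join([prefix, output_info["path"]])
            populated[pipeline_name][output_name] = {
                key: template.format(
                    sample=samples[key], project=project, base_url=base_url
                )
                for key in output_info["samples"]
            }
    return populated


# Stand-in objects; real project and sample objects would carry these attributes.
project = SimpleNamespace(metadata=SimpleNamespace(results_subdir="results_pipeline"))
samples = {"s1": SimpleNamespace(name="s1", genome="hg38")}
project_outputs = {
    "PEPATAC": {
        "smooth_bw": {
            "path": "aligned_{sample.genome}/{sample.name}_smooth.bw",
            "samples": ["s1"],
        }
    }
}

urls = populate_outputs(project_outputs, project, samples, "http://localhost:5000/")
```

With these stand-ins, `urls["PEPATAC"]["smooth_bw"]["s1"]` is `http://localhost:5000/data/results_pipeline/s1/aligned_hg38/s1_smooth.bw`. Passing the project, samples, and base URL as arguments instead of reading them from globals is one way such a function could live on the project object.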