pepkit / pipestat

Pipeline results reporting package
https://pep.databio.org/pipestat/
BSD 2-Clause "Simplified" License

What is multi_pipelines and when should it be used? #156

Closed nsheff closed 3 months ago

nsheff commented 5 months ago

What is multi_pipelines? When should it be used? This is not well documented.

The docstring says:

:param bool multi_pipelines: allows for running multiple pipelines for one file backend

There are no other details.

Here I'm starting to document this, for future reference and integration into the docs site.

The recommended way to use pipestat is that each pipeline (which corresponds to a pipestat namespace) has its own output file.

However, pipestat can also work in an environment where multiple pipelines all write to the same output file. This is not recommended, since every pipeline then writes to that one file, which can lead to performance issues when there are many pipelines and lots of samples, but you can do it.

I think the point of multi_pipelines=True is that you have to pass it in this latter case: multiple pipelines (namespaces) written to the same file. But why? What exactly is the limitation here that requires a separate parameter?

I guess there's a third possibility, too: what if you set up pipestat to have a different output file not just for each pipeline, but also for each sample?

Some questions we need to document are:

  1. How does pipestat aggregate results if you write to separate files?
  2. Is there a way to configure looper to pass separate results files for each sample? It seems to me that right now you can only configure looper with a single file for the project.

donaldcampbelljr commented 5 months ago

From the commit message, it appears that multi_pipelines was added to ensure pypiper compatibility when changing pypiper's report_result and report_object to use pipestat's report function: 93553eaacf84ae9815d86d1513dd35687a3f7ed6.

This flag overrides a check during file backend loading that otherwise prevents the user from loading a results file that was written by a different pipeline.
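
To make the check described above concrete, here is a minimal sketch of the idea. It is not pipestat's actual code: load_results, PipelineMismatchError, and the dictionary layout are all hypothetical stand-ins, and the only thing taken from the thread is that the flag bypasses a "different pipeline" check at load time.

```python
class PipelineMismatchError(Exception):
    """Hypothetical error: the results file belongs to another pipeline."""


def load_results(existing: dict, pipeline_name: str, multi_pipelines: bool = False) -> dict:
    """Toy stand-in for loading a file backend (not pipestat's real code).

    `existing` maps pipeline names (namespaces) to their reported results.
    """
    # Namespaces already present in the file that are not ours.
    other = [name for name in existing if name != pipeline_name]
    if other and not multi_pipelines:
        raise PipelineMismatchError(
            f"Results file already holds {other}; refusing to load it for "
            f"'{pipeline_name}' without multi_pipelines=True"
        )
    # Ensure the current pipeline has its own namespace in the shared data.
    existing.setdefault(pipeline_name, {})
    return existing


# Hypothetical contents of a results file written by pipeline_a.
shared = {"pipeline_a": {"sample1": {"reads": 100}}}

try:
    load_results(dict(shared), "pipeline_b")
    blocked = False
except PipelineMismatchError:
    blocked = True

# With the flag set, the second pipeline gets its own namespace in the same data.
merged = load_results(dict(shared), "pipeline_b", multi_pipelines=True)
```

In this sketch the default protects a results file from being picked up by the wrong pipeline by accident, and multi_pipelines=True is an explicit opt-in to sharing it.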

To configure Looper (using pipestat) to report to a separate file for each record_identifier, one can specify a results file path that includes {record_identifier}:

Example from PEPATAC:

name: PEPATAC_tutorial
pep_config: tutorial_refgenie_project_config.yaml

output_dir: "${TUTORIAL}/processed/"
pipeline_interfaces:
  sample: ["${TUTORIAL}/tools/pepatac/sample_pipeline_interface.yaml"]
  project: ["${TUTORIAL}/tools/pepatac/project_pipeline_interface.yaml"]

pipestat:
  results_file_path: "${TUTORIAL}/processed/results_pipeline/{record_identifier}/stats.yaml"
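
As an illustration of what the {record_identifier} placeholder accomplishes, the sketch below fills the template from the config above for two made-up record names. The helper results_path_for and the record names are hypothetical, and plain string replacement is only an assumption used for illustration; the thread does not show how looper or pipestat perform the substitution internally.

```python
# Template copied from the results_file_path in the config above.
TEMPLATE = "${TUTORIAL}/processed/results_pipeline/{record_identifier}/stats.yaml"


def results_path_for(record_identifier: str, template: str = TEMPLATE) -> str:
    """Fill the {record_identifier} placeholder for one record.

    str.replace is used instead of str.format so that the ${TUTORIAL}
    environment-variable reference is left untouched.
    """
    return template.replace("{record_identifier}", record_identifier)


# Hypothetical record identifiers; each one gets its own results file.
paths = [results_path_for(r) for r in ("tutorial1", "tutorial2")]
for p in paths:
    print(p)
```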

For aggregating result files, pipestat has a function, aggregate_multi_results. It is only called after a check with check_multi_results, which happens when pipestat summarize (looper report), pipestat link (looper link), or pipestat table is used.
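
Conceptually, aggregating per-record files means merging their contents into one combined structure. The sketch below shows that idea on plain dictionaries. It is not the implementation of aggregate_multi_results: the helper name aggregate_results, the pipeline name, the result values, and the assumed pipeline -> level -> record nesting are all hypothetical.

```python
def aggregate_results(per_record_files: list) -> dict:
    """Merge the parsed contents of several per-record results files.

    Each item is assumed to be nested as pipeline -> level -> record -> results.
    """
    combined: dict = {}
    for contents in per_record_files:
        for pipeline, levels in contents.items():
            for level, records in levels.items():
                # Create the pipeline/level namespace on first use, then add records.
                combined.setdefault(pipeline, {}).setdefault(level, {}).update(records)
    return combined


# Hypothetical parsed contents of two per-record stats.yaml files.
file_a = {"pepatac": {"sample": {"tutorial1": {"aligned_reads": 100}}}}
file_b = {"pepatac": {"sample": {"tutorial2": {"aligned_reads": 250}}}}

combined = aggregate_results([file_a, file_b])
```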

donaldcampbelljr commented 3 months ago

Ok, I've added clarification to the pipestat docs regarding this.