Closed nsheff closed 3 months ago
From the commit message, it appears that `multi_pipelines` was added to ensure pypiper compatibility when changing pypiper's `report_result` and `report_object` to use pipestat's report function: 93553eaacf84ae9815d86d1513dd35687a3f7ed6.
This flag overrides a check during file backend loading that will not let the user load a results file from a different pipeline.
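As a rough illustration of the kind of check this flag overrides, here is a toy model in plain Python. It is not pipestat's implementation: `load_results`, `PipelineMismatchError`, and the in-memory layout (top-level keys are pipeline names) are all assumptions made for the sketch.

```python
class PipelineMismatchError(Exception):
    """Raised when a results file holds another pipeline's results (hypothetical)."""


def load_results(file_data, pipeline_name, multi_pipelines=False):
    """Toy loader: file_data maps pipeline names to their reported results."""
    others = [name for name in file_data if name != pipeline_name]
    if others and not multi_pipelines:
        # Default behavior in this sketch: refuse to share the file.
        raise PipelineMismatchError(
            f"Results file already contains other pipelines: {others}"
        )
    # Return this pipeline's namespace, creating it if it is absent.
    return file_data.setdefault(pipeline_name, {})


existing = {"pipeline_a": {"sample1": {"reads": 100}}}

try:
    load_results(existing, "pipeline_b")
    blocked = False
except PipelineMismatchError:
    blocked = True

# With the flag set, the second pipeline gets its own namespace in the same file.
shared = load_results(existing, "pipeline_b", multi_pipelines=True)
```

In this sketch the flag only relaxes a safety check; it does not change how results are stored.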
To configure Looper to report separately for each individual `record_identifier` (using pipestat), one can add a results file path that includes `{record_identifier}`:
Example from PEPATAC:

```yaml
name: PEPATAC_tutorial
pep_config: tutorial_refgenie_project_config.yaml
output_dir: "${TUTORIAL}/processed/"
pipeline_interfaces:
  sample: ["${TUTORIAL}/tools/pepatac/sample_pipeline_interface.yaml"]
  project: ["${TUTORIAL}/tools/pepatac/project_pipeline_interface.yaml"]
pipestat:
  results_file_path: "${TUTORIAL}/processed/results_pipeline/{record_identifier}/stats.yaml"
```
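To make the effect of the `{record_identifier}` placeholder concrete, the sketch below expands the path template from the PEPATAC example for two records. How looper and pipestat perform this expansion internally is not shown in this thread, so the `results_path` helper, the record names, and the value used for `TUTORIAL` are assumptions.

```python
from string import Template

TEMPLATE = "${TUTORIAL}/processed/results_pipeline/{record_identifier}/stats.yaml"


def results_path(template, record_identifier, env):
    """Hypothetical helper: fill the record placeholder, then expand ${VAR}."""
    # str.format would trip over "${TUTORIAL}", so use a plain replace.
    path = template.replace("{record_identifier}", record_identifier)
    return Template(path).safe_substitute(env)


env = {"TUTORIAL": "/home/user/tutorial"}  # assumed value for illustration
paths = {rid: results_path(TEMPLATE, rid, env) for rid in ["sample1", "sample2"]}

for rid, path in paths.items():
    print(rid, "->", path)
```

Each record ends up with its own `stats.yaml`, so no two records write to the same file.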
To aggregate result files, pipestat provides the function `aggregate_multi_results`, but it is only called after a check with `check_multi_results`, which happens when `pipestat summarize` (`looper report`), `pipestat link` (`looper link`), or `pipestat table` is used.
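The check-then-aggregate flow can be sketched in plain Python. This is a toy model, not pipestat's code: the helper names `uses_per_record_files` and `aggregate_record_files` are hypothetical, the check is assumed to look only for the placeholder in the path, and JSON stands in for YAML so the example needs only the standard library.

```python
import glob
import json
import os
import tempfile

PLACEHOLDER = "{record_identifier}"


def uses_per_record_files(template):
    """Toy check: per-record files are in use if the path has the placeholder."""
    return PLACEHOLDER in template


def aggregate_record_files(template):
    """Merge every per-record file matching the template into one mapping."""
    combined = {}
    for path in sorted(glob.glob(template.replace(PLACEHOLDER, "*"))):
        with open(path) as handle:
            combined.update(json.load(handle))
    return combined


with tempfile.TemporaryDirectory() as root:
    template = os.path.join(root, PLACEHOLDER, "stats.json")
    # Write one small results file per record.
    for record_id, reads in [("sample1", 100), ("sample2", 250)]:
        path = template.replace(PLACEHOLDER, record_id)
        os.makedirs(os.path.dirname(path))
        with open(path, "w") as handle:
            json.dump({record_id: {"reads": reads}}, handle)
    # Aggregate only when the check says the results are split per record.
    if uses_per_record_files(template):
        summary = aggregate_record_files(template)
```

Summary-style commands would then operate on the combined mapping rather than on any single record's file.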
Ok, I've added clarification to the pipestat docs regarding this.
What is `multi_pipelines`? When should it be used? This is not documented well. The docstring says:
There are no other details.
Here I'm attempting to start to document this, for future reference and integration into the docs site.
The recommended way to use pipestat is that each pipeline (which corresponds to a pipestat namespace) has its own output file.
However, pipestat can also work in an environment where multiple pipelines all write to the same output file. This is not recommended, since it increases writing to that one file and can lead to performance issues if there are multiple pipelines and lots of samples, but you can do it.
I think the point of `multi_pipelines=True` is that you have to pass this if you are doing this latter case: multiple pipelines (namespaces) written to the same file. But why? What exactly is the limitation here that requires a separate parameter?

I guess there's a third possibility, too: what if you set up pipestat to have a different output not just for each pipeline, but also for each sample?
Some questions we need to document are: