Closed donaldcampbelljr closed 3 weeks ago
Here is _get_pipestat_configuration
:
https://github.com/pepkit/looper/blob/389967231963ee00020baf93b5cc66288fc32745/looper/project.py#L505-L599
Looper creates pipestat managers using: https://github.com/pepkit/looper/blob/389967231963ee00020baf93b5cc66288fc32745/looper/project.py#L450-L469
This func creates PipestatManagerObjects based on the configuration made within _get_pipestat_configuration
After today's discussion: ~- [ ] Looper should not create a pipestat config for every sample. Looper should use the hook system to create a pipestat configuration file that is to be used for the entire PEP.~ ~- [ ] similarly, looper should not create a pipestat object for each sample, it should instead create a PipestatManager once for the entire pep.~
What about when more than one pipeline interface is specified? Create a pipestat instance for each and thus each would have their own config file ala: pipeline_name_pipestat_config.yaml
?
pipeline_interfaces:
sample:
- ../pipeline/pipeline_interface1_sample.yaml
- ../pipeline/pipeline_interface2_sample.yaml
project:
- ../pipeline/pipeline_interface1_project.yaml
- ../pipeline/pipeline_interface2_project.yaml
That probably makes sense.
the alternative would be, you could maybe just update the pipeline interface of an existing PipestatManager object
But maybe it makes sense to attach the PipestatManager to the pipeline interface object, so there would be one per pipeline interface.
But maybe it makes sense to attach the PipestatManager to the pipeline interface object, so there would be one per pipeline interface.
Yeah, I'm leaning towards this and will investigate this approach.
Another question/issue I've run into this afternoon:
What's the priority for sourcing the pipeline name?
Pipestat looks at the one supplied in the output_schema first. However, the pipeline_interface (either sample or project) used with Looper also has a pipeline_name and points to a pipestat output schema.
These pipeline_names could be different if the user wishes to use one output_schema for two different interfaces.
I just modified the order in which Pipestat can determine the pipeline_name:
self.cfg[PIPELINE_NAME] = (
pipeline_name
or self.cfg[CONFIG_KEY].get(PIPELINE_NAME)
or self.cfg[SCHEMA_KEY].pipeline_name
if self.cfg[SCHEMA_KEY] is not None
else DEFAULT_PIPELINE_NAME
)
So user_supplied -> config_supplied -> schema_supplied else use the default.
Follow up issue:
I believe the project level execution currently only works when using runp
command. It appears as though there was an argument project
that would enable project-level running for all other commands (except run
) but I believe this is now broken. Need to add this back in so that looper only creates and uses project-level PipestatManagers when appropriate. Looper is currently creating psm
s for every pipeline_interface whether it be sample or project. Due to this and the above issue, I'm seeing errors where sample-level and project-level executions are attempting to write to the same results.yaml
file while having different pipeline_names and it is throwing an error (because multi_pipelines
is set to False
by default).
Some notes for next Steps:
The above will be fixes for 1.9.0 that should work for the majority of cases until we do more work to the system for 2.0 release.
For Looper 2.0, we will consolidate sample and project level pipeline interfaces under a single interface. This will break backwards compatibility.
pipeline_name: example_pipestat_pipeline
output_schema: pipestat_output_schema.yaml
sample_interface:
command_template: >
python {looper.piface_dir}/count_lines.py {sample.file} {sample.sample_name} {pipestat.results_file}
project_interface:
command_template: >
python {looper.piface_dir}/count_lines_project.py {sample.file} {pipestat.results_file}
And the output schema pipeline name must match with the pipeline interface pipeline_name. We will enforce this by raising an exception of they do not match.
For sample vs project level commands, see this related issue: https://github.com/pepkit/looper/issues/360
Ok, the refactoring is nearly complete and I've also added the --project
argument back to the CLI. I will still need to do some final clean up/testing before merging.
Currently, Looper checks if Pipestat has been configured for each sample before adding the sample to the submission conductor.
If pipestat can be successfully configured, looper generates a configuration file to be used by pipestat called
looper_pipestat_config.yaml
which looks something like this:Currently, user adds pipestat field to
.looper.yaml
file with relevant info:after setting everything up, looper creates a pipestat config file which can be used by the pipeline author to configure pipestat by passing that along to a pipestat instance within a pipeline:
For example: the pipeline interface author (pipeline author) can pass these values to the pipeline:
pipeline_interface (for a pipeline.py):
pipeline_interface (for a shell pipeline):
How looper checks for pipestat configuration: https://github.com/pepkit/looper/blob/389967231963ee00020baf93b5cc66288fc32745/looper/project.py#L336-L352
The main functions for this are
_check_if_pipestat_configured
and_get_pipestat_configuration
.Code moves through
_check_if_pipestat_configured
first and will return True or False. If there is any exception raised during the next step for either a single sample or a project, it will return False.https://github.com/pepkit/looper/blob/389967231963ee00020baf93b5cc66288fc32745/looper/project.py#L471-L503
If this function returns False, looper continues, assuming the user does not wish to use pipestat.
Related Issues: https://github.com/pepkit/looper/issues/411 https://github.com/pepkit/looper/issues/413 https://github.com/pepkit/looper/issues/425 https://github.com/pepkit/looper/issues/459 https://github.com/pepkit/looper/issues/471