pepkit / looper

A job submitter for Portable Encapsulated Projects
http://looper.databio.org
BSD 2-Clause "Simplified" License
20 stars 7 forks source link

Refactor Looper's integration of Pipestat #492

Closed donaldcampbelljr closed 3 weeks ago

donaldcampbelljr commented 1 month ago

Currently, Looper checks if Pipestat has been configured for each sample before adding the sample to the submission conductor.

If pipestat can be successfully configured, looper generates a configuration file to be used by pipestat called looper_pipestat_config.yaml which looks something like this:

results_file_path: /home/drc/GITHUB/hello_looper/hello_looper/pipestat/./results.yaml
flag_file_dir: /home/drc/GITHUB/hello_looper/hello_looper/pipestat/./results/flags
record_identifier: frog_2
output_dir: /home/drc/GITHUB/hello_looper/hello_looper/pipestat/./results
schema_path: /home/drc/GITHUB/hello_looper/hello_looper/pipestat/./pipeline_pipestat/pipestat_output_schema.yaml
pipeline_name: test_pipe
pipeline_type: sample

Currently, user adds pipestat field to .looper.yaml file with relevant info:

pep_config: ./project/project_config.yaml # pephub registry path or local path  
output_dir: ./results  
pipeline_interfaces:  
  sample:  ./pipeline_pipestat/pipeline_interface.yaml  
  project: ./pipeline_pipestat/pipeline_interface_project.yaml  
pipestat:  
  results_file_path: results.yaml  
  flag_file_dir: results/flags

after setting everything up, looper creates a pipestat config file which can be used by the pipeline author to configure pipestat by passing that along to a pipestat instance within a pipeline:

results_file_path: /home/drc/GITHUB/hello_looper/hello_looper/pipestat/./results.yaml  
flag_file_dir: /home/drc/GITHUB/hello_looper/hello_looper/pipestat/./results/flags  
output_dir: /home/drc/GITHUB/hello_looper/hello_looper/pipestat/./results  
schema_path: /home/drc/GITHUB/hello_looper/hello_looper/pipestat/./pipeline_pipestat/pipestat_output_schema.yaml  
pipeline_name: example_pipestat_pipeline  
pipeline_type: sample  
record_identifier: frog_2

For example: the pipeline interface author (pipeline author) can pass these values to the pipeline:

pipeline_interface (for a pipeline.py):

pipeline_name: example_pipestat_pipeline  
pipeline_type: sample  
output_schema: pipestat_output_schema.yaml  
command_template: >  
  python {looper.piface_dir}/count_lines.py {sample.file} {sample.sample_name} {pipestat.results_file}

pipeline_interface (for a shell pipeline):

pipeline_name: example_pipestat_pipeline  
pipeline_type: sample  
output_schema: pipestat_output_schema.yaml  
command_template: >  
  {looper.piface_dir}/count_lines_pipestat.sh {sample.file} {sample.sample_name} {pipestat.config_file}

How looper checks for pipestat configuration: https://github.com/pepkit/looper/blob/389967231963ee00020baf93b5cc66288fc32745/looper/project.py#L336-L352

The main functions for this are _check_if_pipestat_configured and _get_pipestat_configuration.

Code moves through _check_if_pipestat_configured first and will return True or False. If there is any exception raised during the next step for either a single sample or a project, it will return False.

https://github.com/pepkit/looper/blob/389967231963ee00020baf93b5cc66288fc32745/looper/project.py#L471-L503

If this function returns False, looper continues, assuming the user does not wish to use pipestat.

Related Issues: https://github.com/pepkit/looper/issues/411 https://github.com/pepkit/looper/issues/413 https://github.com/pepkit/looper/issues/425 https://github.com/pepkit/looper/issues/459 https://github.com/pepkit/looper/issues/471

donaldcampbelljr commented 1 month ago

Here is _get_pipestat_configuration: https://github.com/pepkit/looper/blob/389967231963ee00020baf93b5cc66288fc32745/looper/project.py#L505-L599

donaldcampbelljr commented 1 month ago

Looper creates pipestat managers using: https://github.com/pepkit/looper/blob/389967231963ee00020baf93b5cc66288fc32745/looper/project.py#L450-L469

This func creates PipestatManagerObjects based on the configuration made within _get_pipestat_configuration

donaldcampbelljr commented 1 month ago

After today's discussion: ~- [ ] Looper should not create a pipestat config for every sample. Looper should use the hook system to create a pipestat configuration file that is to be used for the entire PEP.~ ~- [ ] similarly, looper should not create a pipestat object for each sample, it should instead create a PipestatManager once for the entire pep.~

donaldcampbelljr commented 1 month ago

Question:

What about when more than one pipeline interface is specified? Create a pipestat instance for each and thus each would have their own config file ala: pipeline_name_pipestat_config.yaml ?

pipeline_interfaces:
  sample:
    - ../pipeline/pipeline_interface1_sample.yaml
    - ../pipeline/pipeline_interface2_sample.yaml
  project:
    - ../pipeline/pipeline_interface1_project.yaml
    - ../pipeline/pipeline_interface2_project.yaml
nsheff commented 1 month ago

That probably makes sense.

the alternative would be, you could maybe just update the pipeline interface of an existing PipestatManager object

But maybe it makes sense to attach the PipestatManager to the pipeline interface object, so there would be one per pipeline interface.

donaldcampbelljr commented 1 month ago

But maybe it makes sense to attach the PipestatManager to the pipeline interface object, so there would be one per pipeline interface.

Yeah, I'm leaning towards this and will investigate this approach.

donaldcampbelljr commented 1 month ago

Another question/issue I've run into this afternoon:

What's the priority for sourcing the pipeline name?

Pipestat looks at the one supplied in the output_schema first. However, the pipeline_interface (either sample or project) used with Looper also has a pipeline_name and points to a pipestat output schema.

These pipeline_names could be different if the user wishes to use one output_schema for two different interfaces.

I just modified the order in which Pipestat can determine the pipeline_name:

        self.cfg[PIPELINE_NAME] = (
            pipeline_name
            or self.cfg[CONFIG_KEY].get(PIPELINE_NAME)
            or self.cfg[SCHEMA_KEY].pipeline_name
            if self.cfg[SCHEMA_KEY] is not None
            else DEFAULT_PIPELINE_NAME
        )

So user_supplied -> config_supplied -> schema_supplied else use the default.

donaldcampbelljr commented 1 month ago

Follow up issue: I believe the project level execution currently only works when using runp command. It appears as though there was an argument project that would enable project-level running for all other commands (except run) but I believe this is now broken. Need to add this back in so that looper only creates and uses project-level PipestatManagers when appropriate. Looper is currently creating psms for every pipeline_interface whether it be sample or project. Due to this and the above issue, I'm seeing errors where sample-level and project-level executions are attempting to write to the same results.yaml file while having different pipeline_names and it is throwing an error (because multi_pipelines is set to False by default).

donaldcampbelljr commented 1 month ago

Some notes for next Steps:

The above will be fixes for 1.9.0 that should work for the majority of cases until we do more work to the system for 2.0 release.

donaldcampbelljr commented 1 month ago

For Looper 2.0, we will consolidate sample and project level pipeline interfaces under a single interface. This will break backwards compatibility.

pipeline_name: example_pipestat_pipeline
output_schema: pipestat_output_schema.yaml
sample_interface:
  command_template: >
   python {looper.piface_dir}/count_lines.py {sample.file} {sample.sample_name} {pipestat.results_file}
project_interface:
  command_template: >
   python {looper.piface_dir}/count_lines_project.py {sample.file} {pipestat.results_file}

And the output schema pipeline name must match with the pipeline interface pipeline_name. We will enforce this by raising an exception of they do not match.

donaldcampbelljr commented 1 month ago

For sample vs project level commands, see this related issue: https://github.com/pepkit/looper/issues/360

donaldcampbelljr commented 1 month ago

Ok, the refactoring is nearly complete and I've also added the --project argument back to the CLI. I will still need to do some final clean up/testing before merging.