pepkit / looper

A job submitter for Portable Encapsulated Projects
http://looper.databio.org
BSD 2-Clause "Simplified" License

Looper summarizer specification and spin-off #242

Closed: nsheff closed this issue 10 months ago

nsheff commented 4 years ago

After completing pepkit/looper#238, the summarizer will be reduced to the built-in summarizer only (no more custom summarizers). We now have runp for project-level pipelines, which replaces the need for custom summarizers; those were basically project-level pipelines anyway.

At this point, we should:

  1. formally define the specification for how to communicate with the summarizer (right now it's an undocumented coupling to pypiper).
  2. spin off the summarizer as a standalone report summary tool.
  3. define types (merge 'stats' and 'objects' into one file with different formal types).

Types and their function could be:

HTML results would fall under 'file' I guess.

nsheff commented 4 years ago

In reviewing old issues, I have another thought... not only the actual reported results, but also the status of the run could fall under this same spec. Or is it a different spec?

The analog is the flag system. Pypiper is setting flags, and looper reads them with 'looper check'.

With a separate formal tool for that, we would outsource the status to that alternative system. It seems to fit, since this summary tool is also used to watch progress (as you summarize, results get reported). We want to know when the job is complete, for example.

nsheff commented 4 years ago

This becomes pipestat. It is a python package with a CLI, which operates like:

pipestat stat_name stat_type value

e.g.

pipestat Aligned_reads numeric 3000000

Can also be called from python:

import pipestat
psm = pipestat.PipeStatManager(database_connection_or_path)
psm.write("Aligned_reads", "numeric", 3000000)

We document a CLI and python API. Pypiper uses the python API; any shell pipeline could use pipestat in its CLI.
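To make the name/type/value idea concrete, here is a minimal sketch of what such a writer could do internally. Everything in it is an assumption for illustration: `ResultWriter`, the JSON file layout, and the set of allowed types are hypothetical (only 'numeric' and 'file' come up in this thread), and this is not the pipestat API.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical type vocabulary; only "numeric" and "file" are mentioned in the thread.
ALLOWED_TYPES = {"numeric", "file"}


class ResultWriter:
    """Toy stand-in for the proposed manager; not the real pipestat API."""

    def __init__(self, path):
        self.path = Path(path)

    def write(self, name, result_type, value):
        """Validate one result and store it in a single JSON file."""
        if result_type not in ALLOWED_TYPES:
            raise ValueError(f"unknown result type: {result_type}")
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if result_type == "numeric" and not is_number:
            raise TypeError(f"{name} must be a number")
        # Load existing records (if any), add or overwrite this one, write back.
        records = json.loads(self.path.read_text()) if self.path.exists() else {}
        records[name] = {"type": result_type, "value": value}
        self.path.write_text(json.dumps(records, indent=2))
        return records


def demo_write():
    """Report one numeric result and one file result into a temp file."""
    with tempfile.TemporaryDirectory() as tmp:
        writer = ResultWriter(Path(tmp) / "results.json")
        writer.write("Aligned_reads", "numeric", 3000000)
        return writer.write("report", "file", "report.html")
```

Keeping the type next to each value is what would let 'stats' and 'objects' live in one file: a reader can decide per record whether to show a number in a table or link to a file.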

pipestat summarize can create a table summarizing the reported results. It's the table function of looper, independent of looper. It just needs a list of the samples. Where does it get that list? Well, it can just take a list of files, or a database connection. Looper can just manage that list of files or database connection.
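A rough sketch of the "list of files in, table out" idea, under stated assumptions: each sample has one JSON file mapping result names to `{"type": ..., "value": ...}` records, and `summarize` is a hypothetical helper, not an existing pipestat or looper function.

```python
import json
import tempfile
from pathlib import Path


def summarize(result_files):
    """Build a table (list of rows) from {sample_name: path_to_json_results}."""
    rows = {}
    columns = []
    for sample, path in result_files.items():
        records = json.loads(Path(path).read_text())
        rows[sample] = {name: rec["value"] for name, rec in records.items()}
        # Collect column names in first-seen order across all samples.
        for name in records:
            if name not in columns:
                columns.append(name)
    table = [["sample"] + columns]
    for sample, values in rows.items():
        # Samples missing a result get an empty cell.
        table.append([sample] + [values.get(col, "") for col in columns])
    return table


def demo_summarize():
    """Write two made-up sample files to a temp dir and summarize them."""
    data = {
        "sample1": {"Aligned_reads": {"type": "numeric", "value": 3000000}},
        "sample2": {
            "Aligned_reads": {"type": "numeric", "value": 2500000},
            "report": {"type": "file", "value": "report.html"},
        },
    }
    with tempfile.TemporaryDirectory() as tmp:
        files = {}
        for sample, records in data.items():
            path = Path(tmp) / f"{sample}.json"
            path.write_text(json.dumps(records))
            files[sample] = path
        return summarize(files)
```

The caller owns the sample-to-file mapping here, which matches the split described above: looper (or anything else) manages the list, and the summarizer only reads it.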

A few remaining questions:

nsheff commented 4 years ago

A use case of the CLI for pipestat: https://github.com/pepkit/hello_looper/issues/3

nsheff commented 4 years ago

Another related issue: Right now, looper reads flags output from pypiper, but expects these to be in a particular location.

The refgenie build process puts them in a subfolder of the canonical outfolder, to separate the build logs from the pipeline results that go into the archive. Because of this, looper can't find the flag, and doesn't know which jobs are complete.

So, there needs to be:

  1. a way for the pipeline interface to specify how to find the flags
  2. hopefully, a formal specification for what the flags are, so that non-pypiper-pipelines can use this feature.
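
As a sketch of the first point, the lookup could take the flag location as a parameter instead of hard-coding it. The assumptions here are loud ones: flags are taken to be empty files named `<pipeline>_<status>.flag`, the status list is illustrative, and `find_status`, `flag_subdir` and the `build_logs` folder name are hypothetical, not part of looper or the pipeline interface.

```python
import tempfile
from pathlib import Path

# Assumed status vocabulary; a formal flag spec would define the real one.
STATUSES = ("running", "completed", "failed")


def find_status(outfolder, flag_subdir=""):
    """Return the statuses of all flag files found in outfolder/flag_subdir."""
    flag_dir = Path(outfolder) / flag_subdir
    found = []
    for flag in sorted(flag_dir.glob("*.flag")):
        # Assumed naming: "<pipeline>_<status>.flag"
        status = flag.stem.rsplit("_", 1)[-1]
        if status in STATUSES:
            found.append(status)
    return found


def demo_flags():
    """Put a flag in a subfolder and look it up with and without the hint."""
    with tempfile.TemporaryDirectory() as tmp:
        logs = Path(tmp) / "build_logs"  # hypothetical subfolder name
        logs.mkdir()
        (logs / "mypipeline_completed.flag").touch()
        return find_status(tmp), find_status(tmp, "build_logs")
```

With the default location the flag is missed, which is the situation described above; once the subfolder is passed in, the completed job is found.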

This issue had been previously raised as pepkit/pipestat#34