Looper summarizer specification and spin-off

nsheff commented 4 years ago

After completing pepkit/looper#238, the summarizer will be simplified to only the built-in summarizer (no more custom summarizers). This is because we now have runp for project-level pipelines, which replace the need for the custom summarizers, which were basically project-level pipelines.

At this point, we should:

formally define the specification for how to communicate with the summarizer (right now it's an undocumented coupling to pypiper).
spin off the summarizer as a standalone report summary tool.
define types (merge 'stats' and 'objects' into one file with different formal types).

Types and their function could be:

string (shows up in metadata table)
numeric (shows up in metadata table, autoplotter)
image (has thumbnail attribute, download link)
file (has download link)

HTML results would fall under 'file' I guess.
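A rough sketch of how these four types could be modeled in Python. The `ResultType` and `Result` names and the two helper methods are hypothetical, invented here for illustration; they are not part of looper or pypiper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class ResultType(Enum):
    STRING = "string"    # shows up in the metadata table
    NUMERIC = "numeric"  # shows up in the metadata table and the autoplotter
    IMAGE = "image"      # has a thumbnail attribute and a download link
    FILE = "file"        # has a download link (HTML results would go here)


@dataclass
class Result:
    name: str
    type: ResultType
    value: Union[str, int, float]    # text, a number, or a path for image/file
    thumbnail: Optional[str] = None  # only meaningful for IMAGE results

    def in_metadata_table(self) -> bool:
        # Only string and numeric results are listed in the metadata table.
        return self.type in (ResultType.STRING, ResultType.NUMERIC)

    def has_download_link(self) -> bool:
        # Image and file results are exposed as downloads.
        return self.type in (ResultType.IMAGE, ResultType.FILE)
```

With a single typed record like this, 'stats' and 'objects' would no longer need separate files; the type decides how each result is displayed.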

nsheff commented 4 years ago

In reviewing old issues, I have another thought... not only the actual reported results, but also the status of the run could fall under this same spec. Or is it a different spec?

The analog is the flag system. Pypiper is setting flags, and looper reads them with 'looper check'.

With a separate formal tool, we would outsource status tracking to that system. It seems to fit, since the summary tool is also used to watch progress as results get reported. We want to know when the job is complete, for example.

nsheff commented 4 years ago

This becomes pipestat. It is a python package with a CLI, which operates like:

pipestat stat_name stat_type value

e.g.

pipestat Aligned_reads numeric 3000000

It can also be called from Python:

import pipestat
psm = pipestat.PipeStatManager(database_connection_or_path)
psm.write("Aligned_reads", "numeric", 3000000)

We would document both the CLI and the Python API. Pypiper would use the Python API; any shell pipeline could use pipestat through its CLI.
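To make the idea concrete, here is a minimal sketch of a manager that persists results to a file. `SketchStatManager`, its JSON storage format, and the `read` method are all assumptions made for illustration; this is not the pipestat API.

```python
import json
import os


class SketchStatManager:
    """Hypothetical stand-in for the proposed manager, backed by a JSON file."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        # A results file that does not exist yet is treated as empty.
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as handle:
            return json.load(handle)

    def write(self, stat_name, stat_type, value):
        # Reject non-numbers (and booleans) reported under the numeric type.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if stat_type == "numeric" and not is_number:
            raise TypeError(f"{stat_name!r} is declared numeric but got {value!r}")
        results = self._load()
        results[stat_name] = {"type": stat_type, "value": value}
        with open(self.path, "w") as handle:
            json.dump(results, handle, indent=2)

    def read(self, stat_name):
        return self._load()[stat_name]["value"]
```

Because the object holds the path, a Python pipeline sets the target once and then calls `write` repeatedly; a CLI wrapper would have to rebuild that context on every invocation.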

pipestat summarize can create a table summarizing the reported results. It's the table function of looper, independent of looper. It just needs a list of the samples. Where does it get that list? It can take either a list of files or a database connection, and looper can manage whichever one is used.
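A sketch of what the file-based variant of summarize might look like. The `summarize` function, the flat JSON layout of each per-sample file, and the TSV output are assumptions for illustration only.

```python
import csv
import json


def summarize(sample_files, out_path):
    """Collect per-sample results into one tab-separated table.

    sample_files maps a sample name to the path of a JSON file holding
    {result_name: value}. Results missing for a sample are left blank.
    """
    records = {}
    for sample, path in sample_files.items():
        with open(path) as handle:
            records[sample] = json.load(handle)

    # Use the union of all result names so every sample gets the same columns.
    columns = sorted({key for stats in records.values() for key in stats})

    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["sample_name"] + columns)
        for sample, stats in records.items():
            writer.writerow([sample] + [stats.get(col, "") for col in columns])
    return columns
```

Looper's only job here would be to supply `sample_files`; the table logic itself has no dependency on looper.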

A few remaining questions:

For the CLI, how do we keep track of the sample we're writing to? A config file (-c pipestat_sample.yaml)? An env var ($PIPESTAT)? The Python API can handle this with the persistent object.
Is it worth connecting to a database as an option? For caravel?
The current pypiper stats have a pipeline identifier column, so multiple pipelines can write to one stats file. How will we handle that?
Should we let keys be dicts so they can be organized hierarchically? Like qc.aligned_reads, qc.total_reads.
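One possible answer to a few of these questions, sketched under assumptions: dotted keys are stored as nested dicts, results are namespaced by pipeline so several pipelines can share one store, and the CLI target is taken from an explicit argument with the `PIPESTAT` env var as a fallback. The function names `set_result` and `resolve_target` are hypothetical, and what `$PIPESTAT` would actually hold is undecided.

```python
import os


def set_result(store, pipeline, key, value):
    """Store value under a dotted key, inside the given pipeline's namespace."""
    node = store.setdefault(pipeline, {})
    *parents, leaf = key.split(".")
    for part in parents:
        # Walk down the hierarchy, creating intermediate levels as needed.
        node = node.setdefault(part, {})
    node[leaf] = value
    return store


def resolve_target(cli_value=None, environ=None):
    """Pick the results target: an explicit CLI value wins over $PIPESTAT."""
    environ = os.environ if environ is None else environ
    target = cli_value or environ.get("PIPESTAT")
    if target is None:
        raise ValueError("no target given and PIPESTAT is not set")
    return target
```

Namespacing by pipeline plays the role of the pipeline identifier column in the current pypiper stats file.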

nsheff commented 4 years ago

A use case of the CLI for pipestat: https://github.com/pepkit/hello_looper/issues/3

nsheff commented 4 years ago

Another related issue: Right now, looper reads flags output from pypiper, but expects these to be in a particular location.

The refgenie build process puts them in a subfolder of the canonical outfolder, to separate the build logs from the pipeline results that go into the archive. Because of this, looper can't find the flags and doesn't know which jobs are complete.

So, there needs to be:

a way for the pipeline interface to specify how to find the flags
hopefully, a formal specification for what the flags are, so that non-pypiper pipelines can use this feature.
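As a sketch of the first point, a flag lookup could accept a configurable subfolder instead of assuming the flags sit directly in the outfolder. The `find_status` function, the `flag_subdir` parameter, the status vocabulary, and the `<pipeline>_<status>.flag` naming are all assumptions here, loosely modeled on pypiper-style flag files rather than taken from any specification.

```python
import os

# Assumed status vocabulary; a formal specification would define this list.
STATUSES = ("running", "completed", "failed", "waiting", "partial")


def find_status(outfolder, pipeline_name, flag_subdir=""):
    """Return the statuses whose flag files exist for a pipeline.

    flag_subdir is where the pipeline interface says flags live, relative
    to outfolder; the empty default means the outfolder itself.
    """
    flag_dir = os.path.join(outfolder, flag_subdir)
    found = []
    for status in STATUSES:
        flag_path = os.path.join(flag_dir, f"{pipeline_name}_{status}.flag")
        if os.path.exists(flag_path):
            found.append(status)
    return found
```

In the refgenie case, the pipeline interface would point `flag_subdir` at the build-log subfolder, and the same lookup would work for any pipeline that writes flags by this convention.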

This issue was previously raised as pepkit/pipestat#34

Looper summarizer specification and spin-off #242