pepkit / pipestat

Pipeline results reporting package
https://pep.databio.org/pipestat/
BSD 2-Clause "Simplified" License

Integrate pipestat schema with looper output schema #20

Closed nsheff closed 11 months ago

nsheff commented 1 year ago

Could the pipestat schema be adapted to integrate with the looper output schema? See #16

The looper output schema formally specifies the output produced by this pipeline. It is used by downstream tools that need to be aware of the products of the pipeline for further visualization or analysis. Like the input schema, it is based on the extended PEP JSON-schema validation framework, but adds looper-specific capabilities.

Currently, the looper schema is used to define the output files of the pipeline. Each of these includes a path and the types are specified as being one of link, image, or file.

But pipelines also produce statistics -- primitive types, like string or int. In fact, that is what the pipestat schema specifies.

It would be more convenient if these two schemas became one. If looper is pipestat-aware, then the looper output schema should just be the pipestat schema. To do this:

Then, the two schemas could be combined; the pipeline interface would point to the pipestat schema as the output schema. Looper would use pipestat to manage outputs.

nsheff commented 1 year ago

@stolarczyk If you have any comments or insight here, it would be helpful now since we're going to do this soon.

nsheff commented 1 year ago

Old example pipestat schema (from docs).

number_of_things:
  type: integer
  description: "Number of things, min 20, multiple of 10"
  multipleOf: 10
  minimum: 20
name_of_something:
  type: string
  description: "Name of something, min len 2 characters"
  minLength: 2
collection_of_things:
  type: array
  items:
    type: string
  description: "This store collection of strings"
output_object:
  type: object
  properties:
    property1:
      array:
        items:
          type: integer
    property2:
      type: boolean
  required:
    - property1
  description: "Object output with required array of integers and optional boolean"

Old example looper output schema (from docs)

description: objects produced by PEPPRO pipeline.
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        smooth_bw: 
          path: "aligned_{genome}/{sample_name}_smooth.bw"
          type: string
          description: "A smooth bigwig file"
        aligned_bam: 
          path: "aligned_{genome}/{sample_name}_sort.bam"
          type: string
          description: "A sorted, aligned BAM file"
        peaks_bed: 
          path: "peak_calling_{genome}/{sample_name}_peaks.bed"
          type: string
          description: "Peaks in BED format"
  tss_file:
    title: "TSS enrichment file"
    description: "Plots TSS scores for each sample."
    thumbnail_path: "summary/{name}_TSSEnrichment.png"
    path: "summary/{name}_TSSEnrichment.pdf"
    type: image
  counts_table:
    title: "Project peak coverage file"
    description: "Project peak coverages: chr_start_end X sample"
    path: "summary/{name}_peaks_coverage.tsv"
    type: link

Combining these into the new "pipestat output schema", I could naively get something like this:

description: Pipestat output schema describing outputs of PEPPRO pipeline.
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        smooth_bw: 
          path: "aligned_{genome}/{sample_name}_smooth.bw"
          type: string
          description: "A smooth bigwig file"
        aligned_bam: 
          path: "aligned_{genome}/{sample_name}_sort.bam"
          type: string
          description: "A sorted, aligned BAM file"
        peaks_bed: 
          path: "peak_calling_{genome}/{sample_name}_peaks.bed"
          type: string
          description: "Peaks in BED format"
  tss_file:
    title: "TSS enrichment file"
    description: "Plots TSS scores for each sample."
    thumbnail_path: "summary/{name}_TSSEnrichment.png"
    path: "summary/{name}_TSSEnrichment.pdf"
    type: image
  counts_table:
    title: "Project peak coverage file"
    description: "Project peak coverages: chr_start_end X sample"
    path: "summary/{name}_peaks_coverage.tsv"
    type: link
  number_of_things:
    type: integer
    description: "Number of things, min 20, multiple of 10"
    multipleOf: 10
    minimum: 20
  name_of_something:
    type: string
    description: "Name of something, min len 2 characters"
    minLength: 2
  collection_of_things:
    type: array
    items:
      type: string
    description: "This store collection of strings"
  output_object:
    type: object
    properties:
      property1:
        array:
          items:
            type: integer
      property2:
        type: boolean
    required:
      - property1
    description: "Object output with required array of integers and optional boolean"
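
The naive combination above amounts to adding the pipestat results to the looper schema's `properties`. A minimal sketch of that merge on plain dictionaries follows. `merge_output_schemas` is a hypothetical helper written for illustration, not a function in pipestat or looper, and the two input dictionaries are trimmed copies of the examples above.

```python
# Hypothetical helper for illustration; not part of pipestat or looper.
def merge_output_schemas(looper_schema: dict, pipestat_schema: dict) -> dict:
    """Return a new schema whose 'properties' hold the results of both inputs."""
    merged = dict(looper_schema)
    properties = dict(looper_schema.get("properties", {}))
    # Refuse to silently overwrite a result defined in both schemas.
    clashes = set(properties) & set(pipestat_schema)
    if clashes:
        raise ValueError(f"Result names defined in both schemas: {sorted(clashes)}")
    properties.update(pipestat_schema)
    merged["properties"] = properties
    return merged


# Trimmed copies of the two example schemas, as plain dicts.
looper_schema = {
    "description": "objects produced by PEPPRO pipeline.",
    "properties": {
        "counts_table": {
            "title": "Project peak coverage file",
            "path": "summary/{name}_peaks_coverage.tsv",
            "type": "link",
        },
    },
}
pipestat_schema = {
    "number_of_things": {"type": "integer", "multipleOf": 10, "minimum": 20},
    "name_of_something": {"type": "string", "minLength": 2},
}

combined = merge_output_schemas(looper_schema, pipestat_schema)
print(sorted(combined["properties"]))
# ['counts_table', 'name_of_something', 'number_of_things']
```

The inputs are left unmodified, and a name clash raises an error instead of letting one schema win.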

vreuter commented 1 year ago

@nsheff the pipestat schema shown here doesn't seem to work (on master or on dev); did the docs get out of sync with the strictness / stringency of the validation? Here's what I get when trying to use the schema you linked as complex_schema.yaml:

(pipestat) vince@vr-think:~/code/pipestat$ pipestat retrieve -i number_of_things -s complex_schema.yaml 
Traceback (most recent call last):
  File "/home/vince/venvs_python/pipestat/bin/pipestat", line 8, in <module>
    sys.exit(main())
  File "/home/vince/venvs_python/pipestat/lib/python3.8/site-packages/pipestat/cli.py", line 29, in main
    psm = PipestatManager(
  File "/home/vince/venvs_python/pipestat/lib/python3.8/site-packages/pipestat/pipestat.py", line 172, in __init__
    self.validate_schema()
  File "/home/vince/venvs_python/pipestat/lib/python3.8/site-packages/pipestat/pipestat.py", line 859, in validate_schema
    schema = _recursively_replace_custom_types(schema)
  File "/home/vince/venvs_python/pipestat/lib/python3.8/site-packages/pipestat/pipestat.py", line 833, in _recursively_replace_custom_types
    _recursively_replace_custom_types(s[k][SCHEMA_PROP_KEY])
  File "/home/vince/venvs_python/pipestat/lib/python3.8/site-packages/pipestat/pipestat.py", line 826, in _recursively_replace_custom_types
    assert SCHEMA_TYPE_KEY in v, SchemaError(
AssertionError: Result 'property1' is missing 'type' key

FYI line numbers here are on dev
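
For what it's worth, the assertion message names property1, and in the example schema property1 is written with `array:` as a key rather than `type: array`, so it has no `type` key. That may be the cause, though it is not confirmed against pipestat itself. The sketch below is a hypothetical stand-in for the failing check, not pipestat's actual `_recursively_replace_custom_types`. It only shows how a recursive walk over nested `properties` would flag that entry.

```python
# Hypothetical stand-in for the check seen in the traceback; pipestat's real
# implementation may differ in detail.
def find_results_missing_type(results: dict, prefix: str = "") -> list:
    """Return dotted names of results that have no 'type' key."""
    missing = []
    for name, spec in results.items():
        dotted = f"{prefix}{name}"
        if "type" not in spec:
            missing.append(dotted)
        # Recurse into nested object properties, as the traceback suggests.
        if isinstance(spec.get("properties"), dict):
            missing.extend(find_results_missing_type(spec["properties"], dotted + "."))
    return missing


schema = {
    "number_of_things": {"type": "integer", "multipleOf": 10, "minimum": 20},
    "output_object": {
        "type": "object",
        "properties": {
            # Written as in the docs example: 'array' is a key, not a type.
            "property1": {"array": {"items": {"type": "integer"}}},
            "property2": {"type": "boolean"},
        },
        "required": ["property1"],
    },
}
print(find_results_missing_type(schema))  # ['output_object.property1']

# Declaring the type explicitly makes the same check pass.
schema["output_object"]["properties"]["property1"] = {
    "type": "array",
    "items": {"type": "integer"},
}
print(find_results_missing_type(schema))  # []
```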

nsheff commented 1 year ago

Hm, I don't have an answer to that. I am not sure why it's not working.

donaldcampbelljr commented 1 year ago

Discussion: continue to keep them integrated; we will discuss splitting them at a later date.

donaldcampbelljr commented 1 year ago

Eventually, looper report will be moved to pipestat and the output schemas will need to be aligned during that time. Moving this to pipestat milestone v0.5.0

donaldcampbelljr commented 1 year ago

Clarification: looper report still exists, but it will now use the pipestat summarizer if looper is configured to use pipestat.

Currently, if looper is pipestat-aware (i.e. we've passed looper a pipestat config file), the pipeline_interface must point to an output schema that is pipestat-compatible.

Regarding defined outputs, pipestat supports the following:


CLASSES_BY_TYPE = {
    "object": dict,
    "number": float,
    "integer": int,
    "string": str,
    "path": Path,
    "boolean": bool,
    "file": str,
    "image": str,
    "link": str,
    "array": list_of_dicts,
}
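
How pipestat applies this mapping is not shown in this thread. One plausible use is casting a reported value to the class registered for its declared type, sketched below. `coerce_result` is a hypothetical helper, and `"array"` maps to plain `list` here only because `list_of_dicts` is not defined in this thread.

```python
from pathlib import Path

# Simplified copy of the mapping above. Assumption: "array" maps to plain
# list, because list_of_dicts is not defined in this thread.
CLASSES_BY_TYPE = {
    "object": dict,
    "number": float,
    "integer": int,
    "string": str,
    "path": Path,
    "boolean": bool,
    "file": str,
    "image": str,
    "link": str,
    "array": list,
}


def coerce_result(value, declared_type: str):
    """Cast value to the class registered for declared_type (hypothetical helper)."""
    try:
        cls = CLASSES_BY_TYPE[declared_type]
    except KeyError:
        raise ValueError(f"Unsupported result type: {declared_type!r}") from None
    # Leave values of the right class untouched; otherwise attempt a cast.
    return value if isinstance(value, cls) else cls(value)


print(coerce_result("20", "integer"))  # 20
print(coerce_result(3, "number"))      # 3.0
```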

donaldcampbelljr commented 11 months ago

Currently, pipestat requires that each output be placed under 'samples' or 'project'.

So this output schema:

pipeline_name: test_pipe
properties:
  samples:
    number_of_things:
      type: integer
      description: "Number of things"
    percentage_of_things:
      type: number
      description: "Percentage of things"
    name_of_something:
      type: string
      description: "Name of something"

testing_project_result:
  type: string
  description: "misc project result"
more_number_of_things:
  type: integer
  description: "Number of things, min 20, multiple of 10"
  multipleOf: 10
  minimum: 20

must be written like so:

pipeline_name: test_pipe
samples:
  number_of_things:
    type: integer
    description: "Number of things"
  percentage_of_things:
    type: number
    description: "Percentage of things"
project:
  testing_project_result:
    type: string
    description: "misc project result"
  more_number_of_things:
    type: integer
    description: "Number of things, min 20, multiple of 10"
    multipleOf: 10
    minimum: 20

Otherwise, pipestat will throw an exception:

pipestat.exceptions.SchemaError: Extra top-level key(s) in given schema data
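
A rough sketch of what such a top-level check could look like is given below. It is not pipestat's code: the allowed key set is an assumption taken from the working example above, and `SchemaError` here is a local stand-in for `pipestat.exceptions.SchemaError`.

```python
# Assumption: only these top-level keys are accepted. The set is inferred
# from the working example above; the real list in pipestat may differ.
ALLOWED_TOP_LEVEL_KEYS = {"pipeline_name", "samples", "project"}


class SchemaError(Exception):
    """Local stand-in for pipestat.exceptions.SchemaError."""


def check_top_level_keys(schema: dict) -> None:
    """Raise SchemaError if the schema has keys outside the allowed set."""
    extra = sorted(set(schema) - ALLOWED_TOP_LEVEL_KEYS)
    if extra:
        raise SchemaError(
            f"Extra top-level key(s) in given schema data: {', '.join(extra)}"
        )


# Top-level layout of the first (rejected) schema above.
rejected = {
    "pipeline_name": "test_pipe",
    "properties": {"samples": {"number_of_things": {"type": "integer"}}},
    "testing_project_result": {"type": "string"},
    "more_number_of_things": {"type": "integer"},
}
# Top-level layout of the second (accepted) schema above.
accepted = {
    "pipeline_name": "test_pipe",
    "samples": {"number_of_things": {"type": "integer"}},
    "project": {"testing_project_result": {"type": "string"}},
}

check_top_level_keys(accepted)  # passes silently
try:
    check_top_level_keys(rejected)
except SchemaError as err:
    print(err)
```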