Closed nsheff closed 11 months ago
@stolarczyk If you have any comments or insight here, it would be helpful now, since we're going to do this soon.
Old example pipestat schema (from docs).
number_of_things:
  type: integer
  description: "Number of things, min 20, multiple of 10"
  multipleOf: 10
  minimum: 20
name_of_something:
  type: string
  description: "Name of something, min len 2 characters"
  minLength: 2
collection_of_things:
  type: array
  items:
    type: string
  description: "This store collection of strings"
output_object:
  type: object
  properties:
    property1:
      array:
        items:
          type: integer
    property2:
      type: boolean
  required:
    - property1
  description: "Object output with required array of integers and optional boolean"
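To make the constraints in this schema concrete, here is a minimal sketch of how `multipleOf`, `minimum`, and `minLength` could be checked against reported values. It is a hand-rolled illustration, not pipestat's validation code; `check_value` and `PY_TYPES` are hypothetical names, and only the keywords shown in the example above are handled.

```python
# Hypothetical helper, not part of pipestat: checks a value against the
# handful of keywords used in the example schema above.
PY_TYPES = {"integer": int, "string": str}


def check_value(value, spec):
    """Return a list of human-readable constraint violations (empty if none)."""
    expected = PY_TYPES.get(spec.get("type"))
    if expected is not None and not isinstance(value, expected):
        return [f"expected {spec['type']}"]
    problems = []
    if "minimum" in spec and value < spec["minimum"]:
        problems.append(f"below minimum {spec['minimum']}")
    if "multipleOf" in spec and value % spec["multipleOf"] != 0:
        problems.append(f"not a multiple of {spec['multipleOf']}")
    if "minLength" in spec and len(value) < spec["minLength"]:
        problems.append(f"shorter than {spec['minLength']} characters")
    return problems


number_of_things = {"type": "integer", "multipleOf": 10, "minimum": 20}
name_of_something = {"type": "string", "minLength": 2}

print(check_value(30, number_of_things))    # []
print(check_value(15, number_of_things))    # ['below minimum 20', 'not a multiple of 10']
print(check_value("x", name_of_something))  # ['shorter than 2 characters']
```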
Old example looper output schema (from docs)
description: objects produced by PEPPRO pipeline.
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        smooth_bw:
          path: "aligned_{genome}/{sample_name}_smooth.bw"
          type: string
          description: "A smooth bigwig file"
        aligned_bam:
          path: "aligned_{genome}/{sample_name}_sort.bam"
          type: string
          description: "A sorted, aligned BAM file"
        peaks_bed:
          path: "peak_calling_{genome}/{sample_name}_peaks.bed"
          type: string
          description: "Peaks in BED format"
  tss_file:
    title: "TSS enrichment file"
    description: "Plots TSS scores for each sample."
    thumbnail_path: "summary/{name}_TSSEnrichment.png"
    path: "summary/{name}_TSSEnrichment.pdf"
    type: image
  counts_table:
    title: "Project peak coverage file"
    description: "Project peak coverages: chr_start_end X sample"
    path: "summary/{name}_peaks_coverage.tsv"
    type: link
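The `path` values in this schema contain curly-brace placeholders such as `{genome}`, `{sample_name}`, and `{name}`. A minimal sketch of how such templates could be resolved, assuming they are filled from sample or project attributes with ordinary Python string formatting; `populate_path` and the attribute values are hypothetical and chosen only for illustration.

```python
def populate_path(template, attributes):
    """Fill a path template's {placeholders} from a mapping of attributes."""
    return template.format(**attributes)


# Made-up attribute values for one sample and one project.
sample = {"sample_name": "sample1", "genome": "hg38"}
project = {"name": "demo_project"}

print(populate_path("aligned_{genome}/{sample_name}_smooth.bw", sample))
# aligned_hg38/sample1_smooth.bw
print(populate_path("summary/{name}_TSSEnrichment.pdf", project))
# summary/demo_project_TSSEnrichment.pdf
```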
Combining these into the new "pipestat output schema", I could naively get something like this:
description: Pipestat output schema describing outputs of PEPPRO pipeline.
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        smooth_bw:
          path: "aligned_{genome}/{sample_name}_smooth.bw"
          type: string
          description: "A smooth bigwig file"
        aligned_bam:
          path: "aligned_{genome}/{sample_name}_sort.bam"
          type: string
          description: "A sorted, aligned BAM file"
        peaks_bed:
          path: "peak_calling_{genome}/{sample_name}_peaks.bed"
          type: string
          description: "Peaks in BED format"
  tss_file:
    title: "TSS enrichment file"
    description: "Plots TSS scores for each sample."
    thumbnail_path: "summary/{name}_TSSEnrichment.png"
    path: "summary/{name}_TSSEnrichment.pdf"
    type: image
  counts_table:
    title: "Project peak coverage file"
    description: "Project peak coverages: chr_start_end X sample"
    path: "summary/{name}_peaks_coverage.tsv"
    type: link
  number_of_things:
    type: integer
    description: "Number of things, min 20, multiple of 10"
    multipleOf: 10
    minimum: 20
  name_of_something:
    type: string
    description: "Name of something, min len 2 characters"
    minLength: 2
  collection_of_things:
    type: array
    items:
      type: string
    description: "This store collection of strings"
  output_object:
    type: object
    properties:
      property1:
        array:
          items:
            type: integer
      property2:
        type: boolean
    required:
      - property1
    description: "Object output with required array of integers and optional boolean"
@nsheff the pipestat schema shown here doesn't seem to work (on master or on dev); did the docs get out of sync with the strictness / stringency of the validation? Here's what I get when trying to use the schema you linked as complex_schema.yaml:
(pipestat) vince@vr-think:~/code/pipestat$ pipestat retrieve -i number_of_things -s complex_schema.yaml
Traceback (most recent call last):
  File "/home/vince/venvs_python/pipestat/bin/pipestat", line 8, in <module>
    sys.exit(main())
  File "/home/vince/venvs_python/pipestat/lib/python3.8/site-packages/pipestat/cli.py", line 29, in main
    psm = PipestatManager(
  File "/home/vince/venvs_python/pipestat/lib/python3.8/site-packages/pipestat/pipestat.py", line 172, in __init__
    self.validate_schema()
  File "/home/vince/venvs_python/pipestat/lib/python3.8/site-packages/pipestat/pipestat.py", line 859, in validate_schema
    schema = _recursively_replace_custom_types(schema)
  File "/home/vince/venvs_python/pipestat/lib/python3.8/site-packages/pipestat/pipestat.py", line 833, in _recursively_replace_custom_types
    _recursively_replace_custom_types(s[k][SCHEMA_PROP_KEY])
  File "/home/vince/venvs_python/pipestat/lib/python3.8/site-packages/pipestat/pipestat.py", line 826, in _recursively_replace_custom_types
    assert SCHEMA_TYPE_KEY in v, SchemaError(
AssertionError: Result 'property1' is missing 'type' key
FYI line numbers here are on dev
Hm, I don't have an answer to that. I am not sure why it's not working.
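One possible reading of the traceback: the validator descends into `properties` and asserts that every entry has a `type` key, and in the docs example `property1` has an `array:` key where a `type: array` entry might have been intended. The sketch below reproduces that kind of check on the `output_object` entry. It is not pipestat's code; `find_missing_types` is a hypothetical helper, and the "fixed" variant is a guess at the intended schema rather than a confirmed fix.

```python
def find_missing_types(results, prefix=""):
    """Collect names of results lacking a 'type' key, descending into 'properties'."""
    missing = []
    for name, spec in results.items():
        full_name = f"{prefix}{name}"
        if "type" not in spec:
            missing.append(full_name)
        nested = spec.get("properties")
        if isinstance(nested, dict):
            missing.extend(find_missing_types(nested, prefix=f"{full_name}."))
    return missing


# The output_object entry as written in the docs example.
as_documented = {
    "output_object": {
        "type": "object",
        "properties": {
            "property1": {"array": {"items": {"type": "integer"}}},
            "property2": {"type": "boolean"},
        },
        "required": ["property1"],
    }
}

# Assumed correction: declare property1 as an array of integers.
possibly_intended = {
    "output_object": {
        "type": "object",
        "properties": {
            "property1": {"type": "array", "items": {"type": "integer"}},
            "property2": {"type": "boolean"},
        },
        "required": ["property1"],
    }
}

print(find_missing_types(as_documented))      # ['output_object.property1']
print(find_missing_types(possibly_intended))  # []
```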
Discussion: continue to keep them integrated; we will discuss splitting them at a later date.
Eventually, looper report will be moved to pipestat, and the output schemas will need to be aligned at that time. Moving this to pipestat milestone v0.5.0.
Clarification: looper report still exists but will now use the pipestat summarizer if looper is configured to use pipestat.
Currently, if looper is pipestat aware (i.e. we've passed looper a pipestat config file), the pipeline_interface must point to an output schema that is pipestat compatible.
Regarding defined outputs, pipestat supports the following:
CLASSES_BY_TYPE = {
    "object": dict,
    "number": float,
    "integer": int,
    "string": str,
    "path": Path,
    "boolean": bool,
    "file": str,
    "image": str,
    "link": str,
    "array": list_of_dicts,
}
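As a rough illustration of how a type-name-to-class mapping like this can be used, the sketch below casts a reported value to the class registered for its schema type. It uses a local copy of the mapping in which plain `list` stands in for pipestat's `list_of_dicts` (an assumption made so the example runs on its own), and `coerce` is a hypothetical helper, not a pipestat function.

```python
from pathlib import Path

# Local copy for illustration; "array" uses plain list as a stand-in
# for pipestat's list_of_dicts (assumption).
CLASSES_BY_TYPE = {
    "object": dict,
    "number": float,
    "integer": int,
    "string": str,
    "path": Path,
    "boolean": bool,
    "file": str,
    "image": str,
    "link": str,
    "array": list,
}


def coerce(value, type_name):
    """Return value unchanged if it already has the target class, else cast it."""
    target = CLASSES_BY_TYPE[type_name]
    return value if isinstance(value, target) else target(value)


print(coerce("42", "integer"))              # 42
print(coerce(3, "number"))                  # 3.0
print(coerce("out/a.txt", "path") == Path("out/a.txt"))  # True
```

Casting is only safe for some types: `bool("False")` is `True` in Python, so a real implementation would need stricter handling for booleans.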
Currently, pipestat requires that each output be placed under 'samples' or 'project'.
So this output schema:
pipeline_name: test_pipe
properties:
  samples:
    number_of_things:
      type: integer
      description: "Number of things"
    percentage_of_things:
      type: number
      description: "Percentage of things"
  name_of_something:
    type: string
    description: "Name of something"
  testing_project_result:
    type: string
    description: "misc project result"
  more_number_of_things:
    type: integer
    description: "Number of things, min 20, multiple of 10"
    multipleOf: 10
    minimum: 20
must be written like so:
pipeline_name: test_pipe
samples:
  number_of_things:
    type: integer
    description: "Number of things"
  percentage_of_things:
    type: number
    description: "Percentage of things"
project:
  testing_project_result:
    type: string
    description: "misc project result"
  more_number_of_things:
    type: integer
    description: "Number of things, min 20, multiple of 10"
    multipleOf: 10
    minimum: 20
Otherwise, Pipestat will throw an exception:
pipestat.exceptions.SchemaError: Extra top-level key(s) in given schema data
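A minimal sketch of the kind of top-level check that could produce this error. The allowed key set is inferred from the accepted example above and is an assumption; pipestat itself may accept additional keys. `extra_top_level_keys` is a hypothetical helper, and the schemas are abbreviated versions of the two examples.

```python
# Assumption: inferred from the accepted example; not pipestat's actual list.
ALLOWED_TOP_LEVEL = {"pipeline_name", "samples", "project"}


def extra_top_level_keys(schema):
    """Return the sorted top-level keys that are not in the allowed set."""
    return sorted(set(schema) - ALLOWED_TOP_LEVEL)


rejected = {
    "pipeline_name": "test_pipe",
    "properties": {
        "samples": {"number_of_things": {"type": "integer"}},
    },
}

accepted = {
    "pipeline_name": "test_pipe",
    "samples": {"number_of_things": {"type": "integer"}},
    "project": {"testing_project_result": {"type": "string"}},
}

print(extra_top_level_keys(rejected))  # ['properties']
print(extra_top_level_keys(accepted))  # []
```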
Could the pipestat schema be adapted to integrate with the looper output schema? See #16
Currently, the looper schema is used to define the output files of the pipeline. Each of these includes a path, and the types are specified as being one of link, image, or file.
But pipelines also produce statistics -- primitive types, like string or int. In fact, that is what the pipestat schema specifies.
It would be more convenient if these two schemas became one. If looper is pipestat-aware, then the looper output schema should just be the pipestat schema. To do this:
Then, the two schemas could be combined; the pipeline interface would point to the pipestat schema as the output schema. Looper would use pipestat to manage outputs.
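To illustrate what a single combined schema would let a tool do, the sketch below splits the results of one schema into file-like outputs and primitive statistics. Treating an entry as file-like when its type is link, image, or file, or when it carries a `path` key, is an assumption drawn from the examples in this thread; `split_results` is a hypothetical helper, not looper or pipestat code.

```python
FILE_LIKE_TYPES = {"file", "image", "link"}


def split_results(results):
    """Partition results into (file-like outputs, primitive statistics)."""
    files, stats = {}, {}
    for name, spec in results.items():
        is_file_like = spec.get("type") in FILE_LIKE_TYPES or "path" in spec
        (files if is_file_like else stats)[name] = spec
    return files, stats


# Abbreviated entries taken from the example schemas in this thread.
combined = {
    "tss_file": {"type": "image", "path": "summary/{name}_TSSEnrichment.pdf"},
    "counts_table": {"type": "link", "path": "summary/{name}_peaks_coverage.tsv"},
    "number_of_things": {"type": "integer"},
    "name_of_something": {"type": "string"},
}

files, stats = split_results(combined)
print(sorted(files))  # ['counts_table', 'tss_file']
print(sorted(stats))  # ['name_of_something', 'number_of_things']
```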