Closed bentsherman closed 11 months ago
I had a quick test last night but the pipeline failed. I didn't see any outputs. I'll try again later - but any chance that we could still get it to generate a report even if the run fails? 🤔
At first with your instructions I got this:
$ nextflow run nf-core/rnaseq -c nextflow.config -profile docker,test --outdir results
N E X T F L O W ~ version 23.04.3
Launching `https://github.com/nf-core/rnaseq` [deadly_cori] DSL2 - revision: 3bec2331ca [master]
Unable to overwrite existing file manifest: /workspace/nf-prov/results
But if I changed the config line to prov.file = "${params.outdir}/prov", I got a file called results/prov: prov.json.zip
Yeah, testing now and got the same. I also got it when running with ${params.outdir}/prov if run twice, so apparently it always fails if the folder already exists. (Removing prov after the second run makes it work.)
(Checking the outputs now!)
I found that running validation against the JSON Schema for BioCompute Objects complains about some missing sections and fields.
Some of these fields and sections might not be so important, but adding stubs so that the validation runs through would help reveal whether there are any more subtle differences in the actual data.
What I tried:
1. Clone the IEEE 2791 schema:
git clone http://opensource.ieee.org/2791-object/ieee-2791-schema.git
2a. Run validation with kwalify:
sudo apt install kwalify
kwalify -f ieee-2791-schema/2791object.json bco.json
2b. Run validation with jsonschema from PyPI/conda:
conda install jsonschema
Create a file validate.py:
import jsonschema
import json
from jsonschema import validate

# Load BioCompute Object JSON and schema
with open('ieee-2791-schema/2791object.json') as schema_file:
    schema = json.load(schema_file)
with open('bco.json') as bco_file:
    bco = json.load(bco_file)

# Validate
try:
    validate(instance=bco, schema=schema)
    print("Validation successful.")
except jsonschema.ValidationError as e:
    print("Validation failed:", e)
Run it:
python validate.py |& tee validation.out | head
Validation for RO-Crate seems a bit thinner on the tooling side. But I tried this:
pip install requests pytest rocrate
pip install rocrateValidator
Create a file validate-rocrate.py:
from rocrateValidator import validate
v = validate.validate("ro-crate-metadata.json")
v.validator()
Run it:
python validate-rocrate.py |& tee validate-rocrate.out | head
Get some output:
$ python validate-rocrate.py |& tee validate-rocrate.out | head
This is an INVALID RO-Crate
{
...snip...
Otherwise, slightly off-topic here, but as a way to verify the file paths: I managed to parse the file paths and steps of bco.json into a DAG (code here), and this seems to work great!
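For reference, the DAG reconstruction can be sketched roughly like this (a hypothetical reimplementation, not the linked code): match each step's input URIs against the output URIs of the steps that produced them, using the description_domain.pipeline_steps section of bco.json. A minimal inline document stands in for a real BCO.

```python
import json

# Hypothetical sketch: rebuild the task DAG from a BCO's pipeline_steps
# by matching input URIs to the steps whose outputs produced them.
bco = json.loads("""
{
  "description_domain": {
    "pipeline_steps": [
      {"step_number": 1, "description": "RNASEQ:INDEX",
       "input_list": [{"uri": "data/transcriptome.fa"}],
       "output_list": [{"uri": "work/64/abc/index"}]},
      {"step_number": 2, "description": "RNASEQ:QUANT",
       "input_list": [{"uri": "work/64/abc/index"}],
       "output_list": [{"uri": "work/7a/def/ggal_gut"}]}
    ]
  }
}
""")

def bco_to_dag(bco):
    """Return (producer, consumer) edges between step descriptions."""
    steps = bco["description_domain"]["pipeline_steps"]
    producer = {}  # output URI -> description of the step that wrote it
    for step in steps:
        for out in step["output_list"]:
            producer[out["uri"]] = step["description"]
    return [(producer[inp["uri"]], step["description"])
            for step in steps
            for inp in step["input_list"]
            if inp["uri"] in producer]

print(bco_to_dag(bco))  # [('RNASEQ:INDEX', 'RNASEQ:QUANT')]
```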
The thing I noticed is that not much info about the steps themselves is included, such as the commands executed.
I see the execution_domain lists the main Nextflow script, so all this info can be referenced from there of course, but it is not included in the report.
I see in the BCO docs, though, that there perhaps isn't a great way to include that per step, so I gather it is the schema that is at fault here.
Of course it is possible to export this info in a separate custom format based on the .nextflow cache, as discussed earlier, or just parse the .command.sh files in the work folders before they are cleaned.
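As a rough illustration of the .command.sh approach, here is a sketch that assumes the usual work/<xx>/<hash> task-directory layout; the demo fabricates a throwaway work directory rather than reading a real run.

```python
import tempfile
from pathlib import Path

def collect_commands(work_dir):
    """Map task dir (relative to work/) -> contents of its .command.sh."""
    work_dir = Path(work_dir)
    return {
        str(p.parent.relative_to(work_dir)): p.read_text()
        for p in work_dir.glob("*/*/.command.sh")
    }

# Self-contained demo with a fabricated work directory:
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "64" / "1b807d0f"
    task.mkdir(parents=True)
    (task / ".command.sh").write_text("#!/bin/bash\nsalmon index ...\n")
    cmds = collect_commands(tmp)
    print(cmds)  # {'64/1b807d0f': '#!/bin/bash\nsalmon index ...\n'}
```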
So I guess it is outside the scope of the BioCompute standard. Still feels a bit weird not to include such crucial info in a declarative provenance report, so I guess one would have to package in some other artifacts too, to have a fully reproducible research object.
Sorry about those initial errors, think I made some last minute changes without testing. The BCO format produces two JSON files, one for the BCO and one for the RO-Crate which references the BCO. So the output path needs to be a directory that holds both files. I guess the overwrite check should just be more lenient.
I based the PR on this tutorial which manually creates a BCO for a nf-core/chipseq run. So there might be some things missing against the current standard. But, the standard itself is fuzzy in some aspects, so we should figure out how far we want to go to meet the standard.
Yeah, I also felt it is not exactly clear where to draw the line on meeting the standard.
It's a bummer that the validation tools do a full stop on the first "error". It would have been useful to use them to spot any divergences in the actual output...
Which tools are they? We could probably ask for help / contribute upstream to change things like that if it'd be helpful for us.. (especially the RO Crate ones).
I made a few fixes, so now the config file should work as is.
For now I have a catch-all config setting prov.metadata which we can use to insert any metadata not already covered by the manifest scope. We can build out this scope with whatever extra settings we think are important, and ideally even incorporate it into the manifest scope.
Glad to see this both for BCO and RO-Crate side!
The ro-crate-validator may be a bit too opinionated for particular use cases, so don't assume the RO-Crate is invalid even if it says so. Work on a more modular validator has been planned, but so far it has only been tested for the Workflow Run Crate profile.
For reference:
Hi @stain , thanks for the feedback. I think Phil is going to meet with you and some others tomorrow. We would love to get more feedback from people who are more familiar with these standards, see if there is anything we can improve. My main concerns are:
Listing the tasks (steps) and input/output files. Nextflow tasks produce these files in a work directory during execution, then "publishes" the outputs to their final location at the end. So should the provenance report only use the publish paths? The work directory paths are temporary, but on the other hand, they also define the links between tasks.
There seems to be a bunch of optional metadata fields about the pipeline, contributors, etc., much of which is not known to Nextflow. So I'm wondering how far we should go to provide this extra metadata, which parts are more important or more commonly used than others, etc.
Copying in some notes from the recent WorkflowHub meeting. Full notes are here.
- Should we put intermediate files into the RO-Crate even when they no longer exist, or should it only be the published files?
- Could put intermediates
- Semantic details are important: for example, imagine a workflow which consumes URLs (remote resource) - URL here is key
- Depends on workflow system how much filename matters
- Specific use case for nextflow?
- There has to be evidence of an intermediate file, but not necessarily the file itself - if there is a guarantee that a file can be accessed, then a different approach is needed (e.g. health)
- https://www.researchobject.org/workflow-run-crate/
- https://www.researchobject.org/ro-crate/
So, I think I read the answer to this:
So should the provenance report only use the publish paths?
As "no, report on workdir paths instead, even if they're temporary. Published paths would be a bonus."
Maybe @stain can correct me on this if I misinterpreted. I didn't find it super clear.
It would be great if the RO-Crate generated by the plugin conformed to a Workflow Run RO-Crate profile (https://www.researchobject.org/workflow-run-crate/profiles/, as linked by @stain above). My understanding is that the plugin has access to individual step executions, so the crate could be made to conform to the Provenance Run Crate profile, which is the most detailed.
A while ago I manually generated a Provenance Run Crate for an execution of the test.nf workflow:
It was based on the manifest.json generated by the plugin, which is also included in the crate. Some things are a bit forced, e.g. the FormalParameter @ids (I used a pattern based on CWLProv), but hopefully it can serve as an example.
Note that RO-Crate stores the data together with the metadata and thus uses paths relative to the crate root directory, i.e., the directory that hosts ro-crate-metadata.json. Intermediate files, when present, should also be included in the crate, but what matters from the RO-Crate metadata perspective is the path relative to the RO-Crate root (the plugin could copy intermediate files to the RO-Crate directory).
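A minimal sketch of that last point (a hypothetical helper, not part of the plugin): copy an intermediate file under the crate root and record the crate-relative path that would go into ro-crate-metadata.json. The "intermediates" subdirectory name is an assumption.

```python
import shutil
import tempfile
from pathlib import Path

def stage_into_crate(src, crate_root, subdir="intermediates"):
    """Copy src into crate_root/subdir and return the crate-relative path."""
    crate_root = Path(crate_root)
    dest = crate_root / subdir / Path(src).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    return dest.relative_to(crate_root).as_posix()

# Demo with throwaway files:
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "index"
    src.write_text("fake salmon index")
    rel = stage_into_crate(src, Path(tmp) / "crate")
    print(rel)  # intermediates/index
```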
We have a working group for Workflow Run RO-Crate that meets every two weeks. I think it would be great if you guys joined the group, instructions are here:
https://github.com/ResearchObject/workflow-run-crate/issues/1
Thanks everyone for the feedback. I see that the tutorial I originally used as a reference was creating some generic RO crate, but now there is the Workflow Run Crate standard, which looks similar to BCO in its substance, but perhaps more extensible because it is an RO crate.
I decided to remove the minimal RO crate from this PR and just render the BCO manifest. We can add the WRROC as a separate format in a separate PR, and also add the ability to render multiple formats for a single run so that they can be composed as needed.
As for the BCO format, the main thing left to do for this PR is to make the workflow inputs point to a URL instead of a local path (e.g. ${NXF_HOME}/.assets/nextflow-io/rnaseq-nf/multiqc -> https://github.com/nextflow-io/rnaseq-nf/tree/master/multiqc). There are several improvements that can still be made, but I will create separate issues for them instead.
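That rewrite could look roughly like this (a hypothetical sketch; the assets-cache layout and the default revision here are assumptions, not the plugin's actual logic):

```python
from pathlib import Path

def asset_to_url(path, nxf_home, revision="master"):
    """Map <nxf_home>/assets/<org>/<repo>/<path> to a GitHub tree URL."""
    rel = Path(path).relative_to(Path(nxf_home) / "assets")
    org, repo, *rest = rel.parts
    return f"https://github.com/{org}/{repo}/tree/{revision}/{'/'.join(rest)}"

print(asset_to_url("/home/user/.nextflow/assets/nextflow-io/rnaseq-nf/multiqc",
                   "/home/user/.nextflow"))
# https://github.com/nextflow-io/rnaseq-nf/tree/master/multiqc
```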
Our primary interest is in the BCO format because (as I understand it) the FDA recently adopted it as the standard for research artifacts. Does anyone know if the RO crate standard is a part of this in any way? That will help us prioritize our efforts.
@samuell Thanks for your suggestions and all the testing you did! I will try to incorporate your scripts into this project later on.
Updated BCO example with validation errors fixed:
{
"object_id": "urn:uuid:196ffc0a-7fb2-4c5c-8752-d9081ac40858",
"spec_version": "https://w3id.org/ieee/ieee-2791-schema/2791object.json",
"etag": "364e510a9602ae31fc0ed6feba5ddd01",
"provenance_domain": {
"name": "",
"version": "",
"created": "2023-09-27T21:28:13.821355019-05:00",
"modified": "2023-09-27T21:28:13.821355019-05:00",
"contributors": [
{
"contribution": [
"authoredBy"
],
"name": "Paolo Di Tommaso"
}
],
"license": ""
},
"usability_domain": [
],
"extension_domain": [
{
"extension_schema": "https://w3id.org/biocompute/extension_domain/1.1.0/scm/scm_extension.json",
"scm_extension": {
"scm_repository": "https://github.com/nextflow-io/rnaseq-nf",
"scm_type": "git",
"scm_commit": "d910312506c6539365ed70aacda5068dea9152dd",
"scm_path": "main.nf",
"scm_preview": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/main.nf"
}
}
],
"description_domain": {
"keywords": [
],
"platform": [
"Nextflow"
],
"pipeline_steps": [
{
"step_number": 1,
"name": "641b807d0f3fdb87ca247e807f6e013e",
"description": "RNASEQ:INDEX (ggal_1_48850000_49020000)",
"input_list": [
{
"uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
}
],
"output_list": [
{
"uri": "work/64/1b807d0f3fdb87ca247e807f6e013e/index"
}
]
},
{
"step_number": 2,
"name": "b0fde0a381b3abf254cba203158d78a5",
"description": "RNASEQ:FASTQC (FASTQC on ggal_gut)",
"input_list": [
{
"uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_1.fq"
},
{
"uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_2.fq"
}
],
"output_list": [
{
"uri": "work/b0/fde0a381b3abf254cba203158d78a5/fastqc_ggal_gut_logs"
}
]
},
{
"step_number": 3,
"name": "7a7e087d9ec32fc6b104c072ef42ee14",
"description": "RNASEQ:QUANT (ggal_gut)",
"input_list": [
{
"uri": "work/64/1b807d0f3fdb87ca247e807f6e013e/index"
},
{
"uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_1.fq"
},
{
"uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_2.fq"
}
],
"output_list": [
{
"uri": "work/7a/7e087d9ec32fc6b104c072ef42ee14/ggal_gut"
}
]
},
{
"step_number": 4,
"name": "8ec9b607fc6e5620c5437845fcf92fe2",
"description": "MULTIQC",
"input_list": [
{
"uri": "work/7a/7e087d9ec32fc6b104c072ef42ee14/ggal_gut"
},
{
"uri": "work/b0/fde0a381b3abf254cba203158d78a5/fastqc_ggal_gut_logs"
},
{
"uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/multiqc"
}
],
"output_list": [
{
"uri": "work/8e/c9b607fc6e5620c5437845fcf92fe2/multiqc_report.html"
}
]
}
]
},
"execution_domain": {
"script": [
"https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/main.nf"
],
"script_driver": "nextflow",
"software_prerequisites": [
{
"name": "Nextflow",
"version": "23.09.2-edge",
"uri": {
"uri": "https://github.com/nextflow-io/nextflow/releases/tag/v23.09.2-edge"
}
}
],
"external_data_endpoints": [
],
"environment_variables": {
}
},
"parametric_domain": [
{
"param": "outdir",
"value": "results",
"step": "0"
},
{
"param": "reads",
"value": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_{1,2}.fq",
"step": "0"
},
{
"param": "transcriptome",
"value": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa",
"step": "0"
},
{
"param": "multiqc",
"value": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/multiqc",
"step": "0"
}
],
"io_domain": {
"input_subdomain": [
{
"uri": {
"uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_1.fq"
}
},
{
"uri": {
"uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_2.fq"
}
},
{
"uri": {
"uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
}
},
{
"uri": {
"uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/multiqc"
}
}
],
"output_subdomain": [
{
"mediatype": "",
"uri": {
"filename": "work/b0/fde0a381b3abf254cba203158d78a5/fastqc_ggal_gut_logs",
"uri": "results/fastqc_ggal_gut_logs"
}
},
{
"mediatype": "text/html",
"uri": {
"filename": "work/8e/c9b607fc6e5620c5437845fcf92fe2/multiqc_report.html",
"uri": "results/multiqc_report.html"
}
}
]
},
"error_domain": {
"empirical_error": {
},
"algorithmic_error": {
}
}
}
Very good point! The ones I have used were:
- For BCO: jsonschema. But now as I checked the README, it seems they actually have an option for checking all errors 😄: "Lazy validation that can iteratively report all validation errors." [link]
- For RO-Crate, I've used ro-crate-validator-py
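For reference, the lazy-validation option in jsonschema is the iter_errors() method on a validator class; unlike validate(), it yields every violation instead of raising on the first one. A minimal sketch with a toy schema (standing in for the IEEE 2791 one):

```python
from jsonschema import Draft7Validator

# Toy schema, just to demonstrate iter_errors() reporting all violations.
schema = {
    "type": "object",
    "required": ["object_id", "spec_version", "etag"],
    "properties": {"etag": {"type": "string"}},
}
bco = {"etag": 123}  # two missing required fields, one type error

errors = sorted(Draft7Validator(schema).iter_errors(bco), key=str)
for err in errors:
    print(err.message)
# 'object_id' is a required property
# 'spec_version' is a required property
# 123 is not of type 'string'
```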
We also have this API endpoint for BCO validation: https://biocomputeobject.org/api/docs/#/BCO%20Management/api_objects_validate_create
ok nice, thanks @HadleyKing!
Doing a quick and dirty test by copying the above example into the Swagger interface I get the following response:
{
"urn:uuid:196ffc0a-7fb2-4c5c-8752-d9081ac40858": {
"number_of_errors": 0,
"error_detail": [
"BCO Valid"
],
"https://w3id.org/biocompute/extension_domain/1.1.0/scm/scm_extension.json": {
"number_of_errors": 0,
"error_detail": [
"Extension Valid"
]
}
}
}
So - I think that means that we're looking good..!
Close #2
To test the plugin:
The plugin will generate a bco.json and ro-crate-metadata.json in the results directory. Check out BcoRenderer.groovy to see how these files are generated.