Add BCO format

bentsherman commented 12 months ago

Close #2

To test the plugin:

git clone git@github.com:nextflow-io/nf-prov.git -b bco
cd nf-prov
make install

# use nextflow.config in the nf-prov directory
nextflow run [...]

The plugin will generate a bco.json and ro-crate-metadata.json in the results directory. Check out BcoRenderer.groovy to see how these files are generated.

ewels commented 12 months ago

I had a quick test last night but the pipeline failed. I didn't see any outputs. I'll try again later - but any chance that we could still get it to generate a report even if the run fails? 🤔

ewels commented 12 months ago

At first with your instructions I got this:

$ nextflow run nf-core/rnaseq -c nextflow.config -profile docker,test --outdir results
N E X T F L O W  ~  version 23.04.3
Launching `https://github.com/nf-core/rnaseq` [deadly_cori] DSL2 - revision: 3bec2331ca [master]
Unable to overwrite existing file manifest: /workspace/nf-prov/results

But when I changed the config line to prov.file = "${params.outdir}/prov", I got a file called results/prov: prov.json.zip

samuell commented 12 months ago

Yeah, testing now and I got the same. I also got it when running with ${params.outdir}/prov twice, so apparently it always fails if the folder already exists. (After removing prov following the second run, it works.)

(Checking the outputs now!)

samuell commented 12 months ago

I found that running validation against the JSON Schema for BioCompute Objects reports some missing sections and fields.

Some of these fields and sections might not be so important, but adding stubs so that the validation runs through could help reveal any more subtle differences in the actual data.

What I tried:

1. Get the schema definition:

git clone http://opensource.ieee.org/2791-object/ieee-2791-schema.git

2a. Run validation with kwalify:

sudo apt install kwalify
kwalify -f ieee-2791-schema/2791object.json bco.json

2b. Run validation with jsonschema from PyPI/conda:

conda install jsonschema

Create a file validate.py:

import json

from jsonschema import validate

# Load Biocompute Object JSON and schema
with open('ieee-2791-schema/2791object.json') as schema_file:
    schema = json.load(schema_file)

with open('bco.json') as bco_file:
    bco = json.load(bco_file)

# Validate
try:
    validate(instance=bco, schema=schema)
    print("Validation successful.")
except Exception as e:
    print("Validation failed:", e)

Run it:

python validate.py |& tee validation.out | head

samuell commented 12 months ago

Validation for RO-Crate seems a bit thinner on the tooling side. But I tried this:

Install some prerequisites not automatically installed with rocrateValidator:

pip install requests pytest rocrate

Install the validator:

pip install rocrateValidator

Create a python file validate-rocrate.py:

from rocrateValidator import validate as validate

v = validate.validate("ro-crate-metadata.json")
v.validator()

Run it:

python validate-rocrate.py |& tee validate-rocrate.out | head

Get some output:

$ python validate-rocrate.py |& tee validate-rocrate.out | head
This is an INVALID RO-Crate
{
...snip...

samuell commented 12 months ago

Otherwise, slightly off-topic here, but as a way to verify the filepaths, I managed to parse the filepaths and steps of bco.json into a DAG (code here), so this seems to work great!

The thing I noticed is that not much info about the steps themselves is included, such as the commands executed.

I see the execution_domain lists the main Nextflow script, so indeed, all this info will be referenced from there of course, but not included in the report.

I see in the BCO docs, though, that there perhaps isn't a great way to include that per step, so I gather it is the schema that is at fault here.

Of course, it is possible to export this info in a separate custom format based on the .nextflow cache, as discussed earlier, or to just parse the .command.sh files in the work folders before they are cleaned.

So I guess it is outside the scope of the BioCompute standard. Still feels a bit weird not to include such crucial info in a declarative provenance report, so I guess one would have to package in some other artifacts too, to have a fully reproducible research object.
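As a rough illustration of the last option, here is a minimal sketch that collects the .command.sh scripts from a Nextflow work directory. The function name collect_task_commands is hypothetical, and the two-level directory layout (work/<prefix>/<hash>/) is assumed from the work paths seen in the plugin output, not taken from the plugin's code.

```python
from pathlib import Path


def collect_task_commands(work_dir):
    """Map each task directory (relative to work_dir) to its .command.sh text."""
    work_dir = Path(work_dir)
    commands = {}
    # Assumed layout: <work_dir>/<2-char prefix>/<rest of hash>/.command.sh
    for script in sorted(work_dir.glob("*/*/.command.sh")):
        task_id = script.parent.relative_to(work_dir).as_posix()
        commands[task_id] = script.read_text()
    return commands


# Usage (run from the launch directory, before the work folders are cleaned):
#   for task_id, command in collect_task_commands("work").items():
#       print(f"== {task_id}\n{command}")
```

This only captures the shell script of each task, so it would complement the BCO report rather than replace it.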

bentsherman commented 12 months ago

Sorry about those initial errors, think I made some last minute changes without testing. The BCO format produces two JSON files, one for the BCO and one for the RO-Crate which references the BCO. So the output path needs to be a directory that holds both files. I guess the overwrite check should just be more lenient.

I based the PR on this tutorial which manually creates a BCO for a nf-core/chipseq run. So there might be some things missing against the current standard. But, the standard itself is fuzzy in some aspects, so we should figure out how far we want to go to meet the standard.

samuell commented 12 months ago

Yea, I also felt it is not exactly clear where to draw the line about meeting the standard.

It's a bummer that the validation tools do a full stop on the first "error". It would have been useful to use them to spot any divergences in the actual output...

ewels commented 12 months ago

Which tools are they? We could probably ask for help / contribute upstream to change things like that if it'd be helpful for us.. (especially the RO Crate ones).

samuell commented 12 months ago

Very good point! The ones I have used were:

For BCO: JSONSchema
- But now as I checked the README, it seems they actually have an option for checking all errors 😄:

  Lazy validation that can iteratively report all validation errors. [link]

For RO-Crate, I've used ro-crate-validator-py
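For the BCO side, the lazy validation mentioned in the README can be sketched roughly as below, using the iter_errors method of a jsonschema validator class. The helper name list_all_errors is hypothetical, the schema here is a toy stand-in rather than the IEEE 2791 schema, and Draft7Validator is an assumption about which draft the real schema targets.

```python
from jsonschema import Draft7Validator


def list_all_errors(instance, schema):
    """Return every validation error as 'path: message', not just the first."""
    validator = Draft7Validator(schema)
    errors = sorted(
        validator.iter_errors(instance),
        key=lambda e: [str(p) for p in e.path],
    )
    return [
        f"{'/'.join(str(p) for p in e.path) or '<root>'}: {e.message}"
        for e in errors
    ]


# Toy schema and instance for illustration only (not the IEEE 2791 schema).
schema = {
    "type": "object",
    "required": ["object_id", "etag"],
    "properties": {"etag": {"type": "string"}},
}
instance = {"etag": 42}

for line in list_all_errors(instance, schema):
    print(line)
```

With the real files, schema and instance would be loaded with json.load as in validate.py above; the real schema may reference other schema files, which this sketch does not handle.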

bentsherman commented 12 months ago

I made a few fixes, so now the config file should work as is.

For now I have a catch-all config setting prov.metadata which we can use to insert any metadata not already covered by the manifest scope. We can build out this scope with whatever extra settings we think are important, and ideally even incorporate it into the manifest scope.

stain commented 11 months ago

Glad to see this both for BCO and RO-Crate side!

The ro-crate-validator may be a bit too opinionated for particular use cases, so don't assume the RO-Crate is invalid even if it says so. Work on a more modular validator has been planned, but so far it has only been tested for the Workflow Run Crate profile.

For reference:

https://biocompute-objects.github.io/bco-ro-crate/ (predates Workflow Run Crate profiles)
https://www.researchobject.org/workflow-run-crate/profiles/

bentsherman commented 11 months ago

Hi @stain , thanks for the feedback. I think Phil is going to meet with you and some others tomorrow. We would love to get more feedback from people who are more familiar with these standards, see if there is anything we can improve. My main concerns are:

- Listing the tasks (steps) and input/output files. Nextflow tasks produce these files in a work directory during execution, then "publish" the outputs to their final location at the end. So should the provenance report only use the publish paths? The work directory paths are temporary, but on the other hand, they also define the links between tasks.
- There seems to be a bunch of optional metadata fields about the pipeline, contributors, etc., much of which is not known to Nextflow. So I'm wondering how far we should go to provide this extra metadata, which parts are more important or more commonly used than others, etc.

ewels commented 11 months ago

Copying in some notes from the recent WorkflowHub meeting. Full notes are here.

Should we put intermediate files that no longer exist into the RO-Crate, or should it only be the published files?

Could put intermediates

Semantic details are important: for example, imagine a workflow which consumes URLs (remote resource) - URL here is key

Depends on workflow system how much filename matters

Specific use case for nextflow?

There has to be evidence of an intermediate file, but not necessarily the file itself - if there is a guarantee that a file can be accessed, then a different approach is needed (e.g. health)

https://www.researchobject.org/workflow-run-crate/

https://www.researchobject.org/ro-crate/

So, I think I read the answer to this:

So should the provenance report only use the publish paths?

As "no, report on workdir paths instead, even if they're temporary. Published paths would be a bonus."

Maybe @stain can correct me on this if I misinterpreted. I didn't find it super clear.

simleo commented 11 months ago

It would be great if the RO-Crate generated by the plugin conformed to a Workflow Run RO-Crate profile (https://www.researchobject.org/workflow-run-crate/profiles/, as linked by @stain above). My understanding is that the plugin has access to individual step executions, so the crate could be made to conform to the Provenance Run Crate profile, which is the most detailed.

A while ago I manually generated a Provenance Run Crate for an execution of the test.nf workflow:

https://github.com/ResearchObject/workflow-run-crate/tree/86e5d481a4857b997b7b019b92e354c99c957135/docs/examples/draft/nf-prov-test-run-1

It was based on the manifest.json generated by the plugin, which is also included in the crate. Some things are a bit forced, e.g. the FormalParameter @ids (I used a pattern based on CWLProv), but hopefully it can serve as an example.

Note that RO-Crate stores the data together with the metadata and thus uses paths relative to the crate root directory, i.e., the directory that hosts ro-crate-metadata.json. Intermediate files, when present, should also be included in the crate, but what matters from the RO-Crate metadata perspective is the path relative to the RO-Crate root (the plugin could copy intermediate files to the RO-Crate directory).
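To make the path handling concrete, here is a minimal sketch of copying an intermediate file into the crate directory and deriving its crate-relative identifier. The function names, and the choice to mirror the work directory under a work/ folder inside the crate, are assumptions for illustration and not something the plugin is known to do.

```python
import shutil
from pathlib import Path


def copy_into_crate(src, work_dir, crate_root):
    """Copy a file from the work dir into the crate; return its crate-relative path."""
    src = Path(src)
    crate_root = Path(crate_root)
    # Keep the task hash in the path so files with equal names do not collide.
    dest = crate_root / "work" / src.relative_to(work_dir)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    return dest.relative_to(crate_root).as_posix()


def file_entity(crate_relative_path):
    """Build a minimal RO-Crate data entity for a file inside the crate."""
    return {"@id": crate_relative_path, "@type": "File"}
```

The sketch handles regular files only; task outputs that are directories would need shutil.copytree and a Dataset entity instead.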

simleo commented 11 months ago

We have a working group for Workflow Run RO-Crate that meets every two weeks. I think it would be great if you guys joined the group, instructions are here:

https://github.com/ResearchObject/workflow-run-crate/issues/1

bentsherman commented 11 months ago

Thanks everyone for the feedback. I see that the tutorial I originally used as a reference was creating some generic RO crate, but now there is the Workflow Run Crate standard, which looks similar to BCO in its substance, but perhaps more extensible because it is an RO crate.

I decided to remove the minimal RO crate from this PR and just render the BCO manifest. We can add the WRROC as a separate format in a separate PR, and also add the ability to render multiple formats for a single run so that they can be composed as needed.

As for the BCO format, the main thing left to do for this PR is to make the workflow inputs point to a URL instead of a local path (e.g. ${NXF_HOME}/.assets/nextflow-io/rnaseq-nf/multiqc -> https://github.com/nextflow-io/rnaseq-nf/tree/master/multiqc). There are several improvements that can still be made, but I will create separate issues for them instead.
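The path-to-URL mapping described here could look roughly like the following sketch. The function name asset_path_to_url is hypothetical, the assets directory is passed in explicitly rather than read from NXF_HOME, and it assumes the project is hosted on GitHub with an <owner>/<repo> layout under the assets directory, as in the example above.

```python
from pathlib import PurePosixPath


def asset_path_to_url(local_path, assets_dir, revision="master"):
    """Turn <assets_dir>/<owner>/<repo>/<path> into a GitHub tree URL."""
    rel = PurePosixPath(local_path).relative_to(assets_dir)
    # Raises ValueError if the path has no <owner>/<repo> prefix.
    owner, repo, *rest = rel.parts
    base = f"https://github.com/{owner}/{repo}/tree/{revision}"
    return "/".join([base, *rest])


print(
    asset_path_to_url(
        "/nxf_home/.assets/nextflow-io/rnaseq-nf/multiqc",
        "/nxf_home/.assets",
    )
)
```

Passing the commit hash as revision instead of a branch name would pin the URL to the exact version that was run.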

Our primary interest is in the BCO format because (as I understand it) the FDA recently adopted it as the standard for research artifacts. Does anyone know if the RO crate standard is a part of this in any way? That will help us prioritize our efforts.

@samuell Thanks for your suggestions and all the testing you did! I will try to incorporate your scripts into this project later on.

bentsherman commented 11 months ago

Updated BCO example with validation errors fixed:

{
    "object_id": "urn:uuid:196ffc0a-7fb2-4c5c-8752-d9081ac40858",
    "spec_version": "https://w3id.org/ieee/ieee-2791-schema/2791object.json",
    "etag": "364e510a9602ae31fc0ed6feba5ddd01",
    "provenance_domain": {
        "name": "",
        "version": "",
        "created": "2023-09-27T21:28:13.821355019-05:00",
        "modified": "2023-09-27T21:28:13.821355019-05:00",
        "contributors": [
            {
                "contribution": [
                    "authoredBy"
                ],
                "name": "Paolo Di Tommaso"
            }
        ],
        "license": ""
    },
    "usability_domain": [

    ],
    "extension_domain": [
        {
            "extension_schema": "https://w3id.org/biocompute/extension_domain/1.1.0/scm/scm_extension.json",
            "scm_extension": {
                "scm_repository": "https://github.com/nextflow-io/rnaseq-nf",
                "scm_type": "git",
                "scm_commit": "d910312506c6539365ed70aacda5068dea9152dd",
                "scm_path": "main.nf",
                "scm_preview": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/main.nf"
            }
        }
    ],
    "description_domain": {
        "keywords": [

        ],
        "platform": [
            "Nextflow"
        ],
        "pipeline_steps": [
            {
                "step_number": 1,
                "name": "641b807d0f3fdb87ca247e807f6e013e",
                "description": "RNASEQ:INDEX (ggal_1_48850000_49020000)",
                "input_list": [
                    {
                        "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
                    }
                ],
                "output_list": [
                    {
                        "uri": "work/64/1b807d0f3fdb87ca247e807f6e013e/index"
                    }
                ]
            },
            {
                "step_number": 2,
                "name": "b0fde0a381b3abf254cba203158d78a5",
                "description": "RNASEQ:FASTQC (FASTQC on ggal_gut)",
                "input_list": [
                    {
                        "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_1.fq"
                    },
                    {
                        "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_2.fq"
                    }
                ],
                "output_list": [
                    {
                        "uri": "work/b0/fde0a381b3abf254cba203158d78a5/fastqc_ggal_gut_logs"
                    }
                ]
            },
            {
                "step_number": 3,
                "name": "7a7e087d9ec32fc6b104c072ef42ee14",
                "description": "RNASEQ:QUANT (ggal_gut)",
                "input_list": [
                    {
                        "uri": "work/64/1b807d0f3fdb87ca247e807f6e013e/index"
                    },
                    {
                        "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_1.fq"
                    },
                    {
                        "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_2.fq"
                    }
                ],
                "output_list": [
                    {
                        "uri": "work/7a/7e087d9ec32fc6b104c072ef42ee14/ggal_gut"
                    }
                ]
            },
            {
                "step_number": 4,
                "name": "8ec9b607fc6e5620c5437845fcf92fe2",
                "description": "MULTIQC",
                "input_list": [
                    {
                        "uri": "work/7a/7e087d9ec32fc6b104c072ef42ee14/ggal_gut"
                    },
                    {
                        "uri": "work/b0/fde0a381b3abf254cba203158d78a5/fastqc_ggal_gut_logs"
                    },
                    {
                        "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/multiqc"
                    }
                ],
                "output_list": [
                    {
                        "uri": "work/8e/c9b607fc6e5620c5437845fcf92fe2/multiqc_report.html"
                    }
                ]
            }
        ]
    },
    "execution_domain": {
        "script": [
            "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/main.nf"
        ],
        "script_driver": "nextflow",
        "software_prerequisites": [
            {
                "name": "Nextflow",
                "version": "23.09.2-edge",
                "uri": {
                    "uri": "https://github.com/nextflow-io/nextflow/releases/tag/v23.09.2-edge"
                }
            }
        ],
        "external_data_endpoints": [

        ],
        "environment_variables": {

        }
    },
    "parametric_domain": [
        {
            "param": "outdir",
            "value": "results",
            "step": "0"
        },
        {
            "param": "reads",
            "value": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_{1,2}.fq",
            "step": "0"
        },
        {
            "param": "transcriptome",
            "value": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa",
            "step": "0"
        },
        {
            "param": "multiqc",
            "value": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/multiqc",
            "step": "0"
        }
    ],
    "io_domain": {
        "input_subdomain": [
            {
                "uri": {
                    "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_1.fq"
                }
            },
            {
                "uri": {
                    "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_2.fq"
                }
            },
            {
                "uri": {
                    "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
                }
            },
            {
                "uri": {
                    "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/multiqc"
                }
            }
        ],
        "output_subdomain": [
            {
                "mediatype": "",
                "uri": {
                    "filename": "work/b0/fde0a381b3abf254cba203158d78a5/fastqc_ggal_gut_logs",
                    "uri": "results/fastqc_ggal_gut_logs"
                }
            },
            {
                "mediatype": "text/html",
                "uri": {
                    "filename": "work/8e/c9b607fc6e5620c5437845fcf92fe2/multiqc_report.html",
                    "uri": "results/multiqc_report.html"
                }
            }
        ]
    },
    "error_domain": {
        "empirical_error": {

        },
        "algorithmic_error": {

        }
    }
}
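Since the work-directory URIs link one step's output_list to another step's input_list, the task graph can be recovered from the pipeline_steps alone. The sketch below is one way to do that; it is not the DAG script mentioned earlier in the thread, the function name step_edges is hypothetical, and the inline sample is a trimmed copy of the example above.

```python
def step_edges(bco):
    """Return (producer, consumer) step descriptions linked by a shared URI."""
    steps = bco["description_domain"]["pipeline_steps"]
    # Index every output URI by the step that produced it.
    producer = {}
    for step in steps:
        for out in step["output_list"]:
            producer[out["uri"]] = step["description"]
    edges = []
    for step in steps:
        for inp in step["input_list"]:
            source = producer.get(inp["uri"])
            if source is not None:  # external inputs have no producing step
                edges.append((source, step["description"]))
    return edges


# Trimmed sample with the same structure as the bco.json above.
sample = {
    "description_domain": {
        "pipeline_steps": [
            {
                "description": "RNASEQ:INDEX",
                "input_list": [{"uri": "https://example.org/genome.fa"}],
                "output_list": [{"uri": "work/64/1b807d/index"}],
            },
            {
                "description": "RNASEQ:QUANT",
                "input_list": [{"uri": "work/64/1b807d/index"}],
                "output_list": [{"uri": "work/7a/7e087d/ggal_gut"}],
            },
        ]
    }
}

for source, target in step_edges(sample):
    print(f"{source} -> {target}")
```

Inputs given as remote URLs never match an output URI, so they simply produce no edge.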

HadleyKing commented 8 months ago

We also have this API endpoint for BCO validation: https://biocomputeobject.org/api/docs/#/BCO%20Management/api_objects_validate_create

ewels commented 7 months ago

ok nice, thanks @HadleyKing!

Doing a quick and dirty test by copying the above example into the Swagger interface I get the following response:

{
  "urn:uuid:196ffc0a-7fb2-4c5c-8752-d9081ac40858": {
    "number_of_errors": 0,
    "error_detail": [
      "BCO Valid"
    ],
    "https://w3id.org/biocompute/extension_domain/1.1.0/scm/scm_extension.json": {
      "number_of_errors": 0,
      "error_detail": [
        "Extension Valid"
      ]
    }
  }
}

So - I think that means that we're looking good..!
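If this check were automated, the response could be summarised with something like the sketch below. It only assumes the response shape shown above (nested objects carrying a number_of_errors field); the function name total_errors is hypothetical and no API call is made here.

```python
def total_errors(node):
    """Sum every 'number_of_errors' value found anywhere in the response."""
    if isinstance(node, dict):
        own = node.get("number_of_errors", 0)
        return own + sum(total_errors(value) for value in node.values())
    if isinstance(node, list):
        return sum(total_errors(value) for value in node)
    return 0  # strings, numbers and other leaves carry no nested counts


# Response copied from the Swagger test above.
response = {
    "urn:uuid:196ffc0a-7fb2-4c5c-8752-d9081ac40858": {
        "number_of_errors": 0,
        "error_detail": ["BCO Valid"],
        "https://w3id.org/biocompute/extension_domain/1.1.0/scm/scm_extension.json": {
            "number_of_errors": 0,
            "error_detail": ["Extension Valid"],
        },
    }
}

print("valid" if total_errors(response) == 0 else "invalid")
```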

nextflow-io / nf-prov

Add BCO format #3