nextflow-io / nf-prov

Apache License 2.0
27 stars 12 forks source link

Improve WrrocRenderer #33

Open fbartusch opened 5 months ago

fbartusch commented 5 months ago

Add workflow file to crate. Correct path to output files in metadata file. Add CreateActions for tool executions. Add ControlActions. Add SoftwareApplications for processes. Add HowToSteps. Add agent information (orcid, name) to nextflow.config that is incorporated into metadata file.

simleo commented 5 months ago

I've installed the plugin with make install and used the following nextflow.config:

plugins {
    id 'nf-prov@1.1.0'
}

params {
    outdir = 'results'
}

prov {
    enabled = true
    formats {
        wrroc {
            file = "${params.outdir}/ro-crate-metadata.json"
            overwrite = true
            agent {
                name = "John Doe"
                orcid = "https://orcid.org/0000-0000-0000-0000"
            }
        }
    }
}

I had to pin the plugin's version because otherwise Nextflow downloads nf-prov-1.2.2 and uses that (which fails because it does not support wrroc as a format). Then I ran nextflow run tests/test.nf and got a results RO-Crate. Here are the problems I've found:

"result": [
    [
        {
            "@id": "r1.foo.1.txt"
        },
        {
            "@id": "r1.foo.2.txt"
        }
    ]
],

should be:

"result": [
    {
        "@id": "r1.foo.1.txt"
    },
    {
        "@id": "r1.foo.2.txt"
    }
],

I made these changes manually to move on with the review. At this point, Runcrate reads the crate (runcrate report results) without crashing. Here's the output:

action: #606ce08c-cac1-4777-8a1b-83f497a0c63d
  instrument: test.nf (['File', 'SoftwareSourceCode', 'ComputationalWorkflow', 'HowTo'])
  started: 2024-06-20T11:59:14.123266278+02:00
  ended: 2024-06-20T11:59:14.897242324+02:00
  outputs:
    r1.foo.1.txt
    r1.foo.2.txt
    r2.foo.2.txt
    r2.foo.1.txt
    r3.foo.1.txt
    r3.foo.2.txt

action: #71e33a99d3115bb57a7cdfc1b220fbbf
  step: test.nf#main/RNG
  instrument: #Script_923b8caac4a54f39@66e17eff (SoftwareApplication)
  outputs:
    r1.foo.1.txt
    r1.foo.2.txt

action: #a9b6314416b001fb728ff45fa29d6f5e
  step: test.nf#main/RNG
  instrument: #Script_923b8caac4a54f39@66e17eff (SoftwareApplication)
  outputs:
    r2.foo.1.txt
    r2.foo.2.txt

action: #45b19d72e22ba819cf02b88e4977bc86
  step: test.nf#main/RNG
  instrument: #Script_923b8caac4a54f39@66e17eff (SoftwareApplication)
  outputs:
    r3.foo.1.txt
    r3.foo.2.txt

I've also noticed a weird thing: if you run rm results -rf && nextflow run tests/test.nf (without changing the value of outdir in nextflow.config) repeatedly, sometimes (like 1 in 20 times) the ro-crate-metadata.json file is missing from the results dir.

fbartusch commented 4 months ago

Thanks for the comprehensive feedback @simleo . The next time I will check if runcrate can parse the document before asking for feedback ...

I fixed most of the issues, but this causes some headache:

sometimes (like 1 in 20 times) the ro-crate-metadata.json file is missing from the results dir.

I can reproduce the problem. I wrote a script that removes the results directory, reruns the pipeline and checks if ro-crate-metadata.json ist present. It seems that the workflowOutputs mapping is missing entries for some output files.

Next I will use the newest Nextflow version and maybe the newest nf-prov version, maybe this problem is addressed there. For me it doesn't seem that the problem originates from the WrrocRenderer itself.

simleo commented 4 months ago

I've run the workflow again with the plugin at ac05da2, using the same nextflow.config. The issues I've reported in my previous comment seem to be solved (the sometimes-missing metadata file problem should be solved after the integration of c92b21f, if I understand correctly). Great job! This is another round of review.

{
    "@id": "#outdir",
    "@type": "FormalParameter",
    "name": "outdir",
    ...
},
{
    "@id": "#constant",
    "@type": "FormalParameter",
    "name": "constant",
    ...
},
{
    "@id": "#b5236af9-95e9-42b8-9121-a6a57755ee1b",
    "@type": "CreateAction",
    "instrument": {
        "@id": "test.nf"
    },
    "object": [
        {"@id": "#outdir-pv"},
        {"@id": "#constant-pv"}
    ],
    "result": [
        {"@id": "r3.bar.2.txt"},
        {"@id": "r3.bar.1.txt"},
        {"@id": "r2.bar.1.txt"},
        {"@id": "r2.bar.2.txt"},
        {"@id": "r1.bar.1.txt"},
        {"@id": "r1.bar.2.txt"}
    ]
    ...
},
{
    "@id": "#outdir-pv",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "#outdir"},
    "name": "outdir",
    "value": "crate"
},
{
    "@id": "#constant-pv",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "#constant"},
    "name": "constant",
    "value": "bar"
}

When run with nextflow run tests/test.nf the outcome should be the same except for the "value" fields of the PropertyValue entities, which should be "results" and "foo" (the defaults). Note that the PropertyValue instances link to the corresponding FormalParameters via exampleOfWork.

fbartusch commented 3 months ago

First, thank you for the review @simleo.

bentsherman commented 3 months ago

session.config should contain the resolved configuration. it also contains the resolved params as session.config.params

simleo commented 2 months ago

I've tested the latest version: big improvements! As reported above, the main thing that's missing now is completing the information related to the CreateActions corresponding to the execution of RNG. A remark on this version:

simleo commented 2 months ago
  • If a license is specified in the configuration (e.g. license = "Apache-2.0") it's added to the Root Data Entity. If the specified string is a valid URL it's added with "license": "{@ id: < URL >}". But the CreativeWork entry for the URL of the license is missing in the RO-Crate metadata. Is this needed or does the URL is enough? If yes I try to add it.

If a "license": {"@id": "< URL >"} property is added to the root data entity, then it would be appropriate to add an entity with that id to the crate. For instance:

{
    "@id": "./",
    "@type": "Dataset",
    "license": {
        "@id": "https://spdx.org/licenses/Apache-2.0"
    },
    ...
},
{
    "@id": "https://spdx.org/licenses/Apache-2.0",
    "@type": "CreativeWork"
}
fbartusch commented 3 weeks ago

I tested the created RO-Crate with the new 0.4.0 version of rocrate-validator and fixed the last requirements, @simleo. Added datePublished to Datasetand author to the root data entity. You can find the newest metadata file for the nf-prov example workflow here.

rocrate-validator validate --requirement-severity REQUIRED --profile-identifier provenance-run-crate-0.5 --no-fail-fast --output-format json --output-file /tmp/nf-prov/1.1.0/ro-crate-validation.json  <crate>

{
    "rocrate": "/tmp/nf-prov/1.1.0/demo/",
    "validation_settings": {
        "data_path": "/tmp/nf-prov/1.1.0/demo/",
        "profiles_path": "/home/felix/miniconda3/envs/roc-validator_python3.9_env/lib/python3.9/site-packages/rocrate_validator/profiles",
        "profile_identifier": "provenance-run-crate-0.5",
        "inherit_profiles": true,
        "requirement_severity": "REQUIRED",
        "abort_on_first": false
    },
    "passed": true,
    "issues": []
}

I ran rocrate-validator with --requirement-severity RECOMMENDED and get a very long list of of required and recommended things I should add. I'm working on them now.

simleo commented 3 weeks ago

I ran rocrate-validator with --requirement-severity RECOMMENDED and get a very long list of of required and recommended things I should add. I'm working on them now.

Hold on, @fbartusch, there seems to be a problem with the JSON output format. It looks like it's reporting as REQUIRED some issues that actually correspond to RECOMMENDED shapes. @kikkomep is investigating the problem.

kikkomep commented 3 weeks ago

I ran rocrate-validator with --requirement-severity RECOMMENDED and get a very long list of of required and recommended things I should add. I'm working on them now.

Hold on, @fbartusch, there seems to be a problem with the JSON output format. It looks like it's reporting as REQUIRED some issues that actually correspond to RECOMMENDED shapes. @kikkomep is investigating the problem.

The issue has been fixed in the latest release, version 0.4.2