Open fbartusch opened 5 months ago
I've installed the plugin with make install
and used the following nextflow.config
:
plugins {
id 'nf-prov@1.1.0'
}
params {
outdir = 'results'
}
prov {
enabled = true
formats {
wrroc {
file = "${params.outdir}/ro-crate-metadata.json"
overwrite = true
agent {
name = "John Doe"
orcid = "https://orcid.org/0000-0000-0000-0000"
}
}
}
}
I had to pin the plugin's version because otherwise Nextflow downloads nf-prov-1.2.2
and uses that (which fails because it does not support wrroc
as a format). Then I ran nextflow run tests/test.nf
and got a results
RO-Crate. Here are the problems I've found:
The CreateAction
corresponding to test.nf
(#606ce08c-cac1-4777-8a1b-83f497a0c63d
) has an "id"
property which should be "@id"
instead, so the RO-Crate is not valid. I've manually changed that to "@id"
in order to continue with the review.
The other CreateAction
entities have a "result"
property that has a nested list, which shouldn't be there. For instance, this:
"result": [
[
{
"@id": "r1.foo.1.txt"
},
{
"@id": "r1.foo.2.txt"
}
]
],
should be:
"result": [
{
"@id": "r1.foo.1.txt"
},
{
"@id": "r1.foo.2.txt"
}
],
I made these changes manually to move on with the review. At this point, Runcrate reads the crate (runcrate report results
) without crashing. Here's the output:
action: #606ce08c-cac1-4777-8a1b-83f497a0c63d
instrument: test.nf (['File', 'SoftwareSourceCode', 'ComputationalWorkflow', 'HowTo'])
started: 2024-06-20T11:59:14.123266278+02:00
ended: 2024-06-20T11:59:14.897242324+02:00
outputs:
r1.foo.1.txt
r1.foo.2.txt
r2.foo.2.txt
r2.foo.1.txt
r3.foo.1.txt
r3.foo.2.txt
action: #71e33a99d3115bb57a7cdfc1b220fbbf
step: test.nf#main/RNG
instrument: #Script_923b8caac4a54f39@66e17eff (SoftwareApplication)
outputs:
r1.foo.1.txt
r1.foo.2.txt
action: #a9b6314416b001fb728ff45fa29d6f5e
step: test.nf#main/RNG
instrument: #Script_923b8caac4a54f39@66e17eff (SoftwareApplication)
outputs:
r2.foo.1.txt
r2.foo.2.txt
action: #45b19d72e22ba819cf02b88e4977bc86
step: test.nf#main/RNG
instrument: #Script_923b8caac4a54f39@66e17eff (SoftwareApplication)
outputs:
r3.foo.1.txt
r3.foo.2.txt
The "@id"
of ControlAction
3968b3e5-b5c8-4b24-9e03-d118801e589e
should be #3968b3e5-b5c8-4b24-9e03-d118801e589e
.
The ComputationalWorkflow
has outdir
as an input parameter, but the only parameter you can set when launching the workflow is constant
(which is not listed as an input parameter in the RO-Crate metadata). If I try to change the value of outdir
from results
to crate
(either in nextflow.config
or on the command line with --outdir crate
), files are still output into results
; moreover, in this case the ro-crate-metadata.json
and test.nf
files are missing (they can be found in the crate
dir but only if you create it manually before running the workflow).
I've also noticed a weird thing: if you run rm results -rf && nextflow run tests/test.nf
(without changing the value of outdir
in nextflow.config
) repeatedly, sometimes (like 1 in 20 times) the ro-crate-metadata.json
file is missing from the results
dir.
Thanks for the comprehensive feedback @simleo . The next time I will check if runcrate can parse the document before asking for feedback ...
I fixed most of the issues, but this causes some headache:
sometimes (like 1 in 20 times) the ro-crate-metadata.json file is missing from the results dir.
I can reproduce the problem.
I wrote a script that removes the results directory, reruns the pipeline and checks if ro-crate-metadata.json
ist present. It seems that the workflowOutputs mapping is missing entries for some output files.
Next I will use the newest Nextflow version and maybe the newest nf-prov version, maybe this problem is addressed there. For me it doesn't seem that the problem originates from the WrrocRenderer itself.
I've run the workflow again with the plugin at ac05da2, using the same nextflow.config
. The issues I've reported in my previous comment seem to be solved (the sometimes-missing metadata file problem should be solved after the integration of c92b21f, if I understand correctly). Great job! This is another round of review.
if the workflow is run without passing parameters (nextflow run tests/test.nf
), the #constant
FormalParameter
is missing from the RO-Crate metadata (only #outdir
is present). If it's run with nextflow run tests/test.nf --constant bar
, #constant
is present. It should be present in both cases.
The CreateAction
corresponding to the workfow execution has an empty "object"
. It should contain references to PropertyValue
entities corresponding to the values taken by the parameters during the specific run. For instance, when run with nextflow run tests/test.nf --outdir crate --constant bar
, the result should look like this:
{
"@id": "#outdir",
"@type": "FormalParameter",
"name": "outdir",
...
},
{
"@id": "#constant",
"@type": "FormalParameter",
"name": "constant",
...
},
{
"@id": "#b5236af9-95e9-42b8-9121-a6a57755ee1b",
"@type": "CreateAction",
"instrument": {
"@id": "test.nf"
},
"object": [
{"@id": "#outdir-pv"},
{"@id": "#constant-pv"}
],
"result": [
{"@id": "r3.bar.2.txt"},
{"@id": "r3.bar.1.txt"},
{"@id": "r2.bar.1.txt"},
{"@id": "r2.bar.2.txt"},
{"@id": "r1.bar.1.txt"},
{"@id": "r1.bar.2.txt"}
]
...
},
{
"@id": "#outdir-pv",
"@type": "PropertyValue",
"exampleOfWork": {"@id": "#outdir"},
"name": "outdir",
"value": "crate"
},
{
"@id": "#constant-pv",
"@type": "PropertyValue",
"exampleOfWork": {"@id": "#constant"},
"name": "constant",
"value": "bar"
}
When run with nextflow run tests/test.nf
the outcome should be the same except for the "value" fields of the PropertyValue
entities, which should be "results"
and "foo"
(the defaults). Note that the PropertyValue
instances link to the corresponding FormalParameter
s via exampleOfWork
.
The three CreateAction
s corresponding to the executions of RNG
should also have an "object"
pointing to PropertyValue
instances, which in turn should be connected via exampleOfWork
to FormalParameter
s of the SoftwareApplication
corresponding to RNG
.
Since the output RO-Crate conforms to the Workflow RO-Crate profile, its Root Data Entity must have a "license"
. This can be specified in nextflow.config
, like agent
.
nextflow.config
should be included in the output RO-Crate. In the metadata file, it should be linked via object
from the OrganizeAction
, as specified in the Provenance Run Crate profile.
First, thank you for the review @simleo.
Parameters are now added to RO-Crate metadata regardless if they are specified in the configuration or inside Nextflow's workflow files. This was simple, just use session.getBinding().getParams()
this should return all parameters that are active during workflow execution.
CreateAction
of the workflow references now PropertyValue
s via object
. PropertyValue
link to the FormalParameter
via exampleOfWork
.
If a license is specified in the configuration (e.g. license = "Apache-2.0"
) it's added to the Root Data Entity. If the specified string is a valid URL it's added with "license": "{@ id: < URL >}". But the CreativeWork
entry for the URL of the license is missing in the RO-Crate metadata. Is this needed or does the URL is enough? If yes I try to add it.
I started to think about the nextflow.config
thing. There several places where the configuration can reside. The documentation explains it . There can be four different configuration files that take precedence in a specific order. Maybe it doesn't suffice to just copy all the files into the crate, because how can we decide which one takes precedence over the others? I'm thinking about compiling one 'effective' configuration file containing the values that are valid during workflow exeuction and put that one into the crate.
I still have to do the PropertyValue
stuff to the three CreateActions
corresponding to the RNG
executions.
session.config
should contain the resolved configuration. it also contains the resolved params as session.config.params
I've tested the latest version: big improvements! As reported above, the main thing that's missing now is completing the information related to the CreateAction
s corresponding to the execution of RNG
. A remark on this version:
nextflow.config
, the root data entity is populated with a "license": "null"
property. In this case there should not be any "license"
property at all instead.
- If a license is specified in the configuration (e.g.
license = "Apache-2.0"
) it's added to the Root Data Entity. If the specified string is a valid URL it's added with "license": "{@ id: < URL >}". But theCreativeWork
entry for the URL of the license is missing in the RO-Crate metadata. Is this needed or does the URL is enough? If yes I try to add it.
If a "license": {"@id": "< URL >"}
property is added to the root data entity, then it would be appropriate to add an entity with that id to the crate. For instance:
{
"@id": "./",
"@type": "Dataset",
"license": {
"@id": "https://spdx.org/licenses/Apache-2.0"
},
...
},
{
"@id": "https://spdx.org/licenses/Apache-2.0",
"@type": "CreativeWork"
}
I tested the created RO-Crate with the new 0.4.0 version of rocrate-validator
and fixed the last requirements, @simleo.
Added datePublished
to Dataset
and author
to the root data entity. You can find the newest metadata file for the nf-prov example workflow here.
rocrate-validator validate --requirement-severity REQUIRED --profile-identifier provenance-run-crate-0.5 --no-fail-fast --output-format json --output-file /tmp/nf-prov/1.1.0/ro-crate-validation.json <crate>
{
"rocrate": "/tmp/nf-prov/1.1.0/demo/",
"validation_settings": {
"data_path": "/tmp/nf-prov/1.1.0/demo/",
"profiles_path": "/home/felix/miniconda3/envs/roc-validator_python3.9_env/lib/python3.9/site-packages/rocrate_validator/profiles",
"profile_identifier": "provenance-run-crate-0.5",
"inherit_profiles": true,
"requirement_severity": "REQUIRED",
"abort_on_first": false
},
"passed": true,
"issues": []
}
I ran rocrate-validator
with --requirement-severity RECOMMENDED
and get a very long list of of required and recommended things I should add. I'm working on them now.
I ran
rocrate-validator
with--requirement-severity RECOMMENDED
and get a very long list of of required and recommended things I should add. I'm working on them now.
Hold on, @fbartusch, there seems to be a problem with the JSON output format. It looks like it's reporting as REQUIRED
some issues that actually correspond to RECOMMENDED
shapes. @kikkomep is investigating the problem.
I ran
rocrate-validator
with--requirement-severity RECOMMENDED
and get a very long list of of required and recommended things I should add. I'm working on them now.Hold on, @fbartusch, there seems to be a problem with the JSON output format. It looks like it's reporting as
REQUIRED
some issues that actually correspond toRECOMMENDED
shapes. @kikkomep is investigating the problem.
The issue has been fixed in the latest release, version 0.4.2
Add workflow file to crate. Correct path to output files in metadata file. Add CreateActions for tool executions. Add ControlActions. Add SoftwareApplications for processes. Add HowToSteps. Add agent information (orcid, name) to nextflow.config that is incorporated into metadata file.