nextflow-io / nf-prov

Apache License 2.0
23 stars 11 forks source link

Published files sometimes missing in nf-prov output #35

Closed fbartusch closed 1 month ago

fbartusch commented 1 month ago

Bug report

I'm opening the issue here and not in the nf-prov repo, because I don't know if the underlying problem is in the nextflow or in the nf-prov codebase (see Additional context for more info).

Expected behavior and actual behavior

Using the nf-prov plugin, published files are sometimes missing from the Map<Path,Path> workflowOutputs mapping (used here and therefore in bco.json.

Steps to reproduce the problem

Use the nf-prov pluing and its test.nf workflow.

nextflow.config:

plugins {
    id 'nf-prov'
}

params {
    outdir = 'results', 
}

prov {
    formats {
        bco {
            file = "${params.outdir}/bco.json"
            overwrite = true
        }
        dag {
            file = "${params.outdir}/dag.html"
            overwrite = true
        }
        legacy {
            file = "${params.outdir}/manifest.json"
            overwrite = true
        }
    }
}

The problem does not occur every time, it happens ~10% of the time. I'm using a loop and copy the bco.json files every time. faulty files are a bit smaller in file size than correct files.

#!/bin/bash
mkdir /tmp/bco

for i in {0..100}
do
  echo "Run ${i}"
  rm results -rf
  nextflow run tests/test.nf -plugins "nf-prov"
  cp results/bco.json /tmp/bco/bco${i}.json
done

Program output

Here is an excerpt of a problematic bco.json file (five output files missing):

        "output_subdomain": [
            {
                "mediatype": "text/plain",
                "uri": {
                    "filename": "work/3a/db2154e7c3f00fdd63f2b6644f6b17/r2.foo.1.txt",
                    "uri": "results/r2.foo.1.txt"
                }
            }
        ]

This is how it should look like:

"output_subdomain": [
            {
                "mediatype": "text/plain",
                "uri": {
                    "filename": "work/57/e04ba742c3c9d451bd6c439cd0919d/r3.foo.1.txt",
                    "uri": "results/r3.foo.1.txt"
                }
            },
            {
                "mediatype": "text/plain",
                "uri": {
                    "filename": "work/5e/556e0eac9f5feb596c236c7eb1f935/r2.foo.1.txt",
                    "uri": "results/r2.foo.1.txt"
                }
            },
            {
                "mediatype": "text/plain",
                "uri": {
                    "filename": "work/1a/c1d66b0614c379a780c9b49ce86c23/r1.foo.1.txt",
                    "uri": "results/r1.foo.1.txt"
                }
            },
            {
                "mediatype": "text/plain",
                "uri": {
                    "filename": "work/57/e04ba742c3c9d451bd6c439cd0919d/r3.foo.2.txt",
                    "uri": "results/r3.foo.2.txt"
                }
            },
            {
                "mediatype": "text/plain",
                "uri": {
                    "filename": "work/1a/c1d66b0614c379a780c9b49ce86c23/r1.foo.2.txt",
                    "uri": "results/r1.foo.2.txt"
                }
            },
            {
                "mediatype": "text/plain",
                "uri": {
                    "filename": "work/5e/556e0eac9f5feb596c236c7eb1f935/r2.foo.2.txt",
                    "uri": "results/r2.foo.2.txt"
                }
            }
        ]

Environment

Additional context

The underlying problem is, that sometimes published files are missing in the workflowOutputs mapping. This mapping is populated by ProvObserver in the nf-prov repo, but I think a race condition in the nextflow code could be the problem. Maybe onFlowComplete() is called before the all published files are properly handled by the code in PublishDir.groovy and therefore added to the `workflowOutputs mapping.

bentsherman commented 1 month ago

I think I naively assumed that there would not be a race condition, but that is likely what is happening since files can be published concurrently. I will update nf-prov to handle concurrent publish events correctly

bentsherman commented 1 month ago

If you want, you can test my theory making either making the onFilePublish method synchronized or using a ConcurrentHashMap for the workflowOutputs

fbartusch commented 1 month ago

Making the methods synchronized solved the issue. I needed to add @Synchronized also to onProcessComplete and onProcessCached, because sometimes tasks were also missing in the task list given to the plugin.

sstrong99 commented 1 month ago

I am writing a similar plugin, and came across this problem. I was able to solve it by using Collections.synchronized*() objects. I.e.

private Set<TaskRun> tasks = Collections.synchronizedList([])
private Map<Path,Path> workflowOutputs = Collections.synchronizedMap([:])

instead of

private Set<TaskRun> tasks = []
private Map<Path,Path> workflowOutputs = [:]

This has solved the issue as far as my testing can tell. I'm curious though if you guys think that some of the other solutions mentioned here are preferable for some reason? Or if they are all pretty much equivalent.

bentsherman commented 1 month ago

They should all work -- you could synchronize the event handlers, the data structures, use a ConcurrentHashMap, or synchronize with a lock like I did here in nf-boost.

The ConcurrentHashMap is probably the most performant because it should impose less synchronization overhead. Might make a difference for very large pipelines but can't say for sure without testing it.

I will probably just use a lock for nf-prov, that's what we usually do because it gives you better control over the synchronization. Also there appears to be an issue with virtual threads and synchronized methods which makes me wary of using synchronized for now.

bentsherman commented 1 month ago

Should be fixed by c92b21f, will be included in the next release

Feel free to re-open if you see the issue again