nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
2.78k stars 632 forks source link

Conda OverlappingFileLockException may not be resolved? #5530

Open jfy133 opened 4 days ago

jfy133 commented 4 days ago

Bug report

I am running nf-core/taxprofiler on gitpod with conda.

When I run the basic test profile: nextflow run nf-core/taxprofiler -profile test,conda --outdir results/, I get the OverlappingFileLockException error again.

This is despite using the latest patch version of Nextflow, as reported at the top of the console on execution

 N E X T F L O W   ~  version 24.10.1

Expected behavior and actual behavior

Nextflow can build multiple conda environments simulteanously without any overlapping lock files

Steps to reproduce the problem

Open nf-core/taxprofiler repo on gitpod.io

Run

nextflow run nf-core/taxprofiler -profile test,conda --outdir results/

Program output

Plus 12 more processes waiting for tasks…
Creating env using conda: /workspace/taxprofiler/./workflows/../subworkflows/local/../../modules/nf-core/nonpareil/nonpareil/environment.yml [cache /workspace/taxprofiler/testing/work/conda/env-ad3bc2eba60b0862-96eea98fae0f03231a4ca48234a15efd]
Creating env using conda: /workspace/taxprofiler/./workflows/../subworkflows/local/../../modules/nf-core/bbmap/bbduk/environment.yml [cache /workspace/taxprofiler/testing/work/conda/env-a0f2bf3ebbb3c91a-06878572316a4c4de51f2bdb49c72937]
ERROR ~ Error executing process > 'NFCORE_TAXPROFILER:TAXPROFILER:SHORTREAD_PREPROCESSING:FASTQC_PROCESSED (1)'

Caused by:
  java.nio.channels.OverlappingFileLockException

 -- Check '.nextflow.log' file for details
ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting

 -- Check '.nextflow.log' file for details
-[nf-core/taxprofiler] Pipeline completed with errors-
WARN: Killing running tasks (5)

Environment

Additional context

(Add any other context about the problem here)

nschan commented 4 days ago

The same issue occurred a while ago in the charliecloud driver, and I am pretty sure that it is caused by nextflow's FileMutex(), but I never figured out how / why.

pditommaso commented 4 days ago

Please include the full log file

jfy133 commented 4 days ago

Please include the full log file

Sorry I couldn't include that at the time, was having internet issues so the only thing I could get was the console.

I'm away from a computer until Monday, @nschan if you can share a log from your side that'll would be appreciated.

I'll also ask on nf-core if anyone else gets the same thing with other pipelines

nschan commented 4 days ago

I do not have logs for this anymore, the problem in charliecloud was fixed a while ago.

jfy133 commented 3 days ago

I was able to to distract my son for 5 minutes to try again, and I got the same error again (although interestingly @mahesh-panchal was trying independently also on gitpod and did not get it)

Log file here:

nextflow(1).log

pditommaso commented 3 days ago

Thanks, the relevant information is this

java.lang.IllegalStateException: java.nio.channels.OverlappingFileLockException
    at nextflow.conda.CondaCache.getCachePathFor(CondaCache.groovy:395)
    at org.codehaus.groovy.vmplugin.v8.IndyInterface.fromCache(IndyInterface.java:321)
    at nextflow.processor.TaskRun.getCondaEnv0(TaskRun.groovy:647)
    at jdk.internal.reflect.GeneratedMethodAccessor158.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
    at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnCurrentN(ScriptBytecodeAdapter.java:96)
    at nextflow.processor.TaskRun.this$dist$invoke$1(TaskRun.groovy)
    at nextflow.processor.TaskRun$2.methodMissing(TaskRun.groovy)
    at jdk.internal.reflect.GeneratedMethodAccessor157.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
    at groovy.lang.MetaClassImpl.invokeMissingMethod(MetaClassImpl.java:924)
    at groovy.lang.MetaClassImpl.invokePropertyOrMissing(MetaClassImpl.java:1413)
    at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1335)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
    at org.codehaus.groovy.vmplugin.v8.IndyInterface.fromCache(IndyInterface.java:321)
    at nextflow.processor.TaskRun$2.apply(TaskRun.groovy:638)
    at nextflow.processor.TaskRun$2.apply(TaskRun.groovy)
    at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1708)
    at org.codehaus.groovy.vmplugin.v8.IndyInterface.fromCache(IndyInterface.java:321)
    at nextflow.processor.TaskRun.getCondaEnv(TaskRun.groovy:635)
    at org.codehaus.groovy.vmplugin.v8.IndyInterface.fromCache(IndyInterface.java:321)
    at nextflow.processor.TaskProcessor.createTaskHashKey(TaskProcessor.groovy:2224)
    at org.codehaus.groovy.vmplugin.v8.IndyInterface.fromCache(IndyInterface.java:321)
    at nextflow.processor.TaskProcessor.invokeTask(TaskProcessor.groovy:654)
    at nextflow.processor.InvokeTaskAdapter.call(InvokeTaskAdapter.groovy:52)
    at groovyx.gpars.dataflow.operator.DataflowOperatorActor.startTask(DataflowOperatorActor.java:120)
    at groovyx.gpars.dataflow.operator.ForkingDataflowOperatorActor.access$001(ForkingDataflowOperatorActor.java:35)
    at groovyx.gpars.dataflow.operator.ForkingDataflowOperatorActor$1.run(ForkingDataflowOperatorActor.java:58)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.nio.channels.OverlappingFileLockException: null
    at java.base/sun.nio.ch.FileLockTable.checkList(FileLockTable.java:229)
    at java.base/sun.nio.ch.FileLockTable.add(FileLockTable.java:123)
    at java.base/sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1321)
    at java.base/java.nio.channels.FileChannel.tryLock(FileChannel.java:1191)
    at nextflow.file.FileMutex.lock(FileMutex.groovy:90)
    at nextflow.conda.CondaCache.createLocalCondaEnv(CondaCache.groovy:267)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
    at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
    at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
    at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
    at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
    at nextflow.conda.CondaCache$_getLazyImagePath_closure3.doCall(CondaCache.groovy:372)
    at nextflow.conda.CondaCache$_getLazyImagePath_closure3.call(CondaCache.groovy)
    at groovyx.gpars.dataflow.LazyDataflowVariable$1.run(LazyDataflowVariable.java:70)
jfy133 commented 3 days ago

Note it happens less if I re-run repeatedly (Without resume). So it (subjectively) appears to only happen more often on the first run in a given directory.

ewels commented 3 days ago

Makes sense if it's due to Nextflow concurrently creating new Conda environments. Re-runs will use existing envs from previous runs.

mahesh-panchal commented 3 days ago

So far I've done 5 runs (of taxprofiler) in Gitpod, sometimes with fresh Gitpod, and other times doing rm -rf .nextflow* work results, and haven't got it yet. It's not an easy one to reproduce.

jorgee commented 3 days ago

I have the impression it could be a race condition between two processes using the same conda file.

nschan commented 3 days ago

Based on my observations with charliecloud, I think this error will occur when two processes that use the same environment try to create it simultaneously and somehow both manage to create the FileMutex() file. I am obviously not an expert, but with charliecloud I had 0 success in avoiding this (or implementing any other locking mechanism), I am mainly surprised this only surfaced now in conda and wonder why this does not happen with e.g. singularity, which uses exactly the same code in the cache.

See discussion of this issue here: https://github.com/nextflow-io/nextflow/issues/3367 and here https://github.com/nextflow-io/nextflow/pull/4712#issuecomment-1934009475

jorgee commented 3 days ago

I think the synchronized in this part of code should include the whole method https://github.com/nextflow-io/nextflow/blob/e025a846ecc471c42277ac17e800a4d59c15611b/modules/nextflow/src/main/groovy/nextflow/conda/CondaCache.groovy#L364-L374

pditommaso commented 3 days ago

Indeed, likely the sync "key" needs to be reviewed

jorgee commented 21 hours ago

I was wrong, the issue is not the sync. It is happening when the path in the conda directive is the same but it is passed as a different string. I mean "/nf-core/taxprofiler/workflows/../subworkflows/local/../../modules/nf-core/fastqc/environment.yml" is the same as "/nf-core/taxprofiler/workflows/modules/nf-core/fastqc/environment.yml" but as strings are different. I think it is happening with these three processes: 'NFCORE_TAXPROFILER:TAXPROFILER:FASTQC' , 'NFCORE_TAXPROFILER:TAXPROFILER:SHORTREAD_PREPROCESSING:FASTQC_PROCESSED' and 'NFCORE_TAXPROFILER:TAXPROFILER:LONGREAD_PREPROCESSING:FASTQC_PROCESSED'.

pditommaso commented 21 hours ago

Then using path.toAbsolutePath().normalize() should solve