Closed alanbchristie closed 3 years ago
Please include the .nextflow.log
file
Here's the entire log, retrieved from the most recent failure...
Feb-18 14:36:06.044 [main] DEBUG nextflow.cli.Launcher - $> nextflow run -c /root/ansible/roles/combine/../../../nextflow/nextflow.config /root/ansible/roles/combine/../../../nextflow/deduplicate.nf -w /work/nextflow -with-report /work/logs/deduplicate_edges.csv_report.html
Feb-18 14:36:06.172 [main] INFO nextflow.cli.CmdRun - N E X T F L O W ~ version 21.02.0-edge
Feb-18 14:36:06.211 [main] INFO nextflow.cli.CmdRun - Launching `/root/ansible/roles/combine/../../../nextflow/deduplicate.nf` [stoic_meitner] - revision: 4f84699cac
Feb-18 14:36:06.231 [main] DEBUG nextflow.config.ConfigBuilder - Found config home: /root/.nextflow/config
Feb-18 14:36:06.232 [main] DEBUG nextflow.config.ConfigBuilder - Found config base: /root/nextflow/nextflow.config
Feb-18 14:36:06.237 [main] DEBUG nextflow.config.ConfigBuilder - User config file: /root/ansible/roles/combine/../../../nextflow/nextflow.config
Feb-18 14:36:06.237 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /root/.nextflow/config
Feb-18 14:36:06.238 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /root/nextflow/nextflow.config
Feb-18 14:36:06.238 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /root/nextflow/nextflow.config
Feb-18 14:36:06.270 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Feb-18 14:36:07.241 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Feb-18 14:36:07.260 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Feb-18 14:36:07.316 [main] DEBUG nextflow.plugin.PluginsFacade - Setting up plugin manager > mode=prod; plugins-dir=/root/.nextflow/plugins
Feb-18 14:36:07.318 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins default=[]
Feb-18 14:36:07.321 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins local root: .nextflow/plr/empty
Feb-18 14:36:07.331 [main] INFO org.pf4j.DefaultPluginStatusProvider - Enabled plugins: []
Feb-18 14:36:07.332 [main] INFO org.pf4j.DefaultPluginStatusProvider - Disabled plugins: []
Feb-18 14:36:07.338 [main] INFO org.pf4j.DefaultPluginManager - PF4J version 3.4.1 in 'deployment' mode
Feb-18 14:36:07.351 [main] INFO org.pf4j.AbstractPluginManager - No plugins
Feb-18 14:36:07.432 [main] DEBUG nextflow.Session - Session uuid: 790b7728-e222-441f-ad38-3c7016248bd5
Feb-18 14:36:07.433 [main] DEBUG nextflow.Session - Run name: stoic_meitner
Feb-18 14:36:07.434 [main] DEBUG nextflow.Session - Executor pool size: 2
Feb-18 14:36:07.479 [main] ERROR nextflow.cli.Launcher - @unknown
java.lang.AssertionError: Memory unit cannot be a negative number. Expression: (value >= 0). Values: value = -102400
at org.codehaus.groovy.runtime.InvokerHelper.assertFailed(InvokerHelper.java:438)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.assertFailed(ScriptBytecodeAdapter.java:670)
at nextflow.util.MemoryUnit.<init>(MemoryUnit.groovy:62)
at nextflow.cli.CmdInfo.freeSwap(CmdInfo.groovy:173)
at nextflow.cli.CmdInfo.getInfo(CmdInfo.groovy:192)
at nextflow.cli.CmdInfo.status(CmdInfo.groovy:157)
at nextflow.cli.CmdRun.run(CmdRun.groovy:290)
at nextflow.cli.Launcher.run(Launcher.groovy:475)
at nextflow.cli.Launcher.main(Launcher.groovy:657)
Just a thought ...
The error looks impossible because it looks like you're checking the value before you pass it into the object that does the actual assert!
... except ...
Is it actually safe to revisit the os.freeSwapSpaceSize
property after you've tested it?
I'm not familiar with the OperatingSystemMXBean
object but is it safe to expect a naturally dynamic operating system property to remain the same between the time it takes to check it and then use it ... without copying the value first?
In CmdInfo.groovy
(I assume I'm looking at the latest code) we have...
os.freeSwapSpaceSize>=0 ? new MemoryUnit(os.freeSwapSpaceSize) : '-'
But doesn't this make an assumption ... that os.freeSwapSpaceSize
isn't actually changing in the background? i.e. the assumption here is that: no matter how long you wait, the value will always be the same.
I know the two value references are on the same line but they could actually have very different values between being checked and used ... especially as a) the line is probably not atomic (i.e. can be interrupted) and b) you're not in a real-time OS (so significant time can elapse between the check and the use).
And what happens if changes to os.freeSwapSpaceSize
are not atomic? They should be, but what if the underlying implementation (very OS-specific) is updating a multi-byte value in a non-atomic fashion? That's where you can occasionally observe negative values.
Isn't is safer (unless you know how the object behaves) to do something like......
long freeSwapSpaceSize = os.freeSwapSpaceSize
freeSwapSpaceSize>=0 ? new MemoryUnit(freeSwapSpaceSize) : '-'
Damn, it's really annoying this thing. No idea why it's happening but at this point make sense to wrap into a try catch statement
Yep...or copy the value.
I'm only guessing but I can't explain the error any other way ... the underlying data in this case is probably volatile and ever-changing, so you cannot expect a value, once tested, to be the same when you use it ... if you keep going back to the object. You need to copy/snapshot the values so you can make tests and decisions on values that aren't changing every time you refer to them.
The real-time world!
Annoyingly although we don't see this problem (it is a fault that's sensitive to the hardware you're on I guess) our Pharmaceutical customer sees it every time they run our workflow ... so they're blocked.
Should be solved now.
Brilliant - I assume I need to wait for a release of some sort, something I can use in NXF_VER
?
you need to wait for the next release
@pditommaso Is it possible to get a release with this fix in the short term as we've got a customer who is blocked by this? Also, shouldn't the fix also include the suggestion from @alanbchristie that might address the underlying problem:
long freeSwapSpaceSize = os.freeSwapSpaceSize
freeSwapSpaceSize>=0 ? new MemoryUnit(freeSwapSpaceSize) : '-'
Hi Tim, we do an edge release once a month and a stable one twice a year (Apr, Oct).
ps. happy to assist your customer through Seqera Labs
Solved into 21.03.0-edge
Bug report
Failure observed using nextflow: -
20.10.0
21.02.0-edge
Running nextflow in a container image (
FROM python:3.8.6-buster
) in AWS/EKS as part of ak8s
workflow.Expected behavior and actual behavior
When running nextflow I do not expect to see errors, especially asserts, relating to the environment.
But we do observe intermittent instances of
Memory unit cannot be a negative number
assertions in the early stages of nextflow execution. The value reported varies, and we have seen...-4096
-10240
-106496
Two examples are shown below: -
Steps to reproduce the problem
We run our nextflow workflow on AWS/EKS and most of the time it fails.
Program output
Environment
20.10.0
,21.02.0-edge
openjdk version "11.0.9.1" 2020-11-04
GNU bash, version 5.0.3(1)-release (x86_64-pc-linux-gnu)
Additional context