nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
2.76k stars 630 forks source link

Intermittent assert "Memory unit cannot be a negative number" on AWS/EKS #1902

Closed alanbchristie closed 3 years ago

alanbchristie commented 3 years ago

Bug report

Failure observed using nextflow: -

Running nextflow in a container image (FROM python:3.8.6-buster) in AWS/EKS as part of a k8s workflow.

Expected behavior and actual behavior

When running nextflow I do not expect to see errors, especially asserts, relating to the environment.

But we do observe intermittent instances of Memory unit cannot be a negative number assertions in the early stages of nextflow execution. The value reported varies, and we have seen...

Two examples are shown below: -

N E X T F L O W  ~  version 21.02.0-edge
Launching `/root/ansible/roles/combine/../../../nextflow/deduplicate.nf` [evil_nobel] - revision: f747b687a8
Memory unit cannot be a negative number. Expression: (value >= 0). Values: value = -4096
N E X T F L O W  ~  version 21.02.0-edge
Launching `/root/ansible/roles/combine/../../../nextflow/deduplicate.nf` [drunk_baekeland] - revision: 7510b6e6c1
Memory unit cannot be a negative number. Expression: (value >= 0). Values: value = -102400

Steps to reproduce the problem

We run our nextflow workflow on AWS/EKS and most of the time it fails.

There's a lot of data and on a customer's internal cluster so there is no access to the execution or data from outside the company running the workflow.

Program output

N E X T F L O W  ~  version 21.02.0-edge
Launching `/root/ansible/roles/combine/../../../nextflow/deduplicate.nf` [drunk_baekeland] - revision: 7510b6e6c1
Memory unit cannot be a negative number. Expression: (value >= 0). Values: value = -102400

Environment

Additional context

pditommaso commented 3 years ago

Please include the .nextflow.log file

alanbchristie commented 3 years ago

Here's the entire log, retrieved from the most recent failure...

Feb-18 14:36:06.044 [main] DEBUG nextflow.cli.Launcher - $> nextflow run -c /root/ansible/roles/combine/../../../nextflow/nextflow.config /root/ansible/roles/combine/../../../nextflow/deduplicate.nf -w /work/nextflow -with-report /work/logs/deduplicate_edges.csv_report.html
Feb-18 14:36:06.172 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 21.02.0-edge
Feb-18 14:36:06.211 [main] INFO  nextflow.cli.CmdRun - Launching `/root/ansible/roles/combine/../../../nextflow/deduplicate.nf` [stoic_meitner] - revision: 4f84699cac
Feb-18 14:36:06.231 [main] DEBUG nextflow.config.ConfigBuilder - Found config home: /root/.nextflow/config
Feb-18 14:36:06.232 [main] DEBUG nextflow.config.ConfigBuilder - Found config base: /root/nextflow/nextflow.config
Feb-18 14:36:06.237 [main] DEBUG nextflow.config.ConfigBuilder - User config file: /root/ansible/roles/combine/../../../nextflow/nextflow.config
Feb-18 14:36:06.237 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /root/.nextflow/config
Feb-18 14:36:06.238 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /root/nextflow/nextflow.config
Feb-18 14:36:06.238 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /root/nextflow/nextflow.config
Feb-18 14:36:06.270 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Feb-18 14:36:07.241 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Feb-18 14:36:07.260 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Feb-18 14:36:07.316 [main] DEBUG nextflow.plugin.PluginsFacade - Setting up plugin manager > mode=prod; plugins-dir=/root/.nextflow/plugins
Feb-18 14:36:07.318 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins default=[]
Feb-18 14:36:07.321 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins local root: .nextflow/plr/empty
Feb-18 14:36:07.331 [main] INFO  org.pf4j.DefaultPluginStatusProvider - Enabled plugins: []
Feb-18 14:36:07.332 [main] INFO  org.pf4j.DefaultPluginStatusProvider - Disabled plugins: []
Feb-18 14:36:07.338 [main] INFO  org.pf4j.DefaultPluginManager - PF4J version 3.4.1 in 'deployment' mode
Feb-18 14:36:07.351 [main] INFO  org.pf4j.AbstractPluginManager - No plugins
Feb-18 14:36:07.432 [main] DEBUG nextflow.Session - Session uuid: 790b7728-e222-441f-ad38-3c7016248bd5
Feb-18 14:36:07.433 [main] DEBUG nextflow.Session - Run name: stoic_meitner
Feb-18 14:36:07.434 [main] DEBUG nextflow.Session - Executor pool size: 2
Feb-18 14:36:07.479 [main] ERROR nextflow.cli.Launcher - @unknown
java.lang.AssertionError: Memory unit cannot be a negative number. Expression: (value >= 0). Values: value = -102400
    at org.codehaus.groovy.runtime.InvokerHelper.assertFailed(InvokerHelper.java:438)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.assertFailed(ScriptBytecodeAdapter.java:670)
    at nextflow.util.MemoryUnit.<init>(MemoryUnit.groovy:62)
    at nextflow.cli.CmdInfo.freeSwap(CmdInfo.groovy:173)
    at nextflow.cli.CmdInfo.getInfo(CmdInfo.groovy:192)
    at nextflow.cli.CmdInfo.status(CmdInfo.groovy:157)
    at nextflow.cli.CmdRun.run(CmdRun.groovy:290)
    at nextflow.cli.Launcher.run(Launcher.groovy:475)
    at nextflow.cli.Launcher.main(Launcher.groovy:657)
alanbchristie commented 3 years ago

Just a thought ...

The error looks impossible because it looks like you're checking the value before you pass it into the object that does the actual assert!

... except ...

Is it actually safe to revisit the os.freeSwapSpaceSize property after you've tested it?

I'm not familiar with the OperatingSystemMXBean object but is it safe to expect a naturally dynamic operating system property to remain the same between the time it takes to check it and then use it ... without copying the value first?

In CmdInfo.groovy (I assume I'm looking at the latest code) we have...

os.freeSwapSpaceSize>=0 ? new MemoryUnit(os.freeSwapSpaceSize) : '-'

But doesn't this make an assumption ... that os.freeSwapSpaceSize isn't actually changing in the background? i.e. the assumption here is that: no matter how long you wait, the value will always be the same.

I know the two value references are on the same line but they could actually have very different values between being checked and used ... especially as a) the line is probably not atomic (i.e. can be interrupted) and b) you're not in a real-time OS (so significant time can elapse between the check and the use).

And what happens if changes to os.freeSwapSpaceSize are not atomic? They should be, but what if the underlying implementation (very OS-specific) is updating a multi-byte value in a non-atomic fashion? That's where you can occasionally observe negative values.

Isn't is safer (unless you know how the object behaves) to do something like......

long freeSwapSpaceSize = os.freeSwapSpaceSize
freeSwapSpaceSize>=0 ? new MemoryUnit(freeSwapSpaceSize) : '-'
pditommaso commented 3 years ago

Damn, it's really annoying this thing. No idea why it's happening but at this point make sense to wrap into a try catch statement

alanbchristie commented 3 years ago

Yep...or copy the value.

I'm only guessing but I can't explain the error any other way ... the underlying data in this case is probably volatile and ever-changing, so you cannot expect a value, once tested, to be the same when you use it ... if you keep going back to the object. You need to copy/snapshot the values so you can make tests and decisions on values that aren't changing every time you refer to them.

The real-time world!

alanbchristie commented 3 years ago

Annoyingly although we don't see this problem (it is a fault that's sensitive to the hardware you're on I guess) our Pharmaceutical customer sees it every time they run our workflow ... so they're blocked.

pditommaso commented 3 years ago

Should be solved now.

alanbchristie commented 3 years ago

Brilliant - I assume I need to wait for a release of some sort, something I can use in NXF_VER?

pditommaso commented 3 years ago

you need to wait for the next release

tdudgeon commented 3 years ago

@pditommaso Is it possible to get a release with this fix in the short term as we've got a customer who is blocked by this? Also, shouldn't the fix also include the suggestion from @alanbchristie that might address the underlying problem:

long freeSwapSpaceSize = os.freeSwapSpaceSize
freeSwapSpaceSize>=0 ? new MemoryUnit(freeSwapSpaceSize) : '-'
pditommaso commented 3 years ago

Hi Tim, we do an edge release once a month and a stable one twice a year (Apr, Oct).

pditommaso commented 3 years ago

ps. happy to assist your customer through Seqera Labs

pditommaso commented 3 years ago

Solved into 21.03.0-edge