hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad fails to restart on windows nodes with "funlock error: The segment is already unlocked" error #10086

Open gulavanir opened 3 years ago

gulavanir commented 3 years ago

Nomad version

1.0.1

Operating system and Environment details

Windows

Issue

Nomad is unable to restart, failing with the following error:

2021-02-24T17:09:51.874Z [WARN] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=C:\ProgramData\nomad\plugins
2021-02-24T17:09:51.878Z [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2021-02-24T17:09:51.878Z [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2021-02-24T17:09:51.878Z [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2021-02-24T17:09:51.878Z [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0
2021-02-24T17:09:51.878Z [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2021-02-24T17:09:51.878Z [INFO] client: using state directory: state_dir=C:\ProgramData\nomad\client
2021/02/24 17:09:56.896817 bolt.Close(): funlock error: The segment is already unlocked.

Reproduction steps

There isn't a specific scenario that leads to this state, but we have seen it happen on multiple long-running clusters, specifically on the Windows nodes. The Nomad service on those nodes goes into a loop, trying to restart and failing with the same error each time.

Removing the client state DB file and restarting Nomad resolves the issue, but that requires manual intervention and the issue can reoccur.

shoenig commented 3 years ago

Hi @gulavanir, thanks for reporting.

Looks like this is probably a very old bug in the version of boltdb we use (v1.3.1 from June 2017). When the Nomad agent stops, it calls Close on the store and sometimes runs into the bug described in https://github.com/etcd-io/bbolt/issues/60.

I'm not sure what the implications of migrating to the etcd-io fork of boltdb are; we'll need to go through and investigate any other possible issues before making the switch (or cherry-picking the one fix).

idrennanvmware commented 3 years ago

thanks for looking into this @shoenig !!

For some context, about 75% of the nodes in a cluster (for us) are actually Windows nodes, so we expect to see this more and more as we roll things out. If there's any information you'd like us to gather that would help, please let us know what would be most useful.

Thanks!

raunak2004 commented 3 years ago

We ran into the same issue on one of our clusters and managed to run 'nomad operator debug -duration=5m -interval=30s -server-id=all -node-id=all'. Attaching the debug archive here:

nomad-debug-2021-03-09-014419Z.tar.gz

Thanks, Raunak

tgross commented 3 years ago

Hey @idrennanvmware which version of Windows are you deploying on? That might help us narrow down the behavior.

gulavanir commented 3 years ago

Hey @tgross we are using "Windows server core 2019 LTS"

notnoop commented 3 years ago

@raunak2004 @gulavanir Sorry for the long delay. I'm investigating this issue further now. The following will be immensely helpful for debugging the issue the next time it happens:

Thank you so much for your patience.

schmichael commented 2 years ago

As per Seth's original comment, it does seem like this has been fixed upstream; however, both Nomad and Consul (as of 1.1.0) are still on the unmaintained original boltdb package. We've found no evidence that switching could cause issues, but we've also not been able to test extensively.

notnoop commented 2 years ago

Want to clarify the etcd-io/bbolt issue, as I don't think it applies here. The issue https://github.com/etcd-io/bbolt/issues/60, fixed in https://github.com/etcd-io/bbolt/pull/64, was a regression in etcd-io/bbolt introduced in https://github.com/etcd-io/bbolt/pull/35/files#diff-1efe626cba25e2ad261690300496ba6e83057a4d24a338d5559e6f01f2fb770bR62-R72. One of the dangers of an actively developed project! The original boltdb library uses db.lockfile.Fd() properly for locking in https://github.com/boltdb/bolt/blob/v1.3.1/bolt_windows.go#L76-L90.

Also, I suspect the failure might be a red herring or a secondary factor. The "funlock error: The segment is already unlocked" log line is generated on Raft DB shutdown when agent startup has already been aborted. The triggering cause of the agent shutdown is typically included in the "Error starting agent" log line.

By starting the Nomad agent manually, I hope all of its output will be visible and will pinpoint the problem/solution.
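For example, from a PowerShell prompt on an affected node, something along these lines should surface the startup error directly (the config path here is an assumption, not necessarily your layout; point -config at whatever your Windows service definition actually uses):

nomad agent -config C:\ProgramData\nomad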

notnoop commented 2 years ago

I dug into this further, and I think I have a working theory. I believe there are at least three contributing factors, and I was able to reproduce some of the symptoms, though not the full problem.

Theory

My current theory is that this is a metastable failure: Nomad crashed for unexpected reasons (e.g. memory overload, segmentation fault, etc.) without releasing some OS locks, and then fails to restart until the OS file locks are released or the files are deleted.

The funlock log error is a symptom of the problem, not the trigger condition.

Details

The easy one is the mystery of the missing log line indicating why the agent fails to start. Until https://github.com/hashicorp/nomad/pull/11353, we didn't capture that error in the log files/syslog, which explains why your log captures are missing this line.

The second issue is Windows file-locking behavior: boltdb, our storage engine, uses the LockFileEx API to lock files. However, if we don't cleanly free the lock before termination, the file lock may still be held, which prevents Nomad from restarting. I can replicate some failure cases where the Nomad process doesn't invoke UnlockFileEx on failure. The documentation states:

If a process terminates with a portion of a file locked or closes a file that has outstanding locks, the locks are unlocked by the operating system. However, the time it takes for the operating system to unlock these locks depends upon available system resources. Therefore, it is recommended that your process explicitly unlock all files it has locked when it terminates. If this is not done, access to these files may be denied if the operating system has not yet unlocked them.

If the Nomad process restarts and tries to grab the lock, it may fail to do so until the OS unlocks the file. I suspect deleting the lock file manually would help.
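To make the locking behavior concrete, here is a minimal standalone sketch of the LockFileEx/UnlockFileEx pattern boltdb relies on. This is illustrative only, not Nomad's or boltdb's actual code; the lock-file name is a placeholder and it uses the golang.org/x/sys/windows package:

//go:build windows

// lockdemo.go: a sketch of acquiring and releasing an exclusive Windows file
// lock via LockFileEx/UnlockFileEx.
package main

import (
	"log"
	"os"

	"golang.org/x/sys/windows"
)

func main() {
	// Placeholder lock file; boltdb locks a companion ".lock" file next to
	// its data file on Windows.
	f, err := os.OpenFile("state.db.lock", os.O_CREATE|os.O_RDWR, 0600)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	h := windows.Handle(f.Fd())
	ol := new(windows.Overlapped)

	// Try to take an exclusive lock on one byte, failing immediately if the
	// range is already locked (e.g. by a crashed process whose lock the OS
	// has not released yet).
	err = windows.LockFileEx(h,
		windows.LOCKFILE_EXCLUSIVE_LOCK|windows.LOCKFILE_FAIL_IMMEDIATELY,
		0, 1, 0, ol)
	if err != nil {
		log.Fatalf("LockFileEx failed: %v (is the lock still held elsewhere?)", err)
	}
	log.Println("lock acquired")

	// Explicitly release the lock before exiting; if the process dies here
	// instead, the OS releases the lock eventually, but possibly not before
	// a restarted agent retries and times out.
	if err := windows.UnlockFileEx(h, 0, 1, 0, ol); err != nil {
		// Unlocking a range this handle never locked fails with an error in
		// the "The segment is already unlocked" family.
		log.Printf("UnlockFileEx failed: %v", err)
	}
}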

Thirdly, when boltdb fails to get the lock, it invokes funlock/UnlockFileEx in https://github.com/boltdb/bolt/blob/v1.3.1/db.go#L178-L189 and https://github.com/boltdb/bolt/blob/v1.3.1/db.go#L425-L428. However, the funlock fails, because this handle never grabbed the lock in the first place, and that failure generates the spurious log line.
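Here is a small sketch of that sequence, written under assumptions: it is meant to run on Windows against github.com/boltdb/bolt v1.3.1 (the version Nomad uses), and "state.db" is a placeholder path. The second Open times out waiting for the lock, and its cleanup path then prints the spurious funlock message even though it never held the lock:

// funlockdemo.go: sketch of the failed-open path described above.
package main

import (
	"log"
	"time"

	"github.com/boltdb/bolt"
)

func main() {
	// The first open succeeds and holds the exclusive lock on the companion
	// state.db.lock file.
	holder, err := bolt.Open("state.db", 0600, nil)
	if err != nil {
		log.Fatalf("first open: %v", err)
	}
	defer holder.Close()

	// A second open of the same database cannot acquire the lock and gives
	// up after the timeout. On that error path boltdb still calls funlock,
	// which fails with "The segment is already unlocked." because this
	// handle never held the lock -- that is the spurious log line.
	_, err = bolt.Open("state.db", 0600, &bolt.Options{Timeout: 2 * time.Second})
	log.Printf("second open failed as expected: %v", err)
}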

Ways to validate

The following info can validate whether the theory is likely:

We still need to identify the cause of the initial Nomad crash, though we must also ensure that file locks are always freed so Nomad can recover well.

I will provide more documentation/details tomorrow hopefully with reproduction steps.

notnoop commented 2 years ago

Following up to report my reproduction. I have confirmed that the funlock error gets generated when Nomad fails to lock the file.

On a Windows machine, I started two agents in separate terminals. First, I started a server/client agent with the following command:

nomad agent -server -client -bootstrap-expect=1 -data-dir $env:temp\data

Then, in a second PowerShell terminal, I started a client using the same data-dir, which should fail because the first agent holds the lock on that directory:

nomad agent -client -data-dir $env:temp\data
==> No configuration files loaded
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to initialize client: failed to open state database: timed out while opening database, is another Nomad process accessing data_dir C:\Users\Mahmood\AppData\Local\Temp\data\client?
    2021-10-21T14:27:02.295-0400 [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=C:\Users\Mahmood\AppData\Local\Temp\data\plugins
    2021-10-21T14:27:02.308-0400 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2021-10-21T14:27:02.308-0400 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2021-10-21T14:27:02.308-0400 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2021-10-21T14:27:02.308-0400 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2021-10-21T14:27:02.308-0400 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2021-10-21T14:27:02.309-0400 [INFO]  client: using state directory: state_dir=C:\Users\Mahmood\AppData\Local\Temp\data\client
    2021/10/21 14:27:07.319398 bolt.Close(): funlock error: The segment is already unlocked.

Note that the actual error above is "failed to open state database: timed out while opening database, is another Nomad process accessing data_dir C:\Users\Mahmood\AppData\Local\Temp\data\client?", but it gets dropped from the log files because of the logging gap addressed in #11353.

Additionally, I've tried to trigger the crash-without-freeing-the-lock scenario independently, without luck. I ran the following program in a loop, with some iterations simulating a crash, but never saw the flock call fail:

go build .
ForEach($n in 1..100000) { .\winflock.exe -iter $n 2>&1 | tee -a output}
main.go:

package main

import (
	"flag"
	"fmt"
	"log"
	"math/rand"
	"os"
	"time"

	"github.com/boltdb/bolt"
)

func main() {
	path := flag.String("path", "./db", "path to database")
	iter := flag.Int("iter", -1, "iteration count")
	wait := flag.Duration("wait", 0, "time before shutting down")
	flag.Parse()

	if *iter == -1 {
		rand.Seed(time.Now().Unix())
		*iter = rand.Intn(32)
	}

	log.SetPrefix(fmt.Sprintf("iter=%03d ", *iter))
	log.Printf("starting path=%s wait=%s", *path, *wait)

	db, err := bolt.Open(*path, 0600, &bolt.Options{Timeout: 5 * time.Second})
	log.Printf("bolt.Open completed err=%v", err)
	if err != nil {
		return
	}

	// simulate an unexpected crash without cleanup
	if *iter%8 == 0 {
		log.Printf("CRASHING")
		os.Exit(2)
	}

	if *wait > 0 {
		time.Sleep(*wait)
	}

	db.Close()
	log.Printf("stopping")
}

When this issue happens again, could you also inspect the running processes and see if there is a rogue Nomad agent that still holds the file lock on the data dir?