purpleidea / mgmt

Next generation distributed, event-driven, parallel config management!
https://purpleidea.com/tags/mgmtconfig/
GNU General Public License v3.0
3.47k stars 308 forks source link

converged-timeout is currently broken #743

Open ffrank opened 4 months ago

ffrank commented 4 months ago

Versions:

Linux jammy 5.15.0-73-generic #80-Ubuntu SMP Mon May 15 15:18:26 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Description:

The --converged-timeout value is not applied. When supplied, mgmt begins shutting down immediately after assuming a converged state for the first time.

$ ./mgmt run --converged-timeout=3600 lang examples/lang/file0.mcl
2024/03/05 15:41:32 cli: lang: lexing/parsing...
2024/03/05 15:41:32 cli: lang: init...
2024/03/05 15:41:32 cli: lang: interpolating...
2024/03/05 15:41:32 cli: lang: building scope...
2024/03/05 15:41:32 cli: lang: running type unification...
2024/03/05 15:41:32 cli: lang: input: examples/lang/file0.mcl
2024/03/05 15:41:32 cli: lang: tree:
/
├── file0.mcl
└── metadata.yaml
This is: mgmt, version: 0.0.24-199-g296fc484
Copyright (C) 2013-2024+ James Shubin and the project contributors
Written by James Shubin <james@shubin.ca> and the project contributors
15:41:32 main: start: 1709653292375592509
15:41:32 main: working prefix is: /home/felix/.cache/mgmt/
15:41:32 PGP: Imported key: B7BFCF22
15:41:32 main: no seeds specified!
15:41:32 etcd: running...
15:41:32 etcd: watching chooser...
15:41:32 etcd: bootstrapping...
15:41:32 etcd: nominated: jammy=http://localhost:2380
15:41:32 etcd: waiting for server...
15:41:32 etcd: server: runServer: (newCluster=true): jammy=http://localhost:2380
15:41:32 etcd: server: starting...
15:41:33 etcd: server: ready
15:41:33 etcd: connect...
15:41:33 etcd: connected!
15:41:33 etcd: watching nominees...
15:41:33 etcd: nominated: jammy=http://localhost:2380
15:41:33 etcd: watching volunteers...
15:41:33 etcd: volunteering...
15:41:33 etcd: volunteers: jammy=http://localhost:2380
15:41:33 etcd: chooser: (jammy=http://localhost:2380)/(jammy=http://localhost:2380)
15:41:33 etcd: chooser result(+/-): []/[]
15:41:33 etcd: watching endpoints...
15:41:33 main: etcd is ready!
15:41:33 main: waiting...
15:41:33 etcd: volunteers: jammy=http://localhost:2380
15:41:33 etcd: chooser: (jammy=http://localhost:2380)/(jammy=http://localhost:2380)
15:41:33 etcd: chooser result(+/-): []/[]
15:41:33 main: running...
15:41:33 main: waiting...
15:41:33 gapi: generating new graph...
15:41:33 gapi: swap!
15:41:33 gapi: lang: lexing/parsing...
15:41:33 gapi: lang: init...
15:41:33 gapi: lang: interpolating...
15:41:33 gapi: lang: building scope...
15:41:33 gapi: lang: running type unification...
15:41:33 gapi: lang: building function graph...
15:41:33 gapi: lang: function engine initializing...
15:41:33 gapi: lang: function engine starting...
15:41:33 gapi: lang: stream...
15:41:33 gapi: lang: funcs: aggregated send
15:41:33 gapi: generating new graph...
15:41:33 gapi: lang: running interpret...
15:41:33 main: new graph took: 94.156µs
15:41:33 engine: autoedge: adding autoedges...
15:41:33 main: auto edges took: 75.351µs
15:41:33 engine: autogroup: algorithm: wrappedGrouper: NonReachabilityGrouper...
15:41:33 main: auto grouping took: 41.668µs
15:41:33 main: send/recv building took: 801ns
15:41:33 main: commit...
15:41:33 engine: graph sync...
15:41:33 main: graph: Vertices(2), Edges(0)
15:41:33 main: waiting...
15:41:33 main: converged for 3600 seconds, exiting!
15:41:33 engine: file[/tmp/dir1/]: Working...
15:41:33 engine: file[/tmp/dir1/]: CheckApply(true)
15:41:33 engine: file[/tmp/dir1/]: CheckApply(true): Return(true, <nil>)
15:41:33 engine: file[/tmp/file1]: Working...
15:41:33 engine: file[/tmp/file1]: CheckApply(true)
15:41:33 engine: file[/tmp/file1]: CheckApply(true): Return(true, <nil>)
15:41:33 main: destroy...
15:41:33 main: deploy: exited
15:41:33 main: loop: exited
15:41:33 engine: graph sync...
15:41:33 engine: file[/tmp/file1]: Exited...
15:41:33 engine: file[/tmp/dir1/]: Exited...
15:41:33 etcd: destroy...
15:41:33 etcd: got chooser shutdown...
15:41:33 etcd: unvolunteering...
15:41:33 etcd: list of volunteers is empty
15:41:33 etcd: chooser: (jammy=http://localhost:2380)/()
15:41:33 etcd: chooser result(+/-): []/[jammy]
15:41:33 etcd: unnominated: shutting down 1 members...
15:41:33 etcd: unnominate: removing self...
15:41:33 etcd: list of nominations is empty
15:41:33 etcd: server: destroyServer...
15:41:33 etcd: server: runServer: done!
15:41:33 etcd: server: destroyServer: done!
15:41:33 main: goodbye!

Note especially:

15:41:32 main: start: 1709653292375592509
...
15:41:33 main: converged for 3600 seconds, exiting!
ffrank commented 4 months ago

From my reading: When the converger was rewritten, it got shiny new StartTimer() and ResetTimer() methods...but those do not seem to be called. Instead we add a "StateFn" that fires when we transition from not converged to converged, and immediately shuts down the main loop.

ffrank commented 4 months ago

As a side note, I think tests are currently passing only because FileRes and friends manage to finish their work before the shutdown hits. (I have added a shell test for my rewritten MountRes and this one does NOT manage to CheckApply before the premature finish.)

purpleidea commented 4 months ago

I am not surprised if this has rotted. Further more the converger package was probably some of the first code I wrote, and it is probably terrible and needs fixing. Thank you for letting me know!

ffrank commented 4 months ago

Upon further reading, I now think it wasn't the 2019 rewrite of the converger package, but rather the 2019 rewrite of the core algorithm in 253ed78 that made us lose those method calls 😅

purpleidea commented 4 months ago

Could be. I'll definitely have to dig into it at some point... My reprieve is that I don't think it's a very high-priority feature anymore, but if you have a real-world use case where it's required, please let me know. This could definitely be a "thing we add when a customer needs it" sort of deal.

ffrank commented 4 months ago

No I agree that it's a terrible crutch in our test shell scripts, and should be replaced with something more sustainable.