maxpert / marmot

A distributed SQLite replicator built on top of NATS
https://maxpert.github.io/marmot/
MIT License

No support for NATS connection retries #87

Closed: TylerGillson closed this issue 1 year ago

TylerGillson commented 1 year ago

It would be useful to support configuration options for NATS connection retries, to cover the case where a Marmot follower is initialized slightly before its leader. In our use case, two hosts are provisioned simultaneously and the follower host occasionally lags behind the leader by as much as a few minutes.
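To illustrate the kind of behavior being requested, here is a minimal sketch of a bounded retry loop with exponential backoff around an initial connection attempt. It is not Marmot's implementation: `retryConfig`, `connectWithRetry`, and `errUnreachable` are hypothetical names, and the `connect` callback stands in for whatever call opens the NATS connection. The nats.go client also has connection options such as `nats.RetryOnFailedConnect`, `nats.MaxReconnects`, and `nats.ReconnectWait`; whether and how Marmot could expose them through its configuration is an open question here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryConfig is a hypothetical set of options; real Marmot config keys may differ.
type retryConfig struct {
	MaxAttempts int           // total attempts before giving up (values < 1 are treated as 1)
	BaseDelay   time.Duration // wait before the second attempt
	MaxDelay    time.Duration // upper bound on the wait between attempts (0 = no cap)
}

// errUnreachable stands in for a dial error such as "no route to host".
var errUnreachable = errors.New("no route to host")

// connectWithRetry calls connect until it succeeds or MaxAttempts is reached,
// doubling the wait after each failed attempt.
func connectWithRetry(cfg retryConfig, connect func() error) error {
	if cfg.MaxAttempts < 1 {
		cfg.MaxAttempts = 1
	}
	delay := cfg.BaseDelay
	var err error
	for attempt := 1; attempt <= cfg.MaxAttempts; attempt++ {
		if err = connect(); err == nil {
			return nil
		}
		if attempt == cfg.MaxAttempts {
			break // no point sleeping after the final failure
		}
		time.Sleep(delay)
		delay *= 2
		if cfg.MaxDelay > 0 && delay > cfg.MaxDelay {
			delay = cfg.MaxDelay
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", cfg.MaxAttempts, err)
}

func main() {
	// Simulate a leader that only becomes reachable on the third attempt.
	calls := 0
	dial := func() error {
		calls++
		if calls < 3 {
			return errUnreachable
		}
		return nil
	}

	cfg := retryConfig{MaxAttempts: 5, BaseDelay: 10 * time.Millisecond, MaxDelay: time.Second}
	if err := connectWithRetry(cfg, dial); err != nil {
		fmt.Println("connect failed:", err)
		return
	}
	fmt.Println("connected after", calls, "attempts")
}
```

With a delay cap of a few seconds and enough attempts to cover a few minutes, a follower could wait for its leader inside the process rather than panicking and relying on the service manager to restart it.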

Marmot follower error when attempting to connect prior to leader initialization:

Oct 17 00:40:04 two-node-two marmot[1693]: 12:40AM DBG Opening database node_id=2197861447266130575 path=/var/lib/rancher/k3s/server/db/state.db
Oct 17 00:40:04 two-node-two marmot[1693]: 12:40AM DBG Forcing WAL checkpoint node_id=2197861447266130575
Oct 17 00:40:07 two-node-two marmot[1693]: 12:40AM PNC Unable to initialize snapshot storage error="dial tcp X.X.X.X:4222: connect: no route to host" node_id=2197861447266130575
Oct 17 00:40:07 two-node-two marmot[1693]: panic: Unable to initialize snapshot storage
Oct 17 00:40:07 two-node-two marmot[1693]: goroutine 1 [running]:
Oct 17 00:40:07 two-node-two marmot[1693]: github.com/rs/zerolog/log.Panic.(*Logger).Panic.func1({0x1158d4c?, 0x0?})
Oct 17 00:40:07 two-node-two marmot[1693]:         /home/runner/go/pkg/mod/github.com/rs/zerolog@v1.29.1/log.go:376 +0x27
Oct 17 00:40:07 two-node-two marmot[1693]: github.com/rs/zerolog.(*Event).msg(0xc000282300, {0x1158d4c, 0x25})
Oct 17 00:40:07 two-node-two marmot[1693]:         /home/runner/go/pkg/mod/github.com/rs/zerolog@v1.29.1/event.go:156 +0x2c2
Oct 17 00:40:07 two-node-two marmot[1693]: github.com/rs/zerolog.(*Event).Msg(...)
Oct 17 00:40:07 two-node-two marmot[1693]:         /home/runner/go/pkg/mod/github.com/rs/zerolog@v1.29.1/event.go:108
Oct 17 00:40:07 two-node-two marmot[1693]: main.main()
Oct 17 00:40:07 two-node-two marmot[1693]:         /home/runner/work/marmot/marmot/marmot.go:66 +0x70a
Oct 17 00:40:07 two-node-two systemd[1]: marmot.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 17 00:40:07 two-node-two systemd[1]: marmot.service: Failed with result 'exit-code'.
Oct 17 00:40:07 two-node-two systemd[1]: marmot.service: Scheduled restart job, restart counter is at 2.
Oct 17 00:40:07 two-node-two systemd[1]: Stopped Marmot synchronizes the k8s state in SQLite between nodes in a two node topology.
Oct 17 00:40:07 two-node-two systemd[1]: Started Marmot synchronizes the k8s state in SQLite between nodes in a two node topology.
Oct 17 00:40:07 two-node-two marmot[1699]: 12:40AM DBG Opening database node_id=2197861447266130575 path=/var/lib/rancher/k3s/server/db/state.db
Oct 17 00:40:07 two-node-two marmot[1699]: 12:40AM DBG Forcing WAL checkpoint node_id=2197861447266130575
Oct 17 00:40:10 two-node-two marmot[1699]: 12:40AM PNC Unable to initialize snapshot storage error="dial tcp X.X.X.X:4222: connect: no route to host" node_id=2197861447266130575