superfly / litefs

FUSE-based file system for replicating SQLite databases across a cluster of machines
Apache License 2.0

Proxy timeout when auto-scaling machines #430

Open andig opened 7 months ago

andig commented 7 months ago

Migrated from https://github.com/superfly/litefs-example/issues/10

I'm new to LiteFS and following the example tutorial. The demo application comes up on Fly but is unstable: it serves a few requests and then sporadically fails with

Proxy timeout

This seems to be caused by the app auto-scaling the second, cloned machine away:

2024-02-24T10:25:45Z runner[e784966ce65358] ams [info]Machine created and started in 17.176s
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]config file read from /etc/litefs.yml
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]LiteFS v0.5.11, commit=63eab529dc3353e8d159e097ffc4caa7badb8cb3
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]level=INFO msg="host environment detected" type=fly.io
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]level=INFO msg="no backup client configured, skipping"
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]level=INFO msg="Using static primary: primary=true hostname= advertise-url=http://primary:20202"
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]level=INFO msg="48F8DDD307D7E3CB: primary lease acquired, advertising as http://primary:20202"
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]level=INFO msg="set cluster id on \"static\" lease \"LFSC85E68DD3249EC869\""
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]level=INFO msg="LiteFS mounted to: /litefs"
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]level=INFO msg="http server listening on: http://localhost:20202"
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]level=INFO msg="waiting to connect to cluster"
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]level=INFO msg="connected to cluster, ready"
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]level=INFO msg="proxy server listening on: http://localhost:8080"
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]level=INFO msg="starting background subprocess: litefs-example [-addr :8081 -dsn /litefs/db]"
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]waiting for signal or subprocess to exit
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]database opened at /litefs/db
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]level=INFO msg="database file is zero length on initialization: /var/lib/litefs/dbs/db/database"
2024-02-24T10:25:45Z app[e784966ce65358] ams [info]http server listening on :8081
2024-02-24T10:29:39Z proxy[7842043c462768] ams [info]Downscaling app litefs-example-2 from 2 machines to 1 machines, stopping machine 7842043c462768 (region=ams, process group=app)
2024-02-24T10:29:39Z app[7842043c462768] ams [info] INFO Sending signal SIGINT to main child process w/ PID 313
2024-02-24T10:29:39Z app[7842043c462768] ams [info]sending signal to exec process
2024-02-24T10:29:39Z app[7842043c462768] ams [info]waiting for exec process to close
2024-02-24T10:29:39Z app[7842043c462768] ams [info]signal received, litefs shutting down
2024-02-24T10:29:39Z app[7842043c462768] ams [info]litefs shut down complete
2024-02-24T10:29:39Z app[7842043c462768] ams [info]level=INFO msg="FE9F1F5AB015D813: exiting primary, destroying lease"
2024-02-24T10:29:40Z app[7842043c462768] ams [info] INFO Main child exited normally with code: 0
2024-02-24T10:29:40Z app[7842043c462768] ams [info] INFO Starting clean up.
2024-02-24T10:29:40Z app[7842043c462768] ams [info] INFO Umounting /dev/vdb from /var/lib/litefs
2024-02-24T10:29:40Z app[7842043c462768] ams [info] WARN hallpass exited, pid: 314, status: signal: 15 (SIGTERM)
2024-02-24T10:29:40Z app[7842043c462768] ams [info]2024/02/24 10:29:40 listening on [fdaa:0:30b1:a7b:242:c65e:fbdc:2]:22 (DNS: [fdaa::3]:53)
2024-02-24T10:29:41Z app[7842043c462768] ams [info][  375.687714] reboot: Restarting system

I assume this is due to the proxy still trying to route requests to the machine that is being scaled down. If so, the second machine would actually decrease availability, at least while auto-scaling is enabled.

Is this the expected behaviour of the proxy?

benbjohnson commented 7 months ago

We generally don't recommend autostopping machines with LiteFS. Is this the tutorial you were following?
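For reference, Fly's auto-stop behaviour is controlled by the `http_service` section of the app's fly.toml. A minimal sketch of how one might disable it, using Fly.io's standard settings; the concrete values are illustrative and assume the example's port layout:

```toml
# fly.toml (sketch) -- disable Fly's machine auto-stop so LiteFS nodes
# are not scaled away mid-request. Keys are standard [http_service]
# settings; the values here are assumptions, not the example's exact config.
[http_service]
  internal_port = 8080        # the LiteFS proxy port from the logs above
  auto_stop_machines = false  # don't stop idle machines
  auto_start_machines = true
  min_machines_running = 2    # assumption: keep both cluster nodes up
```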

andig commented 7 months ago

I was following this and then went on to https://github.com/superfly/litefs-example/tree/main/fly-io-config. I noticed the problem when one of the small machines was stopped for being idle.

> We generally don't recommend autostopping machines with LiteFS.

I was assuming that the LiteFS proxy (or the Fly load balancer?) would use the Consul information to direct requests to a running instance. Maybe that's just not how it works.
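For context, the routing in question is handled by LiteFS's built-in proxy, configured via the `proxy` section of litefs.yml. A minimal sketch of that section, assuming the ports shown in the logs above (:8080 for the proxy, :8081 for the app); the `passthrough` patterns are an assumption for illustration:

```yaml
# litefs.yml (sketch) -- the proxy section as used by the example.
# Field names follow the LiteFS proxy config; values assume the ports
# seen in the log output above.
proxy:
  addr: ":8080"            # "proxy server listening on" in the logs
  target: "localhost:8081" # the backgrounded litefs-example app
  db: "db"                 # database name under /litefs
  passthrough:             # assumption: static assets bypass the proxy
    - "*.ico"
    - "*.png"
```

The proxy forwards read requests to the local `target` and redirects writes to the current primary node; whether it also coordinates with Fly's autoscaler when a machine is being stopped is exactly the question raised here.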