superfly / litefs

FUSE-based file system for replicating SQLite databases across a cluster of machines
Apache License 2.0
4.05k stars 96 forks

Considering LiteFS for production #259

Open andyhorng opened 1 year ago

andyhorng commented 1 year ago

I already use "Litestream" as my primary database solution with Fly.io. This architecture is very simple and I really like it. However, deployment takes too long, and I can't do a rolling restart because the volume needs to be unmounted and remounted on the new instance.

So, here is my question: I am aware that LiteFS is currently in beta and not recommended for production use, but in my case I believe it is worth trying. What potential problems or caveats might I encounter when using it in production, and how can I avoid them?

Maybe we can list out those potential problems and create a document for others who are also considering using LiteFS in production.

benbjohnson commented 1 year ago

That's a good idea. The biggest hurdle right now is that using Consul makes it so your app needs to know how to redirect requests to the primary and that can get a little wonky—especially when running migrations as part of a deployment.

I'm reworking the docs to focus more on using the static leasing (where a single node is always primary). That's much simpler to get working for application developers. The trade-off is that you'll have some write availability loss during deploy but we have improvements coming to apps to reduce deploy times significantly.

We also have some improvements to dynamic leasing with Consul coming in the near future, so it won't be such a pain. :)

gnat commented 1 year ago

Good direction focusing on static leasing. It's a much simpler setup/adoption story for a typical dev adopter to reconfigure manually per deployment (especially when deployments happen multiple times per day) than also having to provision Consul.

andyhorng commented 1 year ago

Sounds great! I'm excited to see your work. The static lease is a really good idea, I like the trade-off, and it's well-suited for most of the web apps I've developed.

I'm curious about how the static leasing will work. Does Fly provide any mechanism for us to have different environment variables if it's the primary node? That seems like a good way to implement it.

benbjohnson commented 1 year ago

Does Fly provide any mechanism for us to have different environment variables if it's the primary node?

We don't currently have a concept of a "primary node" inside Fly.io since many applications use our nodes ephemerally. We have talked about adding that, though.

The easiest way to run a static setup is to decide on a primary region and deploy one node there. Set an environment variable called PRIMARY_REGION to that region and then you can reference the primary as ${PRIMARY_REGION}.${FLY_APP_NAME}.internal.

We are also rolling out a new version of Fly Apps that will have more stable hostnames so you'll be able to reference those instead.

jwhear commented 1 year ago

I am trying to get this working right now, but the hostname option has to be set on the replica nodes to the hostname of the primary--something I don't know until after I've deployed on Fly.

-- EDIT -- I think I get it; you're suggesting to not fly-replay to a specific instance but to a region. So check for the .primary file but don't use the contents for now.

benbjohnson commented 1 year ago

I think I get it; you're suggesting to not fly-replay to a specific instance but to a region. So check for the .primary file but don't use the contents for now.

Yes, that's correct. ${PRIMARY_REGION}.${FLY_APP_NAME}.internal should work to reference the region (assuming you set the PRIMARY_REGION environment variable).

benbjohnson commented 1 year ago

@jwhear If you post the litefs.yml config file then I can give feedback on that too.

jwhear commented 1 year ago

This combo is working for me:

# Directory where the application will find the database
fuse:
  dir: "/data"

# Directory where LiteFS will actually place LTX files
data:
  dir: "/var/lib/litefs"
  retention: "1h"
  retention-monitor-interval: "5m"

lease:
  type: "static"
  candidate: ${FLY_REGION == "dfw"}
  advertise-url: "http://${PRIMARY_REGION}.${FLY_APP_NAME}.internal:20202"

# In production we want to exit on error and thus restart the instance
# On staging leave everything as-is so that we can SSH in and diagnose
exit-on-error: ${FLY_APP_NAME != "work-staging"}

In the app, if the request would cause a DB write and the .primary file exists, set fly-replay: region={PRIMARY_REGION} and send an empty response.
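
For reference, here's a minimal sketch of that check as Go net/http middleware. It's only an illustration, not the exact code from my app: it assumes the FUSE mount is /data, treats any non-GET/HEAD request as a write, and uses a placeholder handler and port:

package main

import (
    "fmt"
    "net/http"
    "os"
)

// replayToPrimary forwards write requests to the primary region via the
// fly-replay header when this node is a replica.
func replayToPrimary(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        isWrite := r.Method != http.MethodGet && r.Method != http.MethodHead
        if isWrite {
            // LiteFS exposes a ".primary" file on replica nodes; if it exists,
            // another node currently holds the primary lease.
            if _, err := os.Stat("/data/.primary"); err == nil {
                w.Header().Set("fly-replay", fmt.Sprintf("region=%s", os.Getenv("PRIMARY_REGION")))
                return // empty 200 response; Fly's proxy replays the request in the primary region
            }
        }
        next.ServeHTTP(w, r)
    })
}

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintln(w, "ok") // placeholder handler; real handlers would hit SQLite under /data
    })
    http.ListenAndServe(":8080", replayToPrimary(mux))
}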

benbjohnson commented 1 year ago

Nice. Yeah, that looks good. You can also compare two environment variables so you could do:

lease:
  candidate: ${FLY_REGION == PRIMARY_REGION}

walterwanderley commented 1 year ago

Thank you for this great project!

I can't use Consul, so I use LiteFS as a library and just created a Raft-based lease for leader election and request forwarding. This is incorporated into a boilerplate code generation tool called sqlc-grpc. It is experimental and I need to monitor it and check the trade-offs. Are there plans to embed an HA leasing system without external dependencies into LiteFS?

benbjohnson commented 1 year ago

Are there plans to embed an HA leasing system without external dependencies into LiteFS?

I'm not sure. Having run Raft-based clusters myself, I can say they're kind of a pain. Also, adding distributed consensus to your application nodes can be problematic when they're under high load, as they can lose leadership easily. I've found that moving the leasing system off the application nodes is usually a good approach.

walterwanderley commented 1 year ago

That makes sense. Thank you!

tjheeta commented 1 year ago

I'm planning a cutover of my staging environment shortly as 0.3 appears to be good enough feature-wise.

Some questions around reliability though:

  1. Has there been any data-consistency / long-term testing / production use? From my preliminary testing, the failover on node down seems very solid with Consul, and I haven't had any issues with replication so far on dev. Key word here is so far.
  2. When is the approximate ETA for 0.4? 1 month? 6 months? 1 year? I see there are a few fixes on 0.3 features in the trunk, so I suppose I could run off that.
  3. Will 0.4 be a drop-in replacement for 0.3?

benbjohnson commented 1 year ago

Has there been any data-consistency / long-term testing / production use?

We use LiteFS internally at Fly.io but we use it with the static lease because it works better for our particular setup. We also run a long-running chaos test with the Consul lease for each PR; it runs a geographically spread-out cluster where nodes are randomly killed every few minutes.

@kentcdodds has been running with the Consul lease for a while too. He may be able to chime in with additional information.

It's always good to have additional backups though. If you're running on Fly.io, we take automatic snapshots every 24 hours. LiteFS v0.4.0 will also have a litefs export command to perform a fast snapshot to disk. That can be good if you want to take more frequent backups.

When is the approximate ETA for 0.4?

I'm expecting the v0.4.0 release to go out early next week. I was expecting to get it out in March but we had some internal projects that I needed to jump on temporarily.

Will 0.4 be a drop-in replacement for 0.3?

From a data file perspective, yes. There's nothing to upgrade in that sense. However, some backwards incompatible internal API changes needed to be made to support compression (#132) so you can't run a mixed-version cluster (e.g. v0.3.0 nodes with v0.4.0 nodes).

There are also some minor cosmetic changes in the litefs.yml configuration file and the FUSE library has been upgraded so it requires fuse3 instead of just regular fuse on your VM.

kentcdodds commented 1 year ago

I forgot that I scaled down to 1 node a while back so most of the time I've been running Consul with a single node and infrequent deploys. I just scaled up to 9 instances and it started up without issue. I'll report back if I have any trouble. I'm currently running sha-9ff02a3 which is some iteration of v0.4.0 (mostly because I like and use the http-proxy). With all that context, I've been running issue-free for a couple months. Happy with that. I'll let you know if I run into any bumps now that I'm back to multi-instance.

tjheeta commented 1 year ago

@kentcdodds - cool. A few more questions if you have the time:

  1. Are you on nomad or on machines? I'm trying to figure out if I need to run my own Consul cluster right now. There seems to be some GraphQL API that allows it to be enabled.
  2. What's the approximate write QPS that you're dealing with?

kentcdodds commented 1 year ago

@tjheeta, first I'll say that I'm a bit in over my head when it comes to infra. I just have very particular requirements I have placed upon myself and I suppose that comes with being the guinea pig for some new tech even when I don't really know what I'm doing.

My use case is my personal website: https://kentcdodds.com

So while I am pretty small scale, don't close the tab yet. I actually have some pretty unique features to my site and I get a lot of traffic for a personal website. I also publish my analytics: https://kcd.im/kcd-fathom as well as my real-world performance metrics: https://kcd.im/kcd-metronome

My site is also completely open source so you can take a look at the source code as well if you like: https://github.com/kentcdodds/kentcdodds.com

With that context...

  1. I'm running on regular virtual machines on Fly. I'm not hosting my own consul cluster.
  2. I don't have a good way to measure this unfortunately (I'm sure someone smarter than me would be able to gaze into some SQLite crystal ball and divine this metric though). I can tell you that during regular operation (not after publishing a blog post or hitting Hacker News) I was serving up to 15 HTTP requests per second when I was running on a single node, and I'm guessing that each network request results in anywhere from 6-24 queries. So if we do some back-of-the-napkin math (15 × 6-24 ≈ 90-360 queries/sec), we're only in the hundreds of queries per second range.

So yeah, not very large scale I'm afraid. I look forward to more people throwing heavier scale at LiteFS. I'm confident it can handle it.

I hope that's helpful!

kentcdodds commented 1 year ago

Oh, I should also mention that I've got two databases in LiteFS on my site. The cache is probably in the range of upper hundreds of queries per second, possibly thousands. Still, smaller scale, but definitely more than "simple blog" level stuff I think :)

benbjohnson commented 1 year ago

@tjheeta I've worked with Kent a bit on debugging his site since he was a super early adopter of LiteFS so I can try to answer a bit.

Are you nomad or on machines?

He's on nomad (apps v1) so he's using the multi-tenant Consul cluster we provide. We'll be making that available on machines & apps v2 soon but I don't have an ETA for that.

What's the approximate write qps that you're dealing with?

Kent answered this already but I'll give a little extra info. LiteFS on FUSE is typically good for tens of write transactions per second because the FUSE overhead of the syscalls is fairly high.

I am planning to make a VFS version of LiteFS available in the near future. That will eliminate that syscall overhead but I would expect throughput to be about 50% of the normal SQLite write throughput since it has double the writes (once to the WAL & once to the LTX transaction file). I think that should handle in the thousands of writes per second once it's decently optimized. My plan is to make it so you simply run a load_extension() and all the other LiteFS configuration is the same.

tjheeta commented 1 year ago

Thanks for all the information. Staging environment is running on litefs now out of 3 regions and using the multi-tenant consul provided by fly on machines. Not sure that I should be, but I am. There was a bit of a hiccup bringing a cloned machine up, but I could not isolate it to litefs.

LiteFS on FUSE is typically good for tens of write transactions per second because the FUSE overhead of the syscalls is fairly high.

That is surprisingly low, however, still should work for my use case.

He's on nomad (apps v1) so he's using the multi-tenant Consul cluster we provide. We'll be making that available on machines & apps v2 soon but I don't have an ETA for that.

Found out that it is possible to enable the Consul cluster on machines via https://api.fly.io/graphql. This doesn't set FLY_CONSUL_URL for you, but the returned URL is usable.

Request:

 mutation{
   enablePostgresConsul(input:{appId: "someappid"}) {
     consulUrl
   }
 }

Response:

 {
   "data": {
     "enablePostgresConsul": {
       "consulUrl": "https://someurl"
     }
   }
 }

However, given all the recent hubbub around reliability/consul, should the multi-tenant fly consul be used?

benbjohnson commented 1 year ago

That is surprisingly low, however, still should work for my use case.

Yeah, the long-term aim is to provide a seamless experience for smaller applications and those tend to have very low write throughput. Honestly, many web applications are less than 1 write/sec. Reads are still quite fast as SQLite can leverage the OS page cache.

Medium-sized applications that need higher write throughput will need to load an extension, which isn't too hard, but also not quite as seamless.

However, given all the recent hubbub around reliability/consul, should the multi-tenant fly consul be used?

We don't have any plans to discontinue the multi-tenant Consul support. Multi-tenancy always has its trade-offs though so you may have better reliability if you ran your own Consul app.

tjheeta commented 1 year ago

I'm looking to have the primary stay only in a few regions. I see there are a few issues that may do what I'm looking for:

https://github.com/superfly/litefs/issues/176 https://github.com/superfly/litefs/issues/178

It doesn't look like there's an IN operator right now:

candidate: ${FLY_REGION in ["dfw", "ord", "abc"]}

but is there an OR operator currently?

candidate: ${FLY_REGION == "dfw"} ||  ${FLY_REGION == "ord"} ||  ${FLY_REGION == "abc"}

benbjohnson commented 1 year ago

I'm hesitant to add more complicated expression parsing because it opens up a can of worms (e.g. if there's OR then there should be AND and probably parentheses). I think the best option is to set an environment variable in a bash script and then embed that in the config.

#!/bin/bash
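# Only the listed regions may become the LiteFS primary; litefs.yml reads
# the result via ${LITEFS_CANDIDATE} below.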

if [ "$FLY_REGION" = "dfw" ] || [ "$FLY_REGION" = "ord" ]; then
  export LITEFS_CANDIDATE=true
else
  export LITEFS_CANDIDATE=false
fi

exec litefs mount

And then in your litefs.yml:

lease:
  candidate: ${LITEFS_CANDIDATE}

(Pardon my terrible bash scripting)

ben-davis commented 1 year ago

I don't have anything particularly useful to say, but just wanted to mention that I've been using litefs on https://coolstuff.app for 6-ish months now with essentially zero issues (beyond bugs I introduced myself with things like fly-replay and some minor issues pre 0.3).

We're still in private beta so traffic is low, but regular usage happens in two regions on v1 apps (syd and yyz, and we'll be adding Europe soon), so it's not a super trivial setup. Our primary is in yyz, with fly-replay used to redirect mutating requests, and we run background workers which handle all their mutations via litefs file descriptor locking (I wrote a simple Python library to handle this which I've been meaning to open source). Beyond that, it's just regular SQLite, so it all just works.

I know it's still technically pre-production, but litefs has been solid for me so far and I have no regrets using it.

benbjohnson commented 1 year ago

Thanks for the feedback, @ben-davis. 🙏 I just wanted to say that coolstuff.app is super snappy! I'm glad to hear LiteFS is working well for you.