superfly / litefs

FUSE-based file system for replicating SQLite databases across a cluster of machines
Apache License 2.0

Object Storage #327

Open benbjohnson opened 1 year ago

benbjohnson commented 1 year ago

Many applications need to store and propagate files that are not SQLite databases (e.g. images). While it's possible to store large binary data in SQLite, it is not efficient. A better approach would be to support non-SQLite files on their own.

Files can be bundled using the same LTX format; however, each one will contain a single page holding all the data in the file. Files must be saved in their entirety and atomically. For files written via FUSE, this corresponds to when a file handle is closed. We may also want to provide a safer atomic HTTP API, as the FUSE approach could still leave half-written files if a process dies while writing.

kentcdodds commented 1 year ago

still have half-written files if a process dies while writing.

Very good issue I hadn't considered. Do you suppose it would be possible to detect this situation and auto-delete the half-written file?

benbjohnson commented 1 year ago

We could require fsync() to finalize the file. That would be a good way to detect a fully written file and then we could discard partial files on close.
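A minimal user-space sketch of the fsync-to-finalize idea, in Python. It is not how LiteFS implements anything: `StagedFile`, the `.partial-` prefix, and the stage-then-rename approach are all assumptions made for illustration. Data only becomes visible at the final path if `sync()` ran after the last write; otherwise `close()` throws the partial file away.

```python
import os
import tempfile


class StagedFile:
    """Hypothetical handle: data is published only if sync() ran before close()."""

    def __init__(self, final_path):
        self.final_path = final_path
        directory = os.path.dirname(os.path.abspath(final_path))
        # Stage in the same directory so the final rename stays on one filesystem.
        fd, self.tmp_path = tempfile.mkstemp(dir=directory, prefix=".partial-")
        self._file = os.fdopen(fd, "wb")
        self._synced = False

    def write(self, data):
        self._synced = False  # new data invalidates any earlier sync
        return self._file.write(data)

    def sync(self):
        # Flush Python's buffer, then ask the OS to persist the bytes.
        self._file.flush()
        os.fsync(self._file.fileno())
        self._synced = True

    def close(self):
        """Return True if the file was published, False if it was discarded."""
        self._file.close()
        if self._synced:
            os.replace(self.tmp_path, self.final_path)  # atomic publish
            return True
        os.remove(self.tmp_path)  # never finalized: discard the partial file
        return False
```

One caveat with this sketch: a process that dies never reaches `close()`, so a leftover `.partial-` file would still need to be swept up later.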

markuswustenberg commented 1 year ago

I'm surprised this is within the scope of LiteFS. I'm trying to understand the use case. Why would you use LiteFS for this, and not use something like S3 directly?

kentcdodds commented 1 year ago

I don't so much care what tech is behind object storage (provided it works well). I'm more interested in limiting the number of service providers I'm using. So if fly says they can host files for me then I'm in.

gedw99 commented 1 year ago

this is a great idea..

Use cases:

  1. I need to store and process images, and I need that URI or CID to that image to be stored in S3 / MinIO so I can relate the image to some data in SQLite.

Running MinIO and LiteFS together is a great combo.

You can mount volumes and map them to MinIO. You need a minimum of 4 drives for proper redundancy with MinIO, so that's 4 volumes. I do this on Hetzner as it's cheap and they now have 4 data centers spread globally.

On Fly I think you can also mount volumes with the new machines system? Can you resize a volume on Fly without disconnecting it? Will that be OK with MinIO?

https://docs.hetzner.com/cloud/volumes/faq

https://fly.io/docs/app-guides/minio/

markuswustenberg commented 1 year ago

@kentcdodds I can definitely understand that perspective. I certainly like simplifying my stack, which is what drove me to SQLite/LiteFS in the first place.

Mine is that I trust AWS S3 immensely with my data, perhaps more so than any other provider. So my optimal use case is LiteFS + backup to S3, and then just pure S3 for object storage. So for something like that, having a LiteFS layer in between seems redundant and just complicating matters. But I guess I'm missing the whole backstory here.

@gedw99 I get what you're saying, but I'm personally not interested in running something like Minio myself. Too much operational complexity for my taste. (It's easy until something goes wrong, IMO. 😉)

So I guess I'm interested in an elaboration of the core issue, and what advantages this would bring compared to using S3 (or something with an S3-like API) directly?

gedw99 commented 1 year ago

@markuswustenberg Been running MinIO for ages. When you get into high TB it saves you a lot of money.

If you're not using high TB then it's pointless, and you might as well just run on someone else's S3.

But Fly does not have one, so then what is the solution?

markuswustenberg commented 1 year ago

@gedw99 Well, the solution for me is to just use AWS S3. 😊

benbjohnson commented 1 year ago

@markuswustenberg This wasn't originally in scope for LiteFS but we've had so many people try to build object storage into SQLite on LiteFS that it seemed useful to better support it. I think S3 is great and it is probably the right solution for a lot of people. However, if you don't have a huge number of objects, storing and serving locally is a lot easier to set up, especially for someone not familiar with the AWS ecosystem.

kentcdodds commented 1 year ago

How far can we take this? At what point could we say that LiteFS can handle "a huge number of objects" or is that outside of scope for the future?

benbjohnson commented 1 year ago

The biggest limitation probably isn't technical but cost. Having all your objects stored on all your nodes is going to cost real money at a certain scale. I'm not sure what that scale is exactly. S3 has pretty cheap storage costs but their bandwidth costs are high. Do you have a ballpark idea of how many objects you're thinking?

kentcdodds commented 1 year ago

I'm just trying to understand the trade-offs. In the context of the Epic Stack, if folks can start their new app idea out building on top of this instead of signing up for thirty different services and stay running like that long enough to prove out their idea then that's a real win.

benbjohnson commented 1 year ago

That's the approach I'm taking with it too. As a back of the envelope calculation, let's assume the objects are images that average 1MB each (which seems high). That's 1GB per 1,000 objects. If they're replicating out to 3 nodes and they're paying $0.15/GB/mo for volumes then that's $0.45 per 1,000 images per month plus any bandwidth to replicate between nodes.

I think that once someone gets to 10,000 or 100,000 objects then they can begin to worry about costs and think about moving their objects to S3.
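The same back-of-the-envelope estimate as a small Python function, in case anyone wants to plug in their own numbers. The function name and parameters are made up for illustration; the defaults are the figures from the comment above (1MB average, 3 nodes, $0.15/GB/mo), and replication bandwidth is left out just as it is there.

```python
def monthly_storage_cost(objects, avg_mb=1.0, nodes=3, usd_per_gb_month=0.15):
    """Storage-only cost in USD per month; replication bandwidth is not included."""
    # 1 GB is treated as 1,000 MB to match the "1GB per 1,000 objects" figure.
    gb_per_node = objects * avg_mb / 1000
    return gb_per_node * nodes * usd_per_gb_month


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} objects: ${monthly_storage_cost(n):.2f}/month")
```

With those defaults, 1,000 objects comes to about $0.45 per month, 10,000 to about $4.50, and 100,000 to about $45.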

kentcdodds commented 1 year ago

Yeah, so the path from LiteFS to S3 should be well paved (unless Fly introduces a more formal offering in the future ‼️), but starting with LiteFS is the simplest by a long shot. And actually has some nice benefits of globally distributed files as well (which AFAIK S3 does not have).

markuswustenberg commented 1 year ago

It's very interesting to read your perspectives. From mine, the operational and cognitive load on "just" using S3 is much smaller than running it on top of the storage layer, but I can see why you could think differently.

Regarding cost, S3 storage is IMO really cheap especially with intelligent tiering turned on. Bandwidth is expensive, agreed. But at a larger scale, I would probably look into putting a CDN or something like Cloudflare's R2 in front anyway.

Looking forward to following this issue.

gedw99 commented 1 year ago

It's very interesting to read your perspectives. From mine, the operational and cognitive load on "just" using S3 is much smaller than running it on top of the storage layer, but I can see why you could think differently.

Regarding cost, S3 storage is IMO really cheap especially with intelligent tiering turned on. Bandwidth is expensive, agreed. But at a larger scale, I would probably look into putting a CDN or something like Cloudflare's R2 in front anyway.

Looking forward to following this issue.

If you go with Cloudflare, Backblaze offers an AWS S3-compatible API with free transfers in and out of Cloudflare. It's about 4 times cheaper than being on AWS S3.

It's also faster than AWS from what I read.

It would be cool if fly.io joined the Bandwidth Alliance. Then you could run the DB (LiteFS) on Fly with Backblaze for storage.

Also, more and more people are getting into Arrow / Flight SQL as a DB backed by S3. Again, if gmt.io did the Bandwidth Alliance thing it would be great.

I run Arrow and Flight SQL on fly.io now. With LiteFS it's a great match because you can CDC your data into Arrow. I'm not doing that yet.

Seafowl is one of the many arrow based systems: https://seafowl.io/docs/getting-started/tutorial-fly-io/part-2-deploying-to-fly-io

https://www.cloudflare.com/bandwidth-alliance/

khrome83 commented 1 year ago

Honestly, I love this idea. I'm looking at both NocoDB and Pocketbase.io as a way to provide an admin UI for the database. Both store images on the file system. While they offer S3 or MinIO, I would much rather have the images closer to the edge. If I ever wanted to do image processing, it would be great.

MinIO, I read, is pretty expensive to run as a separate app, because it's doing more than just replicating the files. The thing I am loving about LiteFS and distributed SQLite is that I am not spinning up additional apps; it simplifies things a lot. So bringing in self-hosted MinIO, or even using an AWS account just for S3, adds complexity I don't want personally. S3 is great, I use it for my job all the time. But for the things I build outside work, I want simple. Given how LiteFS works, file transfers seem like a good fit to me.

effulgentsia commented 1 year ago

For what it's worth, I'm really excited about this feature and look forward to when it gets implemented.

In addition to the use case of apps that allow users to upload images and other files, here's another use case you might not have yet considered. Some apps include an automatic updater. For example, WordPress has one, and Drupal is working on one. Other content management systems might have one too. I'm one of the maintainers of Drupal, so I'll use that as the example here just because that's the one I'm most familiar with.

One way to use Drupal's (upcoming) automatic updater on fly.io would be for the Docker image to contain everything needed to run a PHP/Composer app (e.g., nginx and fpm) but not the Drupal code, and instead for the Drupal code to be initialized on a volume and then treated as user data. Then, when the automatic updater runs and there's a Drupal update to install, it sets a record in the database that puts the site into maintenance mode, then updates the codebase that's on the volume, then runs any database update/migration functions required by the new code, then sets a record in the database that takes the site out of maintenance mode. While the site is in maintenance mode, regular incoming web requests don't get processed (a "site is under maintenance" message is returned instead), in order to avoid processing a request with a codebase that's in a partially updated state.

What I think is exciting about adding object storage into LiteFS is that as I understand it, one would then be able to put the whole code directory into LiteFS, and this codebase update would get correctly propagated to replica machines. Meaning the order of the update process would be preserved due to the sequence preservation of the LiteFS transactions: first the replicas would get the maintenance mode record set, then the replicas would get the file updates, then the replicas would get the database schema updates, and finally the replicas would get the maintenance mode record unset. Whereas trying to preserve this order of operations without LiteFS orchestrating all of it (i.e., trying to replicate the file changes and the database changes separately) would be a pain.
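A toy Python model of that ordering argument. Nothing here is LiteFS's actual API or wire format: `Replica`, the `db:`/`file:` keys, and the four-entry log are all hypothetical, and the only assumption is that a replica applies transactions strictly in the order the primary committed them.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica that applies transactions strictly in txid order."""

    state: dict = field(default_factory=dict)
    next_txid: int = 1

    def apply(self, txid, changes):
        # Refuse gaps or reordering so the replica only passes through
        # states that the primary itself passed through.
        if txid != self.next_txid:
            raise ValueError(f"expected txid {self.next_txid}, got {txid}")
        self.state.update(changes)
        self.next_txid += 1


# The update sequence from the comment, as one ordered log of DB and file changes.
log = [
    (1, {"db:maintenance": True}),
    (2, {"file:index.php": "v2"}),
    (3, {"db:schema": 2}),
    (4, {"db:maintenance": False}),
]

replica = Replica(
    state={"db:maintenance": False, "file:index.php": "v1", "db:schema": 1}
)
for txid, changes in log:
    replica.apply(txid, changes)
    code, schema = replica.state["file:index.php"], replica.state["db:schema"]
    mismatched = (code == "v2") != (schema == 2)
    # New code with the old schema is only ever visible in maintenance mode.
    assert not mismatched or replica.state["db:maintenance"]
```

If the file change and the database changes were replicated through separate channels, nothing would stop step 2 from arriving before step 1, which is exactly the state the maintenance flag is meant to hide.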

let's assume the objects are images that average 1MB each (which seems high). That's 1GB per 1,000 objects...I think that once someone gets to 10,000 or 100,000 objects then they can begin to worry about costs and think about moving their objects to S3.

In the case of code files (like PHP files), average file size is more like 1KB - 10KB, so a 300MB codebase could have more than 100k files, with the economics ($0.15/mo per GB per replica) still easily favoring LiteFS so long as LiteFS can handle this amount of objects.

Also, having all of this in LiteFS Cloud would be awesome, because if you need to do a point-in-time restore, you'd be restoring the database and the codebase together, in a mutually compatible state. Whereas restoring an old state of the database but on a new state of the codebase would be undesirable.

nickchomey commented 1 year ago

I use WordPress and would like to do something similar to @effulgentsia. Though I suspect it would be trickier than with Drupal, particularly the DB migration scripts. Still, such a mechanism built into LiteFS would make it easier.

I'm content to use Cloudflare R2 for actual object (e.g. media) storage though - they handle global replication etc... I frankly don't understand why people use S3 when it is considerably more expensive...

gedw99 commented 1 year ago

MinIO, I read, is pretty expensive

https://garagehq.deuxfleurs.fr

https://garagehq.deuxfleurs.fr/documentation/design/benchmarks/

I changed over to this. Was using MinIO. Garage is Rust-based and uses a simpler way to do replication.

It has a FUSE mount too: https://garagehq.deuxfleurs.fr/documentation/connect/fs/