Switch to using S3 to store file and publishedFile data

mozilla / publish.webmaker.org

The teach.org publishing service for goggles and thimble

Mozilla Public License 2.0

Switch to using S3 to store file and publishedFile data #233

Open gideonthomas opened 8 years ago

gideonthomas commented 8 years ago

We're straining the db quite a bit with I/O for binary blobs. It would be a good perf boost if we instead uploaded all the blobs to S3 and stored links to them in the db, which the client could then use to download the data for a project.
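A minimal sketch of the idea above, assuming nothing about the real publish.webmaker.org schema: the blob goes to an object store and the database row keeps only a key pointing at it. The `object_store` dict stands in for an S3 bucket, the `files` table and its columns are hypothetical, and deriving the key from a SHA-256 hash of the content is an illustrative choice, not something the issue specifies.

```python
import hashlib
import sqlite3

# Stand-in for an S3 bucket: maps object key -> bytes (hypothetical).
# A real service would call an S3 client here instead.
object_store = {}


def make_db():
    """Create an in-memory database with a hypothetical files table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE files (project_id INTEGER, path TEXT, blob_key TEXT)"
    )
    return db


def put_blob(data: bytes) -> str:
    """Store a blob and return its key (the 'link' kept in the database)."""
    key = "files/" + hashlib.sha256(data).hexdigest()
    object_store[key] = data
    return key


def save_file(db, project_id: int, path: str, data: bytes) -> str:
    """Upload the blob, then record only its key in the files table."""
    key = put_blob(data)
    db.execute(
        "INSERT INTO files (project_id, path, blob_key) VALUES (?, ?, ?)",
        (project_id, path, key),
    )
    return key


def load_file(db, project_id: int, path: str) -> bytes:
    """Look up the key in the database, then fetch the bytes from the store."""
    row = db.execute(
        "SELECT blob_key FROM files WHERE project_id = ? AND path = ?",
        (project_id, path),
    ).fetchone()
    return object_store[row[0]]
```

With this shape the database only ever reads and writes short strings; the binary payload never passes through it.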

cc @cadecairos

shubham0794x commented 7 years ago

@gideonthomas Sir, I would like to contribute a fix for this issue.

Pomax commented 7 years ago

I want to advise caution here - just moving things to S3 does not solve the actual problem of storing full blobs every single time. Offloading to S3 just passes the buck: now we have an AWS bucket that requires monitoring for stale data, with an extra level of indirection.

In the hopes of fixing the problem, rather than the symptom, what is the actual problem that is causing the database problems?

/cc @humphd

cadecairos commented 7 years ago

S3 is cheap enough for us to not really worry about stale data - we've got millions of objects in S3 and I'm pretty sure the bill is under ten bucks.

gideonthomas commented 7 years ago

The problem is the limit on the number of concurrent connections to the db while writing files, which blocks it from handling other requests. The same goes for the server handling the requests. I don't believe S3 has the same concurrency issues (but maybe @cadecairos can confirm), and it can handle load a lot better than our pg database. It would also let us handle more server requests, since the upload would happen directly from the client, leaving the server free to handle other requests.
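The "upload directly from the client" flow can be sketched roughly as follows. With S3 this is normally done with presigned URLs; the toy below is not AWS's signing scheme, just an HMAC stand-in to show the shape: the server hands out a short-lived signed ticket (a cheap call), and the file bytes go from the client to the store without passing through the server. `SECRET`, `bucket`, `issue_upload_ticket`, and `store_accept_upload` are all hypothetical names.

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # hypothetical key shared by server and store
bucket = {}                     # stand-in for the object store


def sign(key: str, expires: int) -> str:
    """Sign the object key together with its expiry time."""
    msg = f"{key}:{expires}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def issue_upload_ticket(key: str, ttl: int = 60, now=None) -> dict:
    """Server side: no file bytes are handled here, only a small ticket."""
    now = time.time() if now is None else now
    expires = int(now) + ttl
    return {"key": key, "expires": expires, "signature": sign(key, expires)}


def store_accept_upload(ticket: dict, data: bytes, now=None) -> bool:
    """Store side: verify the ticket, then accept bytes from the client."""
    now = time.time() if now is None else now
    expected = sign(ticket["key"], ticket["expires"])
    if not hmac.compare_digest(expected, ticket["signature"]):
        return False  # ticket was tampered with
    if now > ticket["expires"]:
        return False  # ticket expired
    bucket[ticket["key"]] = data
    return True
```

The server's work per file shrinks to issuing a ticket and, afterwards, recording the key; whether that actually relieves the connection limit described above would still need to be measured.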

humphd commented 7 years ago

Scaling concurrent writes is usually done with a queue of some kind, and AWS has a bunch of these available. Whatever our eventual backend, why don't we just use a queue to allow Thimble to send files to be saved, and then leave it with AWS to get to it?
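As a rough illustration of the queue approach, here is an in-process sketch: request handlers enqueue a save job and return immediately, while a single worker drains the queue and performs the slow writes. A hosted queue service would replace `queue.Queue` in practice; `start_writer` and the `save` callback are hypothetical names for this example.

```python
import queue
import threading


def start_writer(save, maxsize: int = 100):
    """Start one background worker that drains a queue of (path, data) jobs.

    `save` is whatever function performs the actual write to the backend.
    """
    jobs = queue.Queue(maxsize=maxsize)

    def worker():
        while True:
            job = jobs.get()
            try:
                if job is None:  # sentinel: stop the worker
                    return
                path, data = job
                save(path, data)
            finally:
                jobs.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return jobs, thread
```

Because only the worker talks to the backend, the number of concurrent backend connections stays fixed no matter how many requests arrive; the bounded `maxsize` makes producers wait instead of letting the backlog grow without limit.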

gideonthomas commented 7 years ago

I'm not entirely sure about this, but concurrent writes on S3 are managed by S3 itself (whether with a queue or some other strategy), which is why I think we wouldn't have to worry about concurrency at all. If our backend is something else, however, then as you mention we would have to scale it with a queue like SQS, or maintain one on the server to manage load, which means we would have to configure and manage it. I would rather have S3 deal with it on its own (assuming that it does) than write logic to manage and use a queue to scale concurrent connections.