Open gideonthomas opened 8 years ago
@gideonthomas Sir, I would like to contribute to fixing this issue.
I want to advise caution here - just moving things to S3 does not solve the actual problem of storing full blobs every single time. Offloading to S3 just passes the buck: now we have an AWS bucket that requires monitoring for stale data, with an extra level of indirection.
In the hope of fixing the problem rather than the symptom: what is actually causing the database problems?
/cc @humphd
S3 is cheap enough for us to not really worry about stale data - we've got millions of objects in S3 and I'm pretty sure the bill is under ten bucks.
The problem is the limit on the number of concurrent connections to the db while writing files, which blocks it from handling other requests. The same goes for the server handling the requests. I don't believe S3 has the same concurrency issues (but maybe @cadecairos can confirm), and it can handle load a lot better than our pg database. It will also let us handle more server requests, since uploads will now happen directly from the client, leaving the server free to handle other requests.
Scaling concurrent writes is usually done with a queue of some kind, and AWS has a bunch of these available. Whatever our eventual backend, why don't we just use a queue to allow Thimble to send files to be saved, and then leave it with AWS to get to it?
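To make the queue idea concrete, here is a minimal sketch, assuming an in-process `queue.Queue` as a stand-in for a managed service such as SQS. The names `save_file`, `store`, and `run_demo` are hypothetical and not part of Thimble; the point is only that the number of simultaneous backend writes is capped by the number of workers, however many files are submitted.

```python
import queue
import threading

# Hypothetical stand-in for the eventual backend (Postgres, S3, ...).
store = {}
store_lock = threading.Lock()


def save_file(path, data):
    """Pretend to persist one file; a real backend write would go here."""
    with store_lock:
        store[path] = data


def worker(jobs):
    """Take jobs off the queue one at a time until a None sentinel arrives."""
    while True:
        job = jobs.get()
        try:
            if job is None:
                return
            path, data = job
            save_file(path, data)
        finally:
            jobs.task_done()


def run_demo(num_workers=2):
    """Submit ten files; at most num_workers writes run concurrently."""
    jobs = queue.Queue(maxsize=100)  # bounded, so producers block when it is full
    threads = [threading.Thread(target=worker, args=(jobs,)) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for i in range(10):
        jobs.put((f"/project/file{i}.html", b"<p>hi</p>"))
    for _ in threads:
        jobs.put(None)  # one shutdown sentinel per worker
    jobs.join()
    for t in threads:
        t.join()
    return store
```

With a hosted queue the producer and the workers would live in different processes, but the shape is the same: the sender returns as soon as the job is enqueued, and the backend is only ever hit by a fixed number of writers.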
I'm not entirely sure about this, but concurrent writes on S3 are managed by S3 itself (whether with a queue or some other strategy), which is why I think we wouldn't have to worry about concurrency at all. If our backend is something else, however, then as you mention we would have to use a queue like SQS, or maintain one on the server, to manage load, which means we would have to configure and manage it. I would rather have S3 deal with it on its own (assuming that it does) than write logic to manage and use a queue to scale concurrent connections.
We're straining the db quite a bit with I/O for binary blobs. It would be a good perf boost if we instead uploaded all the blobs to S3 and stored links to them, which the client can then use to download the data for the project.
cc @cadecairos
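As a rough illustration of "store links instead of blobs", here is a minimal sketch. It assumes a plain dictionary as a stand-in for the S3 bucket and a made-up `BUCKET_URL`; `FileRecord`, `object_key`, `upload_blob`, and `download_blob` are hypothetical names, not existing Thimble or AWS APIs. The database row keeps only metadata and a URL, while the bytes live in the object store.

```python
from dataclasses import dataclass

# Hypothetical stand-ins: a dict plays the role of the S3 bucket,
# and BUCKET_URL is a placeholder, not a real endpoint.
object_store = {}
BUCKET_URL = "https://example-bucket.invalid"


@dataclass
class FileRecord:
    """What the database would keep: metadata plus a link, no binary payload."""
    project_id: int
    path: str
    url: str


def object_key(project_id, path):
    """Build the object key for a project file."""
    return f"projects/{project_id}/{path.lstrip('/')}"


def upload_blob(project_id, path, data):
    """Put the bytes in the object store and return the row to save in the db."""
    key = object_key(project_id, path)
    object_store[key] = data
    return FileRecord(project_id, path, f"{BUCKET_URL}/{key}")


def download_blob(record):
    """Follow the stored link back to the bytes, as the client would."""
    key = record.url[len(BUCKET_URL) + 1:]
    return object_store[key]
```

For example, `upload_blob(7, "/img/logo.png", data)` yields a record whose `url` ends in `projects/7/img/logo.png`, and `download_blob` on that record returns the original bytes. In a real setup the upload would go from the client straight to S3, and only the small record would pass through the server and the database.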