ito-org / api-backend

Backend for ito

Scalability discussion #11

Open Dionysusnu opened 4 years ago

Dionysusnu commented 4 years ago

General issue for discussing scalability and the other concerns that come with it.

kiwo commented 4 years ago

The current idea (TCN) is to put a CDN in front of the app. The backend itself doesn't need that much.

kreativmonkey commented 4 years ago

https://github.com/awesome-selfhosted/awesome-selfhosted#faasserverless

Edit: this link lists various serverless open-source solutions.

raethlein commented 4 years ago

The dp3t whitepaper mentions an estimated data size, maybe those numbers help to come to a data-driven decision (especially with respect to database size and upload traffic):

> Smartphones download a manageable amount of data every day. We assume that the contagious window of infected patients is on average 5 days. For 40.000 new infections per day (approximately the cumulative total of 5 European countries at their contagion peak) smartphones then need to download 110 MB each day. For a smaller country, such as Switzerland, with 2.000 infections a day (at contagion peak), smartphones need to download 5.5 MB each day (assuming a 14 day history).
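As a rough sanity check of those figures (my own back-of-envelope, assuming roughly one key record per infected person per day of the 14-day history): 40,000 × 14 ≈ 560,000 records per day, and 110 MB / 560,000 ≈ 200 bytes per record. The Swiss numbers give the same ratio (5.5 MB / (2,000 × 14) ≈ 200 bytes), so the daily download volume scales linearly with the number of new infections.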

Addono commented 4 years ago

My take at least for deploying the backend: A managed Kubernetes cluster on a public cloud provider, e.g. GCP, AWS, Azure or DigitalOcean. It's not especially cheap, but very reliable and flexible.

How do we make sure the API can handle the user loads, once the app goes live?

Deploy with horizontal autoscaling, both for the nodes (cluster size) and deployments (containers running the application). In addition, I would add a circuit-breaker on top of the backend. An additional caching layer could either be done in the cluster, or be handled by a dedicated CDN provider.
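For the deployment side, the pod-autoscaling part could look something like the sketch below, written with the official Kubernetes Python client (my own illustration; the `api-backend` Deployment name and the 2-20 replica / 70% CPU numbers are placeholders, and node autoscaling is configured separately at the cluster level):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

# Scale the (hypothetical) "api-backend" Deployment between 2 and 20 replicas,
# targeting 70% average CPU utilisation across the pods.
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="api-backend"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="api-backend"
        ),
        min_replicas=2,
        max_replicas=20,
        target_cpu_utilization_percentage=70,
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```

The same object is usually declared as a manifest and applied with `kubectl`; the client call is just a compact way to show which knobs matter (target ref, replica bounds, CPU target).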

How does the database cope with it?

Probably the hardest part to scale. There's a Helm chart for high availability: https://github.com/bitnami/charts/tree/master/bitnami/postgresql-ha. (A managed Postgres DB could also be an option, to offload the complexity of deploying and scaling the DB to a specialized party; however, those tend to become very expensive.)

How do we protect against DDoS attacks? Do we use a service like CloudFlare?

Rate limiting at Ingress level, e.g. https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#rate-limiting. Additionally, something like CloudFlare could definitely help, as the DDoS protection is in their free tier. The downside is that all traffic goes through their servers, so no clue if that's a blocker w.r.t. privacy.
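For illustration, such a per-client rate limit is set via an annotation on the ingress; a sketch with the Kubernetes Python client (my own example; the ingress name, namespace and the 10 req/s value are placeholders):

```python
from kubernetes import client, config

config.load_kube_config()

# Patch the (hypothetical) "api-backend" ingress so the NGINX ingress controller
# rejects clients that exceed roughly 10 requests per second per IP.
client.NetworkingV1Api().patch_namespaced_ingress(
    name="api-backend",
    namespace="default",
    body={
        "metadata": {
            "annotations": {
                "nginx.ingress.kubernetes.io/limit-rps": "10",
            }
        }
    },
)
```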

MyDigitalLife commented 4 years ago

I would also recommend K8S for application hosting, it will make scaling fairly easy. There are alternatives, like serverless (e.g. AWS Lambda), that should also be looked at.

Regarding the scalability of PostgreSQL, I would recommend not hosting it in the k8s cluster but using the managed option from the cloud provider. It takes a lot of maintenance work off your hands, like updates and backups.

From what I understand of the current database, it won't be very complex, but if many people use the app the database will get bombarded with requests. A good solution for that would probably be to add a Redis cache between the application and the database. That should take off the load and scales pretty well.

Addono commented 4 years ago

> I would also recommend K8S for application hosting, it will make scaling fairly easy. There are alternatives, like serverless (e.g. AWS Lambda), that should also be looked at.

I love serverless, however Lambda can become very expensive as the API Gateway costs ramp up quickly with high numbers of invocations.

> From what I understand of the current database, it won't be very complex, but if many people use the app the database will get bombarded with requests. A good solution for that would probably be to add a Redis cache between the application and the database. That should take off the load and scales pretty well.

Such an approach - where the application is aware of the caching - reminds me of how Netflix solves this problem. Basically, the application pushes changes to their caching service, which in turn is in charge of allowing all incoming requests to be served.

Their tech stack is mature and proven to scale - they have been using it at scale for several years now. However, it's also rather complex and only works for data which can be "produced" - you need to pre-compute all cached data as there's no way to back-propagate cache misses. https://netflixtechblog.com/re-architecting-the-video-gatekeeper-f7b0ac2f6b00

MyDigitalLife commented 4 years ago

I had a simpler cache mechanism in mind: just a query-result cache. Basically, if you want to run a query, you first look up a key in the Redis cache. If the key exists you return the data stored under that key; if not, you run the query on the DB server. When you get a result from the DB server you write that data to the Redis cache under that key with an invalidation time. AWS has some information about this: https://d0.awsstatic.com/whitepapers/Database/database-caching-strategies-using-redis.pdf
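A minimal sketch of that query-result cache in Python (my own illustration; the `reports` table, column names and connection details are placeholders, not the actual schema):

```python
import json

import psycopg2
import redis

r = redis.Redis(host="localhost", port=6379)
db = psycopg2.connect("dbname=ito")  # illustrative connection string

CACHE_TTL = 10  # seconds before the cached result is invalidated


def get_reports_since(timestamp: int):
    """Cache-aside lookup: try Redis first, fall back to Postgres on a miss."""
    key = f"reports:{timestamp}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no DB query needed

    # Cache miss: run the query against the database ...
    with db.cursor() as cur:
        cur.execute("SELECT payload FROM reports WHERE created_at >= %s", (timestamp,))
        rows = [row[0] for row in cur.fetchall()]

    # ... and store the result with an expiry so it self-invalidates.
    r.set(key, json.dumps(rows), ex=CACHE_TTL)
    return rows
```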

The result of this will be as follows: if your application gets 10,000 requests a minute and every request does a call to the DB, you get 10,000 requests on that DB. With the Redis cache you could cache that result for 10 seconds. That way the database sees one query every 10 seconds, so only 6 requests per minute.

This is just an example and not entirely realistic, but it's a basic way of making sure a database doesn't get overwhelmed by your own application. Redis is really fast at returning these results, as all the data is stored in memory, and it has nice scaling options.

The Netflix way of doing it is probably better, but also a lot more complex. I would start out with an easy cache solution, see if that works, and only add complexity if needed.

Addono commented 4 years ago

+1 for referencing patterns. I agree that a simple solution is preferable until it reaches its scalability issues.

One issue which I do not see explicitly covered in AWS' whitepaper is how to make sure that, on a cache miss, similar concurrent requests will not also hit the DB. E.g. if you have 10k requests/second all hitting the same cache key and on a cache miss it takes 0.1s to resolve the query, then there's a 0.1s window in which all requests will hit the DB in an attempt to repopulate the cache. Without any measures, 10k * 0.1 = 1k requests will hit your DB - which scales linearly with the amount of traffic and degrades DB performance. A write-through cache would not have this issue; however, then you still need to implement something which writes all possible queries to the cache, for which no design patterns are offered.

Anyway, I don't see any real blockers, just unanswered questions, for each of which various solutions already come to mind. Not something to worry about right now.

rbu commented 4 years ago

> One issue which I do not see explicitly covered in AWS' whitepaper is how to make sure that, on a cache miss, similar concurrent requests will not also hit the DB. E.g. if you have 10k requests/second all hitting the same cache key and on a cache miss it takes 0.1s to resolve the query, then there's a 0.1s window in which all requests will hit the DB in an attempt to repopulate the cache. Without any measures, 10k * 0.1 = 1k requests will hit your DB - which scales linearly with the amount of traffic and degrades DB performance.

Good catch! This problem (which is a version of the Thundering Herd Problem) is most easily solved at the cache layer, so depending on where you implement caches, it might be possible to use an existing solution. For example, if you're running nginx as a reverse proxy cache, you'll probably use proxy_cache_lock and proxy_cache_use_stale. Unfortunately, I don't think AWS ELB/CloudFront supports this.
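If there is no proxy cache in front, the same lock-on-miss idea can also be approximated in the application itself; a rough sketch of my own, assuming Redis, with illustrative names and timeouts:

```python
import json
import time

import redis

r = redis.Redis()
CACHE_TTL = 10  # seconds a cached result stays valid
LOCK_TTL = 2    # upper bound on how long one refresh may take


def get_cached(key: str, compute):
    """Cache lookup where only one caller repopulates the key on a miss."""
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    # Cache miss: try to become the single request that hits the DB.
    # SET NX succeeds only for the first caller; everyone else waits.
    if r.set(f"lock:{key}", "1", nx=True, ex=LOCK_TTL):
        value = compute()  # the expensive DB query
        r.set(key, json.dumps(value), ex=CACHE_TTL)
        r.delete(f"lock:{key}")
        return value

    # Someone else is already refreshing the key: poll briefly for their result.
    for _ in range(20):
        time.sleep(0.05)
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
    return compute()  # last resort: give up waiting and query the DB anyway
```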

Alternatively, given you have implemented retry and exponential backoff on the client side, the server can afford to not answer / defer answering a request. This allows you to use: