rivernews / slack-middleware-server

This server acts as a middleware to communicate with the Slack API.

Search for a more cost-efficient cloud provider #62

Closed · rivernews closed this issue 4 years ago

rivernews commented 4 years ago

This ticket also deals with the vision of this project. If we want to scale while paying a reasonable cloud bill, we need a more flexible way to run a Kubernetes cluster, and using Kubernetes as a service definitely limits us on that path.

Ideally something like AWS Fargate would work best: if we can lower the cost when our cluster is idle, then we can afford more concurrency when scraper jobs fire up.

AWS route: more RAM w/ cost efficiency

Several requirements about provisioning on AWS

Elastic scale route: save cost

Approach 1: AWS Fargate, or any other container service

Approach 2: K8 auto-scaling, K8 API

Approach 3: manual slack command in K8s

This is supposed to be the most feasible, fastest-to-start approach: no need to look for another platform. The idea is to run SLK on a low-cost node; SLK has to be up all the time in order to receive manual scale-up / scale-down commands. In other words, this approach uses SLK as the platform for manually scaling up and down, which should save cost and avoid keeping an expensive node running when no scraper job is present.
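As a rough illustration only (not code that exists in this repo yet), the manual scale command could be a slash-command endpoint in SLK that acknowledges Slack quickly and then calls whatever scaling helpers we end up writing; `scaleUpWorkerNodePool` / `scaleDownWorkerNodePool` below are hypothetical names.

```ts
import express from 'express';
// Hypothetical helpers -- the real scaling logic is what the rest of this ticket explores
import { scaleUpWorkerNodePool, scaleDownWorkerNodePool } from './nodePool';

const app = express();
app.use(express.urlencoded({ extended: true })); // Slack sends slash commands as form data

app.post('/slack/command/scale', async (req, res) => {
  const direction = (req.body.text || '').trim(); // e.g. "/scale up" or "/scale down"

  // Acknowledge within Slack's 3-second window, then do the real work
  res.json({ response_type: 'ephemeral', text: `Scaling ${direction}...` });

  if (direction === 'up') {
    await scaleUpWorkerNodePool();
  } else if (direction === 'down') {
    await scaleDownWorkerNodePool();
  }
});

app.listen(8080);
```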

rivernews commented 4 years ago

Plan for elastic scaling

If SLK can programmatically spin up a container for both the selenium server and the scraper job, that would be awesome. See the AWS SDK for JavaScript on GitHub and the AWS CDK for EC2 npm page. Spec for each container:

One thing we want to check: do VPCs and subnets cost anything? If not, we can use tf to create them beforehand; otherwise, we may want to include them in SLK's dynamic resource-creation logic.
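(For the record, VPCs and subnets themselves are free on AWS; it's attachments like NAT gateways and elastic IPs that cost, so pre-creating them with tf is viable.) A rough sketch of what programmatic spin-up could look like with the AWS SDK v3 for JavaScript; the AMI, instance type, and subnet below are placeholders, not a decided spec.

```ts
import { EC2Client, RunInstancesCommand } from '@aws-sdk/client-ec2';

const ec2 = new EC2Client({ region: 'us-east-1' });

// Placeholder spec -- the real per-container spec is still TBD above
async function launchScraperInstance(): Promise<string | undefined> {
  const result = await ec2.send(
    new RunInstancesCommand({
      ImageId: 'ami-xxxxxxxx',     // hypothetical AMI with docker / selenium baked in
      InstanceType: 't3.medium',
      MinCount: 1,
      MaxCount: 1,
      SubnetId: 'subnet-xxxxxxxx', // subnet pre-created by tf
    })
  );
  return result.Instances?.[0]?.InstanceId;
}
```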

Plan for more RAM

rivernews commented 4 years ago

Elastic Approach

Looks like the elastic approach is probably the most cost-efficient one. The idea is basically:

rivernews commented 4 years ago

K8S Elastic Approach

Looks like creating the node pool and its nodes via the Node.js DO client is quite troublesome. The node inside gets stuck at "provisioning". Why is that?

Another method is to use tf to create a separate node pool with a node count of 0. SLK then just updates "count" when we need to scale up, and polls to check node readiness.
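A minimal sketch of that update-and-poll flow from SLK's side, calling the DigitalOcean REST API directly; the cluster/pool IDs, the pool name, and the exact response fields (`node_pool.nodes[].status.state`) are assumptions to verify against the DO API docs.

```ts
import axios from 'axios';

const DO_API = 'https://api.digitalocean.com/v2';
const headers = { Authorization: `Bearer ${process.env.DIGITALOCEAN_TOKEN}` };

// Resize the pre-created (count=0) scraper node pool, then poll until nodes report ready.
async function scaleNodePool(clusterId: string, poolId: string, count: number) {
  await axios.put(
    `${DO_API}/kubernetes/clusters/${clusterId}/node_pools/${poolId}`,
    { name: 'scraper-pool', count }, // pool name is hypothetical
    { headers }
  );

  // Poll every 30s until every node has left the "provisioning" state
  while (true) {
    const { data } = await axios.get(
      `${DO_API}/kubernetes/clusters/${clusterId}/node_pools/${poolId}`,
      { headers }
    );
    const nodes = data.node_pool.nodes || [];
    if (nodes.length >= count && nodes.every((n: any) => n.status.state === 'running')) {
      return;
    }
    await new Promise(resolve => setTimeout(resolve, 30 * 1000));
  }
}
```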

rivernews commented 4 years ago

Practical approach on k8s elastic

rivernews commented 4 years ago

Tweaking for appropriate node size

Configuration: 2 vCPU / 4G RAM for selenium; SLK & the rest of the k8s infra on 1 vCPU / 2G RAM. Running 1 concurrent scraper job.

Primary node

Scraper job CPU usage is negligible. (image)

Scraper job consumes around 177 MB. SLK uses 29 MB. We're not testing SLK here because we're using the local dev SLK. (image)

Worker node

Only running selenium. 1 scraper uses around 0.8 CPU. (image)

1 scraper uses up to 900 MB. (image)


Services on the primary node like grafana are getting a bit slow to respond, so perhaps we should create scraper jobs on the worker node as well.
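If we do push scraper jobs onto the worker node, one sketch (assuming DOKS's per-pool node label `doks.digitalocean.com/node-pool` and a hypothetical pool name) is to pin the job's pod spec with a nodeSelector and give it requests roughly matching the numbers observed above:

```ts
// Fragment of the scraper job's pod template spec (plain object as passed to the k8s client).
// The node-pool label key is what DOKS applies to its nodes; the pool name is hypothetical.
const scraperPodSpec = {
  nodeSelector: {
    'doks.digitalocean.com/node-pool': 'scraper-worker-pool',
  },
  containers: [
    {
      name: 'scraper',
      image: 'scraper:latest', // placeholder image
      resources: {
        requests: { memory: '200Mi', cpu: '100m' }, // ~177 MB + negligible CPU observed above
        limits: { memory: '1Gi' },
      },
    },
  ],
};
```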


Problems

rivernews commented 4 years ago

Tuning Performance & Throughput

Several things we want to try:


Benchmarking

All k8s scraper jobs & selenium are running on the worker node.

Primary Node:

SLK memory usage:

- Initial: 60 MB
- 4 sandbox processes: 175 MB (+115 MB, ~29 MB/process)
- 10 sandbox processes: spike 740 MB (+680 MB, 68 MB/process); steady 600 MB (+540 MB, 54 MB/process)

Node memory usage: 2.1G/4G, around 50%. Estimated remaining capacity: safely +1G workload == at least 10 more sandboxes == 20 sandboxes total. (image)

Worker Node: Selenium, 4 sessions: 1G-3G (250-750 MB/session), average 2.3G (575 MB/session). Java scraper container, 4 jobs: 180-200 MB per job, 720-800 MB total. Actual (incl. overlapping time): 8 k8s jobs concurrently, 1.7G total.

Node memory usage: spike 5G/8G, steady 4.5G/8G. Estimated remaining capacity: safely +2G workload == 2-3 more k8s jobs == 6-7 k8s jobs total. (image)

rivernews commented 4 years ago

Problem

Tuning Profile

rivernews commented 4 years ago

Final Milestone

While there's a lot of room for improvement, we can set a final milestone here just to achieve two things:

Steps

Side notes

rivernews commented 4 years ago

Standalone Approach

Things were going well until we hit the challenges shown here. (image)

As you can see, there's a problem with the k8s node assignment algorithm. Before the whole cluster went nuts, besides the 2+2 job-switching overlap (which is dangerous too, though we can lower the memory request), there was an additional job assigned to this node, so it had 5 jobs running concurrently at that moment.

Looks like we can't trust k8s's node assignment. We can lower the memory request so that a job doesn't claim so much memory at the beginning and only claims more when it needs it; we can do that. But we don't have control over node assignment.

Unless k8s has some additional parameter to configure this, we will need to implement this node distribution algorithm on our own.
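For reference, k8s does expose pod anti-affinity on the pod spec, but it only expresses "prefer/require not to co-locate pods with this label", not a hard "at most N jobs per node" cap, which is presumably why a custom approach like the redis semaphore below is worth exploring. A sketch of the soft-spreading variant (the `app: scraper-job` label is hypothetical):

```ts
// Sketch: soft spreading of scraper pods across nodes via pod anti-affinity.
// This is "prefer not to co-locate", not a per-node concurrency cap.
const affinity = {
  podAntiAffinity: {
    preferredDuringSchedulingIgnoredDuringExecution: [
      {
        weight: 100,
        podAffinityTerm: {
          labelSelector: { matchLabels: { app: 'scraper-job' } }, // hypothetical label
          topologyKey: 'kubernetes.io/hostname',
        },
      },
    ],
  },
};
```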

Two ways:

rivernews commented 4 years ago

Challenges implementing anti-affinity by redis semaphore

The semaphore objects are possibly created across different node processes. We ran into two issues:

Some ideas on why the previous work succeeded:
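For the semaphore itself, here is a stripped-down sketch of a per-node counting semaphore on ioredis. It deliberately omits TTL/crash cleanup, which is exactly the part that gets hard when the semaphore is touched from different node processes; the key naming and limit handling are assumptions, not the code currently in SLK.

```ts
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL || 'redis://localhost:6379');

// Counting semaphore keyed per k8s node: at most `limit` concurrent jobs per node.
// Simplified sketch -- real code needs expiry / cleanup for crashed holders.
async function acquireNodeSlot(nodeName: string, limit: number): Promise<boolean> {
  const key = `semaphore:jobs:${nodeName}`;
  const current = await redis.incr(key);
  if (current > limit) {
    await redis.decr(key); // over capacity, roll back the claim
    return false;
  }
  return true;
}

async function releaseNodeSlot(nodeName: string): Promise<void> {
  await redis.decr(`semaphore:jobs:${nodeName}`);
}
```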

rivernews commented 4 years ago

Overlap memory usage issue

During job switches, it looks like a k8s job does not immediately release its memory.

Some ideas to tackle this situation

Shape of one node (image). Another (image). It doesn't look like the overlapping issue is solved, but it's slightly better than before. One guess: the job "object" is deleted, but its pods remain there occupying resources.
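One way to act on that guess (a sketch only; the positional argument order varies across `@kubernetes/client-node` versions) is to delete the job with a foreground propagation policy so its pods are removed together with it:

```ts
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const batchApi = kc.makeApiClient(k8s.BatchV1Api);

// Delete the job AND its pods so memory is actually released before the next job lands.
async function deleteScraperJob(jobName: string, namespace = 'default') {
  await batchApi.deleteNamespacedJob(
    jobName,
    namespace,
    undefined,   // pretty
    undefined,   // dryRun
    undefined,   // gracePeriodSeconds
    undefined,   // orphanDependents (deprecated)
    'Foreground' // propagationPolicy: cascade the delete to the job's pods
  );
}
```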

Looks like it works perfectly now! (image)

And another node. (image)

Final check

rivernews commented 4 years ago

Scaling SLK

Recap the workflow


The core node pool on a 1v2G droplet is too small and unstable. We got network and nginx crashes even when the SLK nodes were fine, so we had to upgrade the core node pool droplet size. In the end it somewhat defeats the point: we can just use a single 4v8G droplet for both core and SLK, and that's a firm $40 per month. You can scale the entire k8s cluster down with terraform; right, it's not perfect in that sense. I don't know, maybe we need a separate k8s cluster to "manage" the k8s cluster used for scraping.

Also, we would like to populate many companies currently missing from our database, including:

rivernews commented 4 years ago

Scaling up and down is quite stable right now. Further automation would be really hard while maintaining the same cost level of around $20-40 monthly, and can't really save us money while keeping the ability to scale up.

Summary

The current max capacity is 60 jobs, using a core droplet size of 8G RAM, either 4v8G or memory-optimized 1v8G. The base monthly bill at this memory size is $40. But we figured out a way to bypass letsencrypt's duplicate-certificate limit, so we can always scale down the entire k8s cluster. Of course we still have to do this in the terminal. It would be ideal to have a meta-service running that at least triggers a travis job to provision / delete the k8s cluster (see the sketch below). Perhaps heroku could be a good place to run it, given its free plan.

The scaling-up cost is additional, using 2v4G machines.
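On the meta-service idea above, the Travis trigger itself is just one API call against the Travis v3 API; a sketch with the repo slug, branch, and token handling as placeholders:

```ts
import axios from 'axios';

// Trigger a Travis CI build that would run the terraform provision / destroy for the cluster.
// 'OWNER%2FREPO' is a placeholder for the URL-encoded repo slug.
async function triggerClusterBuild(branch: string) {
  await axios.post(
    'https://api.travis-ci.com/repo/OWNER%2FREPO/requests',
    { request: { branch } },
    {
      headers: {
        'Travis-API-Version': '3',
        Authorization: `token ${process.env.TRAVIS_API_TOKEN}`,
      },
    }
  );
}
```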