Scale down pod and node redundancy to reduce resource usage on development clusters

cameel commented 5 years ago

The current clusters are too big for the amount of use they actually get. All of them have multiple nodes and multiple pod instances. The primary motivation was to catch problems caused by things running in parallel and by pods being rescheduled to different nodes. Unfortunately, doing this on every cluster adds up quickly. Now that development is slowing down, this is less important and we can cut back on resources to reduce the cost.

Note: It's best to do #376 first.

1. Reduce the number of pod instances on every development cluster:
    - One concent-api instance with 3 gunicorn workers.
    - One conductor and gatekeeper instance with 2 gunicorn workers.
    - One instance of each pod running a RabbitMQ worker.
2. Use smaller machines for the nodes in the cluster (less CPU and RAM). If you're already using the smallest ones and they're not fully utilized, reduce the number of nodes.
    - See if it's better to run multiple small machines or one bigger one. If the difference is significant, choose the cheaper option. If not, prefer multiple machines.
3. Reduce the sizes of the disks attached to the machines.
    - Machines in Kubernetes clusters often have 100 GB disks attached. The only pod that stores a significant amount of files is the verifier. All the others would run just fine on machines with smaller disks. Try to attach a larger disk to only one of the nodes and set the verifier's affinity so that it always runs on that node.
4. Run the integration tests and see if the new configuration works well enough. If you see severe performance problems or timeouts, increase the numbers a little to make it work better.
5. If the final configuration differs from the above, write a comment explaining what you actually did.
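
For the disk step, here is a minimal sketch of what pinning the verifier to the node with the larger disk could look like. It builds the `affinity` stanza of a pod spec as a Python dict and prints it as JSON. The label key `disk-size` and value `large` are hypothetical; nothing in this issue defines them, and the node with the big disk would have to be labeled first (e.g. with `kubectl label nodes <node-name> disk-size=large`). How this fits into the actual deployment templates is not covered here.

```python
import json

# Hypothetical node label; it is an assumption made for this sketch and
# does not come from the existing deployment configuration.
LARGE_DISK_LABEL_KEY = "disk-size"
LARGE_DISK_LABEL_VALUE = "large"


def verifier_affinity(label_key: str, label_value: str) -> dict:
    """Build a pod-spec `affinity` stanza that only allows scheduling
    on nodes carrying the label `label_key=label_value`."""
    return {
        "nodeAffinity": {
            # Hard requirement: the pod stays pending if no node matches.
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": label_key,
                                "operator": "In",
                                "values": [label_value],
                            }
                        ]
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    stanza = {"affinity": verifier_affinity(LARGE_DISK_LABEL_KEY, LARGE_DISK_LABEL_VALUE)}
    print(json.dumps(stanza, indent=2))
```

A required affinity is used here because the verifier needs the disk space; a plain `nodeSelector` with the same label would probably work just as well for a single-label match.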

cameel commented 5 years ago

@bartoszbetka What's the status of this task? What has been done and what remains to be done?

cameel commented 5 years ago

@bartoszbetka I looked at the billing more closely and here are some more things that we could cut down on:

1. I see that we're paying $10-15/month just for static IPs. They're supposed to be free as long as they're attached to machines. Please check which IPs we're paying for and either delete them or (if we do need them) tell me what they're for and temporarily attach them to some machine(s). Are these the mainnet IPs?
2. The concent-staging-storage disks are too big: 1.5 TB in total. Given how much they're actually used, I think that 30-50 GB would be more than enough.
    - By the way, please rename the disks and other resources used for dev so that it's clear what they belong to. For example, concent-storage should really be called concent-dev-storage.
3. The instances used for clusters running old versions should be smaller. For example, right now both staging clusters use 4 cores, which is a waste; the older one could be the size of the dev cluster. By design, the older version will receive less traffic. Especially on staging it won't be used much and is there only so that we can test support for multiple versions.
4. We really need to redeploy testnet with fewer machines, without geth and with less storage. It accounts for most of the cost.
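
To put the staging storage numbers in perspective, here is a rough back-of-the-envelope sketch of how much provisioned disk capacity the proposed sizes would free up. It assumes 1.5 TB means 1500 GB and that disk cost scales roughly with provisioned size; no actual prices are used because none are given here.

```python
def reduction_percent(current_gb: float, target_gb: float) -> float:
    """Percentage by which provisioned capacity shrinks when a disk
    goes from `current_gb` to `target_gb`."""
    if current_gb <= 0:
        raise ValueError("current_gb must be positive")
    return 100.0 * (current_gb - target_gb) / current_gb


# 1.5 TB in total, taken as 1500 GB (decimal units are an assumption).
CURRENT_GB = 1500

for target_gb in (30, 50):
    saved = reduction_percent(CURRENT_GB, target_gb)
    print(f"{target_gb} GB target: {saved:.1f}% less provisioned storage")
```

Either end of the 30-50 GB range removes well over 90% of the provisioned capacity, so the exact target matters much less than doing the resize at all.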

bartoszbetka commented 5 years ago

Steps that are already done: 2, part of 3 and 4

cameel commented 5 years ago

Could you be more specific?

golemfactory / concent-deployment

Scale down pod and node redundancy to reduce resource usage on development clusters #377