golemfactory / concent-deployment

Scripts and configuration for Concent deployment

Scale down pod and node redundancy to reduce resource usage on development clusters #377

Open cameel opened 5 years ago

cameel commented 5 years ago

Current clusters are too big for the amount of use they actually get. All of them have multiple nodes and multiple pod instances. The primary motivation was to be able to catch problems caused by stuff running in parallel and pods being rescheduled to different nodes. Unfortunately doing it on every cluster adds up quickly. Now that the development is slowing down, this is not as important and we can cut back on resources to reduce the cost.

Note: It's best to do #376 first.

  1. Reduce the number of pod instances on every development cluster:
    • One concent-api instance with 3 gunicorn workers.
    • One conductor and gatekeeper instance with 2 gunicorn workers.
    • One instance of each pod running a RabbitMQ worker.
  2. Use smaller machines for nodes used in the cluster (less CPU and RAM). If you're already using the smallest ones and they're not fully in use, reduce the number of nodes.
    • See if it's better to run multiple small machines or one bigger. If the difference is significant, choose the cheaper option. If not, prefer multiple machines.
  3. Reduce the sizes of disks attached to the machines.
    • Machines in Kubernetes clusters often have 100GB disks attached. The only pod which stores significant amounts of files is the verifier. All the others would run just fine on machines with smaller disks. Try to attach a larger disk to only one of the nodes and set the verifier's affinity so that it always runs on that node.
  4. Run integration tests and see if the new configuration works well enough. If you see severe performance problems or timeouts, increase the numbers a little to make it work better.
  5. If the final configuration differs from the above, write a comment explaining what you actually did.
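
For steps 1 and 3, a rough Python sketch of the target numbers and of the node affinity fragment the verifier pod would need. This is only an illustration, not the actual deployment configuration: the label `disk-size=large` is a made-up name for whatever label ends up on the node with the bigger disk, and I'm assuming conductor and gatekeeper each get their own instance with 2 workers.

```python
import json

# Hypothetical node label for the one node with the larger disk.
# The real label name is not decided anywhere in this issue.
LARGE_DISK_LABEL = ("disk-size", "large")

# Target instance and gunicorn worker counts from step 1.
# Assumption: conductor and gatekeeper get one instance each, 2 workers each.
TARGET_SCALE = {
    "concent-api": {"replicas": 1, "gunicorn_workers": 3},
    "conductor": {"replicas": 1, "gunicorn_workers": 2},
    "gatekeeper": {"replicas": 1, "gunicorn_workers": 2},
}


def verifier_affinity(label_key, label_value):
    """Build a pod spec `affinity` fragment that only allows scheduling
    on nodes carrying the given label."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": label_key,
                                "operator": "In",
                                "values": [label_value],
                            }
                        ]
                    }
                ]
            }
        }
    }


def total_gunicorn_workers(scale):
    """Sum the gunicorn workers across all instances in the plan."""
    return sum(s["replicas"] * s["gunicorn_workers"] for s in scale.values())


if __name__ == "__main__":
    # Print the fragment as JSON so it can be compared with a manifest.
    print(json.dumps(verifier_affinity(*LARGE_DISK_LABEL), indent=2))
    print("total gunicorn workers:", total_gunicorn_workers(TARGET_SCALE))
```

The "required" form of node affinity keeps the verifier off the small-disk nodes entirely, which also means it stays unscheduled if the labelled node is down. That seems acceptable on a development cluster.
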
cameel commented 5 years ago

@bartoszbetka What's the status of this task? What has been done and what remains to be done?

cameel commented 5 years ago

@bartoszbetka I looked at the billing more closely and here are some more things that we could cut down on:

bartoszbetka commented 5 years ago

Steps that are already done: 2, part of 3 and 4

cameel commented 5 years ago

Could you be more specific?