mozmeao / infra

Mozilla Marketing Engineering and Operations Infrastructure
https://mozilla.github.io/meao/
Mozilla Public License 2.0
59 stars 12 forks

GCP Cost Optimization #1292

Closed: duallain closed this issue 4 years ago

duallain commented 4 years ago

Are we autoscaling nodes? Are they the right composition of memory/cpu? Is the bedrock deployment scaled appropriately?

duallain commented 4 years ago

Bedrock prod is by far the largest consumer of resources in the cluster, so we should likely fit the cluster's CPU/memory ratio to its pods.

Current bedrock prod pod limits are:

```yaml
limits:
  cpu: 1500m
  memory: 1000Mi
```

A snapshot of one node's usage right now (April 30th, ~12:30 pm Pacific):

| CPU used | CPU available | Memory used | Memory available |
|----------|---------------|-------------|------------------|
| 2.54 CPU | 3.92 CPU      | 2.48 GB     | 13.97 GB         |

Our total cluster capacity is 64 vCPUs | 256.00 GB across 16 nodes, i.e. 4 vCPUs and 16 GB per node.

Haswell is currently the default CPU platform in us-central per https://cloud.google.com/compute/docs/regions-zones, and https://cloud.google.com/compute/docs/cpu-platforms lists Haswell at 2.3 GHz (2300 MHz).

Based on all that, a node with 4 vCPUs has 9200 MHz available, and its 16 GB of memory is 16384 MB. So we are allocating nodes with roughly twice as much memory (16384 MB) as CPU cycles (9200 MHz), while our bedrock pod limits request more CPU than memory (1500m vs 1000Mi). To save money we could change node memory from 16 GB to 8 GB and get a more balanced cluster.
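To make the ratio comparison concrete, here is a rough sketch of the arithmetic above (illustrative only, not a tool we run; the constants are the node shape and bedrock limits quoted earlier in this thread):

```python
# Per-node capacity for the current n1-standard-4-ish shape (4 vCPUs, 16 GB).
NODE_CPU_MILLICORES = 4 * 1000        # 4 vCPUs expressed in Kubernetes millicores
NODE_MEM_MIB = 16 * 1024              # 16 GB node -> 16384 MiB

# Bedrock prod pod limits from the YAML snippet above.
POD_CPU_MILLICORES = 1500
POD_MEM_MIB = 1000

# Memory-per-CPU ratio the node offers vs. what a bedrock pod asks for.
node_ratio = NODE_MEM_MIB / NODE_CPU_MILLICORES   # MiB per millicore on the node
pod_ratio = POD_MEM_MIB / POD_CPU_MILLICORES      # MiB per millicore for bedrock

print(f"node:     {node_ratio:.2f} MiB/millicore")   # 4.10
print(f"pod:      {pod_ratio:.2f} MiB/millicore")    # 0.67

# Halving node memory to 8 GB still leaves plenty of memory headroom
# relative to the bedrock pod ratio.
print(f"8GB node: {8 * 1024 / NODE_CPU_MILLICORES:.2f} MiB/millicore")  # 2.05
```

Even at 8 GB per node, the memory-per-CPU ratio stays well above what bedrock pods request, which is why the downsizing looks safe for the dominant workload.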