kubernetes / k8s.io

Code and configuration to manage Kubernetes project infrastructure, including various *.k8s.io sites
https://git.k8s.io/community/sig-k8s-infra
Apache License 2.0

Set up a budget and budget alerts #1375

Closed · spiffxp closed this 1 year ago

spiffxp commented 3 years ago

ref: https://cloud.google.com/billing/docs/how-to/budgets

Currently we review our billing reports at each meeting, which means we'll notice abnormalities within a 14-day window. As our utilization increases, it would be wise for us to use a budget and alerts to catch things sooner.

I tried experimenting with my account and didn't have sufficient privileges. We should start there.

/priority important-longterm
/wg k8s-infra

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

spiffxp commented 3 years ago

/remove-lifecycle stale
/assign @thockin

I'm assigning you to get your input on whether you think this is worth investing time in.

thockin commented 3 years ago

I think it is long-term valuable but not near-term

spiffxp commented 2 years ago

/remove-priority important-longterm
/priority critical-urgent
/milestone v1.23

We discussed at the last meeting that our spend looks likely to put us very near the threshold this year.

It's time to come up with a plan for how to make sure we don't cross it, and how to detect if we are about to. Maybe it's not worth implementing technically with cloud budgets, but we should then at least know what number over what period is a flashing danger sign, and have some kind of framework / guidance for what to do next once we see it.

spiffxp commented 2 years ago

https://github.com/kubernetes/k8s.io/pull/2940 adds a monthly budget for k8s-infra as a whole. We'll get e-mail alerts if we hit 90% ($225K) of the monthly budget (a level we have been crossing continually since August, but with no alerts set up) and 100% (which we crossed once in August, accidentally, due to 5k-node clusters hanging around for too long).
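For readers unfamiliar with how such a budget is wired up: the shape of a GCP budget with those two alert thresholds looks roughly like the Terraform fragment below. This is an illustrative sketch, not the actual contents of PR #2940; the resource name and display name are made up.

```hcl
# Sketch only -- not the actual PR #2940 config.
# A $250K/month budget on the k8s-infra billing account,
# with e-mail alerts fired at 90% ($225K) and 100% ($250K).
resource "google_billing_budget" "k8s_infra_monthly" {
  billing_account = "018801-93540E-22A20E"
  display_name    = "k8s-infra monthly budget"

  amount {
    specified_amount {
      currency_code = "USD"
      units         = "250000"
    }
  }

  threshold_rules {
    threshold_percent = 0.9 # alert at $225K
  }
  threshold_rules {
    threshold_percent = 1.0 # alert at $250K
  }
}
```

By default, threshold alerts go to the billing account administrators; routing them elsewhere would need an additional notification channel.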

spiffxp commented 2 years ago

Our billing report doesn't do a great job of rolling up similar classes of projects, so I plugged the following into BigQuery

select
    sum(cost) as total_cost,
    invoice.month,
    case
        when regexp_contains(project.name, r'k8s-infra-e2e-boskos-[0-9]+') then 'e2e-gce'
        when regexp_contains(project.name, r'k8s-infra-e2e-boskos-gpu-[0-9]+') then 'e2e-gpu'
        when regexp_contains(project.name, r'k8s-infra-e2e-boskos-scale-[0-9]+') then 'e2e-scale'
        when regexp_contains(project.name, r'k8s-staging-.+') then 'staging'
        when project.name = 'k8s-infra-e2e-scale-5k-project' then 'e2e-5k'
        else project.name
    end as project_type
from 
    `kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
    billing_account_id = "018801-93540E-22A20E"
group by
    invoice.month,
    project_type
order by
    invoice.month desc, total_cost desc 
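The CASE bucketing above can be mirrored in plain Python, which is handy for checking the regexes locally before running the query. This is a sketch; the project names in the comments are examples, and ordering matters just as it does in SQL (first match wins).

```python
import re

# Ordered (pattern, bucket) pairs mirroring the SQL CASE above.
# Note the plain boskos pattern requires digits right after "boskos-",
# so it does not swallow the gpu/scale projects listed later.
RULES = [
    (r"k8s-infra-e2e-boskos-[0-9]+", "e2e-gce"),
    (r"k8s-infra-e2e-boskos-gpu-[0-9]+", "e2e-gpu"),
    (r"k8s-infra-e2e-boskos-scale-[0-9]+", "e2e-scale"),
    (r"k8s-staging-.+", "staging"),
]

def project_type(name: str) -> str:
    """Classify a project name the same way the SQL CASE does."""
    for pattern, bucket in RULES:
        if re.search(pattern, name):
            return bucket
    if name == "k8s-infra-e2e-scale-5k-project":
        return "e2e-5k"
    return name  # else: fall through to the raw project name
```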

Then "explored in data studio" to come up with these charts (left is most recent):

Stacked (keep in mind $3M / 12mo = $250K/mo for our budget):
[screenshot: Screen Shot 2021-10-15 at 6 38 19 AM]

Regular:
[screenshot: Screen Shot 2021-10-15 at 6 39 06 AM]

It's pretty clear our artifact hosting costs have been steadily growing. The 5k scale jobs pushed us over the limit in August, but even if we dropped those, we're going to hit our budget within a month if we do nothing about artifact hosting costs.

jhoblitt commented 2 years ago

What bin is egress bandwidth going into? Would it be possible to get the artifacts broken out in terms of size in bytes instead of $/month?

spiffxp commented 2 years ago

> What bin is egress bandwidth going into?

Egress is charged to the project hosting the artifacts being transferred, so regardless of which SKU it's billed against, it all goes against the k8s-artifacts-prod project

From https://datastudio.google.com/c/u/0/reporting/14UWSuqD5ef9E4LnsCD9uJWTPv8MHOA3e/page/bPVn:
[screenshot: Screen Shot 2021-10-18 at 7 57 33 AM]

> Would it be possible to get the artifacts broken out in terms of size in bytes instead of $/months?

select
    sum(cost) as total_cost,
    sku.description as sku,
    sum(usage.amount_in_pricing_units) as amount,
    usage.pricing_unit as pricing_unit,
    invoice.month
from
    `kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
    billing_account_id = "018801-93540E-22A20E"
    and project.name = 'k8s-artifacts-prod'
    and usage.pricing_unit = 'gibibyte'
group by
    invoice.month,
    sku,
    pricing_unit
order by
    invoice.month desc, total_cost desc

The units here are GiB (the query filters on pricing_unit = 'gibibyte'):
[screenshot: Screen Shot 2021-10-18 at 8 22 04 AM]
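As a rough sanity check, a monthly GiB figure from the query above can be turned into a dollar estimate with a blended $/GiB rate. The rate below is purely an illustrative assumption; real GCP egress pricing is tiered by destination and volume.

```python
def estimate_egress_cost(gib_per_month: float, usd_per_gib: float = 0.08) -> float:
    """Rough monthly egress cost estimate.

    usd_per_gib is a blended, illustrative rate (assumed here),
    not GCP's actual tiered pricing.
    """
    return gib_per_month * usd_per_gib

# e.g. 1.5M GiB/month at an assumed blended $0.08/GiB ~= $120K/month
```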

From https://console.cloud.google.com/monitoring/metrics-explorer?pageState=%7B%22xyChart%22:%7B%22dataSets%22:%5B%7B%22timeSeriesFilter%22:%7B%22filter%22:%22metric.type%3D%5C%22storage.googleapis.com%2Fnetwork%2Fsent_bytes_count%5C%22%20resource.type%3D%5C%22gcs_bucket%5C%22%22,%22minAlignmentPeriod%22:%2260s%22,%22aggregations%22:%5B%7B%22perSeriesAligner%22:%22ALIGN_RATE%22,%22crossSeriesReducer%22:%22REDUCE_SUM%22,%22alignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%22resource.label.%5C%22bucket_name%5C%22%22%5D%7D,%7B%22perSeriesAligner%22:%22ALIGN_NONE%22,%22crossSeriesReducer%22:%22REDUCE_NONE%22,%22alignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%5D%7D%5D,%22pickTimeSeriesFilter%22:%7B%22rankingMethod%22:%22METHOD_MAX%22,%22numTimeSeries%22:%225%22,%22direction%22:%22TOP%22%7D%7D,%22targetAxis%22:%22Y1%22,%22plotType%22:%22LINE%22,%22legendTemplate%22:%22$%7Bresource.labels.bucket_name%7D%22%7D%5D,%22options%22:%7B%22mode%22:%22COLOR%22%7D,%22constantLines%22:%5B%5D,%22timeshiftDuration%22:%220s%22,%22y1Axis%22:%7B%22label%22:%22y1Axis%22,%22scale%22:%22LINEAR%22%7D%7D,%22isAutoRefresh%22:true,%22timeSelection%22:%7B%22timeRange%22:%226w%22%7D%7D&project=kubernetes-public

Bytes sent, top 5 by max value over the last 6 weeks (I don't think our cloud monitoring retention goes back further than that):
[screenshot: Screen Shot 2021-10-18 at 8 31 31 AM]

I would defer to @BobyMCbobs and @Riaankl to provide a report on which specific artifacts are how large, and how often they're being transferred. That said, I think this is a problem of volume and not specific artifacts.

spiffxp commented 2 years ago

https://github.com/kubernetes/k8s.io/issues/1834#issuecomment-943836836 is our umbrella issue for reducing artifact hosting costs through mirrors, which would let us mitigate costs from large consumers by having them pull from mirrors located closer to them, or on their own infra. The comment I'm linking posits that if we could use something like Cloud CDN, we could also lower hosting costs regardless of where requests come from.

It is unclear whether this is possible for container images hosted at k8s.gcr.io, which account for the vast majority of bytes transferred; they live in a subdomain of gcr.io that I'm not sure we can take ownership of (i.e. replace the endpoint). My understanding is that it was provided to us internally.

riaankleinhans commented 2 years ago

@jhoblitt we have a report on artifact traffic. The data run from 9 April until September 2021. There are several graphs and tables; here are tables that might answer some of your questions:
[image]

jhoblitt commented 2 years ago

@spiffxp Thanks for doing that extra analysis. I agree that this sounds more like a pure popularity problem rather than bloated artifacts. I'm not sure what a fitted slope works out to, but I'm going to guess that transfers will grow faster than GCP bandwidth prices decrease in the near term, and will eventually exceed the total cost envelope. Has there been any discussion of moving away from gcr.io? I would easily believe it will take > 3 years to shift the majority of pulls over to a k8s project registry.

jhoblitt commented 2 years ago

@Riaankl I was wondering if there were large artifacts that could be put on a diet but nothing is showing up in the top 10.

riaankleinhans commented 2 years ago

With #1834 we aim to get 2-3 redirector POCs up, whereby cloud providers could host local copies of the artifacts and routing is handled by the redirector based on the requesting IP's ASN information. The load is therefore spread across all providers. Ideally the complete set of artifacts would be hosted by the participants. 80% of the traffic is related to <30 images.
[image]
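The ASN-based routing described above could be sketched roughly as follows. This is not the actual redirector implementation; the ASNs (drawn from the private range) and mirror URLs are made-up placeholders.

```python
# Hypothetical ASN -> mirror table; ASNs and URLs are placeholders.
MIRRORS_BY_ASN = {
    64512: "https://mirror.example-cloud-a.invalid/registry",
    64513: "https://mirror.example-cloud-b.invalid/registry",
}
DEFAULT_UPSTREAM = "https://upstream.example.invalid/registry"

def pick_mirror(client_asn: int) -> str:
    """Route a pull to the mirror run by the client's own provider,
    falling back to the default upstream for everyone else."""
    return MIRRORS_BY_ASN.get(client_asn, DEFAULT_UPSTREAM)
```

The payoff is that a hyperscaler's own traffic (the bulk of the volume) stays on its network and off the project's egress bill.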

ameukam commented 2 years ago

/milestone v1.24

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

ameukam commented 2 years ago

/remove-lifecycle stale

ameukam commented 2 years ago

/milestone clear

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

ameukam commented 1 year ago

/remove-lifecycle stale
/lifecycle frozen

BenTheElder commented 1 year ago

What exactly do we see as outstanding here?

spiffxp commented 1 year ago

> spend breakdown: https://datastudio.google.com/c/u/0/reporting/14UWSuqD5ef9E4LnsCD9uJWTPv8MHOA3e1

FWIW I can't access this

> What exactly do we see as outstanding here?

I agree with capping this off as the first pass. I think we'll want to revisit how we track our budget in the new year, and that should probably be a separate issue.

Things you might want to consider before capping this off:

I'll leave it to @ameukam or others to close if you're fine with this as-is.

BenTheElder commented 1 year ago

ACK -- Sorry that link should've been https://datastudio.google.com/c/u/0/reporting/14UWSuqD5ef9E4LnsCD9uJWTPv8MHOA3e/page/tPVn

ameukam commented 1 year ago

I think it's OK to close this. Let's revisit budget tracking for next year in a separate issue. The various attempts to move workloads to other cloud providers will hopefully impact the overall 2023 budget.

/close

k8s-ci-robot commented 1 year ago

@ameukam: Closing this issue.

In response to [this](https://github.com/kubernetes/k8s.io/issues/1375#issuecomment-1312285726):

> I think it's ok to close this. Let revisit budget tracking for next year in a separate issue. The different attempts to move workloads to different cloud providers will hopefully impact overall 2023 budget.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.