mwvgroup / pittgoogle-user-demos

GNU General Public License v3.0

Keep an eye on GCP billing #4

Open troyraen opened 2 years ago

troyraen commented 2 years ago

Billing Report for GCP project: elasticc-challenge (updated about once per day)

We need to keep an eye on how much we're spending on Google Cloud as we test this code and run the actual challenge. We should check the billing reports regularly.

In addition, before running a test, get clear about what you expect to happen: which services will be running, roughly how many alerts will flow through each of them, and approximately what that should cost.
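For example, a back-of-envelope estimate like the one below forces you to write your expectations down before you start. Every rate in it is a placeholder, not a real GCP price -- check the current pricing pages before trusting the numbers:

```python
# Rough pre-test cost estimate. All rates below are PLACEHOLDERS --
# look up current Cloud Functions pricing before relying on this.
n_alerts = 1_400_000              # alerts you expect to push through
seconds_per_alert = 0.5           # expected Function runtime per alert
ghz_allocated = 1.4               # CPU allocated to the Function
price_per_ghz_second = 1e-5       # placeholder rate
price_per_invocation = 4e-7       # placeholder rate

cpu_cost = n_alerts * seconds_per_alert * ghz_allocated * price_per_ghz_second
invocation_cost = n_alerts * price_per_invocation
print(f"expected CPU cost:        ${cpu_cost:,.2f}")
print(f"expected invocation cost: ${invocation_cost:,.2f}")
```

If the actual bill ends up far from a number like this, something about the test did not go the way you thought it would.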

After running tests, check the billing report for anything that's unexpected.

During the actual challenge, we should review the billing regularly to make sure it's not wildly different from what we expect, and to see if there are simple changes we could make to save money.
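If billing export to BigQuery is enabled for this billing account, a small script can pull roughly the same numbers as the linked report. This is only a sketch -- the table name is a placeholder for whatever the export actually created:

```python
from google.cloud import bigquery

# Hypothetical table name -- the real one is created when you enable
# "Billing export to BigQuery" for the billing account.
BILLING_TABLE = "elasticc-challenge.billing.gcp_billing_export_v1_XXXXXX"

client = bigquery.Client(project="elasticc-challenge")
query = f"""
    SELECT
      DATE(usage_start_time) AS day,
      service.description AS service,
      ROUND(SUM(cost), 2) AS cost_usd
    FROM `{BILLING_TABLE}`
    WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY day, service
    ORDER BY day, cost_usd DESC
"""
# Print the last week's spend, broken down by day and service.
for row in client.query(query).result():
    print(f"{row.day}  {row.service:<30}  ${row.cost_usd:>8.2f}")
```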

troyraen commented 2 years ago

Note: I am posting this here because it contains information about specific things that can be expensive in GCP, and thus it may be useful for others in the future.

@hernandezc1 You should go look at the billing report (linked above) now, specifically the charges on Sept 10 -- that is the day you ran a test that ingested 1.4 million alerts from ELAsTiCC's (test) Kafka stream and cost about $150.
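If you'd rather pull those Sept 10 charges programmatically than click through the report, the same (hypothetical) billing export table can be filtered to that day and grouped by SKU -- roughly:

```python
from google.cloud import bigquery

# Placeholder table name; the date assumes the Sept 10, 2022 test.
BILLING_TABLE = "elasticc-challenge.billing.gcp_billing_export_v1_XXXXXX"

client = bigquery.Client(project="elasticc-challenge")
query = f"""
    SELECT
      sku.description AS sku,
      ROUND(SUM(cost), 2) AS cost_usd
    FROM `{BILLING_TABLE}`
    WHERE DATE(usage_start_time) = '2022-09-10'
    GROUP BY sku
    ORDER BY cost_usd DESC
    LIMIT 10
"""
# The most expensive SKUs for that one day, largest first.
for row in client.query(query).result():
    print(f"{row.sku:<50}  ${row.cost_usd:>8.2f}")
```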

Don’t worry about having already done this on the Sept 10 test. One test that cost $150 is not a big deal (ask me about my $300 test). But 1) that test didn’t need to cost nearly that much, and 2) similar mistakes can quickly rack up costs much larger than this. That’s why I want you to look at and understand the details of what happened during that test, and why I elaborate below:

We should have run that test with only the consumer module (VM + Pub/Sub stream). We shouldn’t send that many alerts into the other modules unless we actually need to test how they respond to that kind of load. One of the things I’ve noticed can get expensive fast on Google Cloud is doing something that doesn’t work, many, many times.
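Before a consumer-only load test, it's worth confirming what is actually attached to the alerts topic, because every attached subscription (and therefore every Cloud Function behind it) receives a copy of each message. Here's a minimal sketch using google-cloud-pubsub; the topic name is a guess, not necessarily what this repo uses:

```python
from google.cloud import pubsub_v1

PROJECT = "elasticc-challenge"
TOPIC = "elasticc-alerts"  # hypothetical topic name

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, TOPIC)

# Every subscription listed here gets a copy of each alert, so each
# attached Cloud Function multiplies the cost of a 1.4M-alert test.
print(f"Subscriptions attached to {topic_path}:")
for subscription in publisher.list_topic_subscriptions(request={"topic": topic_path}):
    print(f"  {subscription}")
```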

On Sept 10, we had 4 Cloud Functions running as part of that test. Three of them contained bugs that made them crash on every single alert. The fourth (supernnova) had several problems that I'm guessing resulted from Cloud Functions quotas, which do not allow you to push that much data through in such a short time. The most expensive problem was this: the runtime for the supernnova Function to process a single alert climbed to almost a full minute (because it had to wait for a quota to reset, I assume). You can see that in the “Execution time” metrics for the Function (in the Console). You can see how expensive that runtime was by changing the “Group by” dropdown on the billing report to “SKU” and looking for the SKU called “CPU time” (almost 3/4 of the total cost).
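The same “Execution time” numbers are also available through the Cloud Monitoring API, which can be handy for checking on a test without opening the Console. A rough sketch (the function name matches the module, but treat the details as an assumption):

```python
import time

from google.cloud import monitoring_v3

PROJECT = "elasticc-challenge"
FUNCTION = "supernnova"  # name as deployed in Cloud Functions

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
# Look at the last 24 hours of execution-time data.
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 24 * 3600}, "end_time": {"seconds": now}}
)

# execution_times is a Distribution metric reported in nanoseconds.
results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT}",
        "filter": (
            'metric.type="cloudfunctions.googleapis.com/function/execution_times" '
            f'AND resource.labels.function_name="{FUNCTION}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    for point in series.points:
        mean_s = point.value.distribution_value.mean / 1e9
        print(f"{point.interval.end_time}  mean runtime ~ {mean_s:.1f} s")
```

A mean runtime near 60 seconds per alert, multiplied across 1.4 million alerts, is exactly the kind of number that shows up later as the “CPU time” SKU on the bill.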