Ask Ross about this technical question. This must be solved in some way.
10 MB/s is a ridiculously low limit given everything that must be running on GCP; do none of those use Cloud Functions?
It's also possible this limit is more a security policy, meant to avoid denial-of-resource problems, than a fundamental technical limitation. If so, there must be ways to configure the limits differently.
It might be that we should move these components to Cloud Run at that point. I have been spending a little time learning about that service recently. Yes, I will ask Ross.
On Sept 23, 2021, the incoming alert rate from ZTF spiked to near-LSST rates. I have documented the pipeline's performance here, including many figures. I've copied only the takeaways below:
Are you calculating cost/alert or cost/night?
I have in the past, but I'm not doing so on an ongoing basis.
I look at the GCP Billing reports every so often, but this is not an efficient way to calculate either of those things (partly because a single observing night gets split between 2 billing days).
There is probably a programmatic way to access billing info, but I haven't looked into it.
My question was: what do you mean by "cost"? How do you calculate that quantity?
Oh, I was calculating cost/night by looking at the billing reports.
I think I was not being very careful at the time about accounting for the fact that an observing night gets split between two billing days. This might explain why I quoted two different numbers (10x and >6x). I would have to check again more carefully.
Cost/alert will be a good quantity to calculate on a continuing basis. Estimates are fine.
I'm sure there is a programmatic way to access billing, and that would be a nice addition to our Broker dashboard.
But a semi-manual way is fine for now: one query to get the number of alerts per billing day, and a second query/download to get the cost per billing day.
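For the first query, here is a minimal sketch of what I have in mind; the project, dataset, table, and column names are hypothetical placeholders, not our actual schema:

```python
from google.cloud import bigquery

# Count alerts per (UTC) billing day.
# The table and timestamp column below are placeholders; substitute the
# broker's actual alerts table and its publish-time column.
client = bigquery.Client()
query = """
    SELECT DATE(publish_time) AS billing_day, COUNT(*) AS n_alerts
    FROM `my-project.my_dataset.alerts`
    GROUP BY billing_day
    ORDER BY billing_day
"""
for row in client.query(query).result():
    print(row.billing_day, row.n_alerts)
```

Dividing the per-day cost (from a billing export or a CSV download of the Billing reports) by n_alerts would then give a rough cost/alert for each billing day.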
The Cloud Functions that store to BigQuery (BigQuery and SuperNNova) experience a large number of timeouts (which then get retried) when the incoming rate is high. I assume we are hitting a rate limit for streaming inserts, but I haven't checked.
This was probably due to the fact that we make a get_table request with every streaming insert, and there is a limit of 100 API requests per second per user per method (that limit does not apply to the streaming inserts themselves). The relevant streaming-insert limit is 1 GB per second per project, and our rate shouldn't have been more than about 23 MB/second. I can't check the logs anymore because GCP only retains them for 30 days by default, and we haven't changed the defaults or exported the logs. (BigQuery quotas)
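If the per-insert get_table call turns out to be the problem, one workaround would be to resolve the table once at cold start (outside the function handler) and reuse it. A minimal sketch, with a placeholder table ID and row payload rather than our actual code:

```python
from google.cloud import bigquery

# Create the client and resolve the table once per function instance (at cold
# start) instead of issuing a get_table request inside every invocation.
BQ_CLIENT = bigquery.Client()
TABLE = BQ_CLIENT.get_table("my-project.my_dataset.alerts")  # placeholder table ID

def store_alert(alert_dict):
    """Stream one alert row into BigQuery, reusing the cached table object."""
    errors = BQ_CLIENT.insert_rows(TABLE, [alert_dict])
    if errors:
        raise RuntimeError(f"streaming insert failed: {errors}")
```

Alternatively, Client.insert_rows_json accepts the table ID string directly, which would avoid the get_table request (and its 100-requests-per-second quota) entirely.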
Pipeline performance
See info in the comment below.
Cloud Function quotas
TLDR: projecting ~45 minute delays, a few times each night
See Quotas, especially the section on background functions. Several have the potential to cause problems. Looking at one: "Max throughput limit"...
This means:
But LSST's rate will easily surpass this:
If ZTF is any indication... the instantaneous alert rate will often be higher than this. A quick glance at the recent ZTF dashboard shows publish rates of >2x their equivalent active-night average (5x10^5 alerts / 10 hours) for 2-10 minute bursts a few times per night.
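For a rough sense of scale against the 10 MB/s limit discussed above, here is the arithmetic on those ZTF numbers; the per-alert size is an assumed round number, not a measured value:

```python
# Back-of-envelope throughput for the ZTF rates quoted above, compared with
# the 10 MB/s Cloud Functions throughput quota. Alert size is an assumption.
ALERT_SIZE_MB = 0.07           # assume ~70 KB per serialized alert packet
NIGHT_ALERTS = 5e5             # active-night total from the ZTF dashboard
NIGHT_SECONDS = 10 * 3600      # 10-hour observing night

avg_rate = NIGHT_ALERTS / NIGHT_SECONDS   # ~14 alerts/s on average
burst_rate = 2 * avg_rate                 # the >2x bursts seen a few times per night

print(f"average: {avg_rate * ALERT_SIZE_MB:.1f} MB/s")   # roughly 1 MB/s
print(f"burst:   {burst_rate * ALERT_SIZE_MB:.1f} MB/s") # roughly 2 MB/s
```

Under these assumptions ZTF bursts sit well below the quota, but the headroom is not large once the rate scales up.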
Extrapolating to LSST: