mwvgroup / Pitt-Google-Broker

A Google Cloud-based alert broker for LSST and ZTF
https://pitt-broker.readthedocs.io/en/latest/index.html

Potential issues at LSST scale #96

Open troyraen opened 3 years ago

troyraen commented 3 years ago

Pipeline performance

See info in the comment below.

Cloud Function quotas

TLDR: projecting ~45 minute delays, a few times each night

See Quotas, esp the section on background functions. Several have the potential to cause problems. Looking at one: "Max throughput limit"...

This means:

But LSST's rate will easily surpass this:

If ZTF is any indication... the instantaneous alert rate will often be higher than this. A quick glance at the recent ZTF dashboard shows publish rates of >2x their equivalent active-night average (5x10^5 alerts / 10 hours) for 2-10 minute bursts a few times per night.
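As a rough check on the ZTF figures quoted above, the arithmetic can be sketched as follows. Only the 5x10^5 alerts, the 10 hours, and the 2x burst factor come from the comment; the variable names are illustrative, and the burst value is a lower bound because the comment says ">2x".

```python
# Figures quoted in the comment above.
ALERTS_PER_ACTIVE_NIGHT = 5e5   # alerts per active night (ZTF)
ACTIVE_HOURS = 10               # hours of observing assumed per night
BURST_FACTOR = 2                # bursts are ">2x" the average, so this is a floor

# Average publish rate over an active night, in alerts per second.
avg_rate = ALERTS_PER_ACTIVE_NIGHT / (ACTIVE_HOURS * 3600)

# Lower bound on the publish rate during the 2-10 minute bursts.
burst_rate_floor = BURST_FACTOR * avg_rate

print(f"average: {avg_rate:.1f} alerts/s")              # average: 13.9 alerts/s
print(f"burst floor: {burst_rate_floor:.1f} alerts/s")  # burst floor: 27.8 alerts/s
```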

Extrapolating to LSST:

wmwv commented 3 years ago

Ask Ross about this technical question. This must be solved in some way.

10 MB/s is a ridiculously low limit for lots of things that must be running on GCP; do none of those use Cloud Functions?

It's also possible this limit is more security-policy minded, to avoid denial-of-resource problems, than it is a fundamental technical limitation. If so, there must be ways to configure the limits differently.

troyraen commented 3 years ago

It might be that we should move these components to Cloud Run at that point. I have been spending a little time learning about that service recently. Yes, I will ask Ross.

troyraen commented 3 years ago

Pipeline performance at LSST rates

On Sept 23, 2021 the incoming alert rate from ZTF spiked to near-LSST rates. I have documented the pipeline's performance here, including many figures. I've copied only the takeaways below:

wmwv commented 3 years ago

Are you calculating cost/alert or cost/night?

troyraen commented 3 years ago

I have in the past, but I'm not doing so on an ongoing basis.

I look at the GCP Billing reports every so often, but this is not an efficient way to calculate either of those things (partly because a single observing night gets split between 2 billing days).

There is probably a programmatic way to access billing info, but I haven't looked into it.

wmwv commented 3 years ago

My question was "what do you mean by 'cost'"? How do you calculate that quantity?

troyraen commented 3 years ago

Oh, I was calculating cost/night by looking at the billing reports.

I think I was not being very careful at the time about accounting for the fact that an observing night gets split between two billing days. This might explain why I quoted two different numbers (10x and >6x). I would have to check again more carefully.

wmwv commented 3 years ago

Cost/alert will be a good quantity to calculate on a continuing basis. Estimates are fine.

I'm sure there is a programmatic way to access billing, and that would be a nice addition to our Broker dashboard.

But a semi-manual way is fine for now: One query to get number of alerts / billing day, and then a second query/download to get the cost / billing day.
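The semi-manual approach described above could be combined along these lines. This is a minimal sketch, assuming the two query results have already been loaded into dictionaries keyed by billing day; the function names are hypothetical and the numbers are made-up placeholders, not real billing data. Pooling the two billing days that an observing night spans gives only a coarse estimate, since each billing day also contains part of an adjacent night.

```python
def cost_per_alert(alerts_by_day, cost_by_day):
    """Return {billing_day: cost / alert} for days present in both inputs."""
    result = {}
    for day, n_alerts in alerts_by_day.items():
        # Skip days with no alerts (avoids division by zero) or no billing row.
        if n_alerts > 0 and day in cost_by_day:
            result[day] = cost_by_day[day] / n_alerts
    return result


def cost_per_alert_for_night(alerts_by_day, cost_by_day, days):
    """Pool the billing days that one observing night spans."""
    total_alerts = sum(alerts_by_day.get(d, 0) for d in days)
    total_cost = sum(cost_by_day.get(d, 0.0) for d in days)
    if total_alerts == 0:
        return None
    return total_cost / total_alerts


# Placeholder values for illustration only (NOT real measurements).
alerts = {"2021-09-22": 100_000, "2021-09-23": 400_000}
costs = {"2021-09-22": 5.0, "2021-09-23": 10.0}

per_day = cost_per_alert(alerts, costs)
per_night = cost_per_alert_for_night(alerts, costs, ["2021-09-22", "2021-09-23"])
print(per_day)
print(per_night)
```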

troyraen commented 2 years ago

The Cloud Functions that store to BigQuery (BigQuery and SuperNNova) experience a large number of timeouts (which then get retried) when the incoming rate is high. I assume we are hitting a rate limit for streaming inserts, but I haven't checked.

This was probably due to the fact that we make a get_table request with every streaming insert, and there is a limit of 100 API requests per second per user per method (this limit does not apply to the streaming inserts themselves). The relevant streaming insert limit is 1 GB per second per project, and our rate shouldn't have been more than about 23 MB/second. I can't check the logs anymore because they're only stored for 30 days in GCP by default, and we haven't changed the defaults or exported logs. (BigQuery quotas)
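If the get_table call per insert is indeed the cause, one possible mitigation is to look the table up once per function instance and reuse it. The sketch below is an illustration under that assumption, not the broker's actual code: `TableCache`, `insert_alert_rows`, and `CountingClient` are hypothetical names, and `CountingClient` is a stand-in that mimics only the `get_table` and `insert_rows` methods of `google.cloud.bigquery.Client` so the example runs without GCP credentials.

```python
class TableCache:
    """Cache table lookups so get_table runs once per table per instance."""

    def __init__(self, client):
        self._client = client
        self._tables = {}

    def get(self, table_id):
        # Only the first request for a given table_id reaches the API.
        if table_id not in self._tables:
            self._tables[table_id] = self._client.get_table(table_id)
        return self._tables[table_id]


def insert_alert_rows(client, cache, table_id, rows):
    """Streaming insert that reuses the cached table object."""
    table = cache.get(table_id)
    return client.insert_rows(table, rows)


class CountingClient:
    """Stand-in client (assumption) that counts API calls."""

    def __init__(self):
        self.get_table_calls = 0
        self.insert_calls = 0

    def get_table(self, table_id):
        self.get_table_calls += 1
        return {"table_id": table_id}

    def insert_rows(self, table, rows):
        self.insert_calls += 1
        return []  # an empty list means no row errors


client = CountingClient()
cache = TableCache(client)
for i in range(3):
    insert_alert_rows(client, cache, "dataset.alerts", [{"objectId": i}])

print(client.get_table_calls, client.insert_calls)  # 1 3
```

In a Cloud Function, the cache would need to live in global scope so that it survives across invocations handled by the same instance; each new instance would still make one get_table call per table.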