Potential issues at LSST scale

troyraen commented 3 years ago

Pipeline performance

See info in the comment below.

Cloud Function quotas

TLDR: projecting ~45 minute delays, a few times each night

See Quotas, especially the section on background functions. Several of these quotas have the potential to cause problems. Looking at one, the "Max throughput limit":

10 MB per second: Max throughput of incoming events (Pub/Sub messages), per Cloud Function. This GCP quota cannot be increased.
80 KB per alert: projected size of LSST alert packets

This means:

125 alerts per second: Max processing rate for any Cloud Function listening to our alerts Pub/Sub stream. This includes BigQuery and file storage components plus some of the science filters.

But LSST's rate will easily surpass this:

278 alerts per second: average LSST alert rate on the most active nights (10^7 alerts per night / 10 hours per night).

If ZTF is any indication... the instantaneous alert rate will often be higher than this. A quick glance at the recent ZTF dashboard shows publish rates of >2x their equivalent active-night average (5x10^5 alerts / 10 hours) for 2-10 minute bursts a few times per night.

Extrapolating to LSST:

~45 minutes: time, measured from the start of the burst, until a single Cloud Function finishes processing the last alerts of an equivalent 10 minute burst. (2 x active-night average x 10 minutes / max processing rate)
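The arithmetic above can be reproduced with a short Python sketch. The function and variable names are illustrative only; the inputs are the figures quoted in this comment (10 MB/s quota, 80 KB per alert, 10^7 alerts in a 10 hour night, a 2x burst lasting 10 minutes). It assumes the backlog is worked off at exactly the quota-limited rate and ignores alerts that arrive after the burst, so it is, if anything, optimistic.

```python
def max_alert_rate(throughput_bytes_per_s: float, alert_size_bytes: float) -> float:
    """Alerts per second a single Cloud Function can ingest under the throughput quota."""
    return throughput_bytes_per_s / alert_size_bytes


def average_alert_rate(alerts_per_night: float, night_hours: float) -> float:
    """Average alerts per second over an observing night."""
    return alerts_per_night / (night_hours * 3600)


def burst_processing_minutes(
    burst_factor: float, avg_rate: float, burst_minutes: float, max_rate: float
) -> float:
    """Minutes needed to process every alert published during the burst."""
    burst_alerts = burst_factor * avg_rate * burst_minutes * 60
    return burst_alerts / max_rate / 60


max_rate = max_alert_rate(10e6, 80e3)        # 125 alerts/s
avg_rate = average_alert_rate(1e7, 10)       # about 278 alerts/s
minutes = burst_processing_minutes(2, avg_rate, 10, max_rate)  # about 44.4 minutes

print(f"max rate: {max_rate:.0f} alerts/s")
print(f"LSST active-night average: {avg_rate:.0f} alerts/s")
print(f"time to process a 10 minute burst: {minutes:.1f} minutes")
```

Note that the ~45 minutes is counted from the start of the burst. An alert published at the very end of the 10 minute burst would wait roughly 10 minutes less than that under the same assumptions.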

wmwv commented 3 years ago

Ask Ross about this technical question. This must be solved in some way.

10 MB/s is a ridiculously low limit for lots of things that must be running on GCP. Do none of those use Cloud Functions?

It's also possible this limit is motivated more by security policy, to avoid denial-of-resource problems, than by a fundamental technical limitation. If so, there must be ways to configure the limits differently.

troyraen commented 3 years ago

It might be that we should move these components to Cloud Run at that point. I have been spending a little time learning about that service recently. Yes, I will ask Ross.

troyraen commented 3 years ago

Pipeline performance at LSST rates

On Sept 23, 2021 the incoming alert rate from ZTF spiked to near-LSST rates. I have documented the pipeline's performance here, including many figures. I've copied only the takeaways below:

Our consumer can handle the average alert rate expected from LSST (~17,000 alerts/min). It takes us ~8 minutes to get the ZTF alert backlog into the alerts stream. It's hard to tell how long it takes ZADS to dump these alerts, so I don't really know what our consumer's latency was relative to our actual incoming alert rate.
The BigQuery storage Cloud Function gets backed up; processing times for this component (non-cumulative) spike up to ~15 minutes.
All other Cloud Functions seem to handle the high alert rate pretty well, with SuperNNova a possible exception.
The execution time of an individual Cloud Function instance is higher when there are more simultaneous instances (i.e., incoming alert rate is higher). This seems strange to me. But it does explain the 10x billing increase (lots of instances with long execution times).
The Cloud Functions that store to BigQuery (BigQuery and SuperNNova) experience a large number of timeouts (which then get retried) when the incoming rate is high. I assume we are hitting a rate limit for streaming inserts, but I haven't checked.
The combination of many simultaneous Cloud Function instances and their large execution times results in high costs (>6x normal ZTF).

wmwv commented 3 years ago

Are you calculating cost/alert or cost/night?

troyraen commented 3 years ago

I have in the past, but I'm not doing so on an ongoing basis.

I look at the GCP Billing reports every so often, but this is not an efficient way to calculate either of those things (partly because a single observing night gets split between 2 billing days).

There is probably a programmatic way to access billing info, but I haven't looked into it.

wmwv commented 3 years ago

My question was "what do you mean by 'cost'"? How do you calculate that quantity?

troyraen commented 3 years ago

Oh, I was calculating cost/night by looking at the billing reports.

I think I was not being very careful at the time about accounting for the fact that an observing night gets split between two billing days. This might explain why I quoted two different numbers (10x and >6x). I would have to check again more carefully.

wmwv commented 3 years ago

Cost/alert will be a good quantity to calculate on a continuing basis. Estimates are fine.

I'm sure there is a programmatic way to access billing, and that would be a nice addition to our Broker dashboard.

But a semi-manual way is fine for now: One query to get number of alerts / billing day, and then a second query/download to get the cost / billing day.
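A minimal sketch of that semi-manual calculation, assuming the two query results have already been downloaded and loaded into dictionaries keyed by billing day. The function name and the numbers are placeholders made up for illustration, not measurements from the broker.

```python
from datetime import date


def cost_per_alert(alert_counts: dict, costs: dict) -> dict:
    """Combine per-billing-day alert counts and costs into cost per alert.

    Only days present in both inputs, with a nonzero alert count, are returned.
    """
    result = {}
    for day in sorted(alert_counts.keys() & costs.keys()):
        n_alerts = alert_counts[day]
        if n_alerts > 0:
            result[day] = costs[day] / n_alerts
    return result


# Placeholder inputs (hypothetical values, NOT real billing data).
alert_counts = {date(2000, 1, 1): 100_000, date(2000, 1, 2): 400_000}
costs_usd = {date(2000, 1, 1): 5.0, date(2000, 1, 2): 40.0, date(2000, 1, 3): 1.0}

for day, value in cost_per_alert(alert_counts, costs_usd).items():
    print(f"{day}: ${value:.6f} per alert")
```

Because both inputs are grouped by billing day, the ratio stays self-consistent even though an observing night spans two billing days. A cost/night figure would need the extra step of reassigning alerts and costs to nights.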

troyraen commented 2 years ago

The Cloud Functions that store to BigQuery (BigQuery and SuperNNova) experience a large number of timeouts (which then get retried) when the incoming rate is high. I assume we are hitting a rate limit for streaming inserts, but I haven't checked.

This was probably due to the fact that we make a get_table request with every streaming insert, and there is a limit of 100 API requests per second per user per method (this limit does not apply to the streaming inserts themselves). The relevant streaming insert limit is 1 GB per second per project, and our insert rate shouldn't have been more than about 23 MB/second (~17,000 alerts/min at 80 KB per alert). I can't check the logs anymore because they're only stored for 30 days in GCP by default, and we haven't changed the defaults or exported logs. (BigQuery quotas)

mwvgroup / Pitt-Google-Broker