m-lab / etl-gardener

Gardener provides services for maintaining and reprocessing mlab data.
Apache License 2.0
13 stars 5 forks source link

Interactive jobRateLimitExceeded errors prevent daily jobs from completing #410

Closed stephen-soltesz closed 1 year ago

stephen-soltesz commented 1 year ago

The daily job for 10-02 for tcpinfo failed to process, resulting in alert:

Upon investigation, it appears that there was a transient rate limit exceeded error:

The logs discovered from gardener for 2022-10-02 ndt/tcpinfo job:

{
  "textPayload": "2022/10/03 13:10:13 actions.go:121: 20221002:ndt/tcpinfo Dedup googleapi: Error 400: Job exceeded rate limits: Your project_and_region exceeded quota for concurrent queries. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas, jobRateLimitExceeded",
  "insertId": "gu79gk90i80ti726",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "pod_name": "etl-gardener-universal-c896c94c-nx9f8",
      "namespace_name": "default",
      "project_id": "mlab-oti",
      "cluster_name": "data-processing",
      "location": "us-central1",
      "container_name": "etl-gardener"
    }
  },
  "timestamp": "2022-10-03T13:10:13.153122482Z",
  "severity": "ERROR",
  "labels": {
    "compute.googleapis.com/resource_name": "gke-data-processing-default-c5ccf0b6-xor2",
    "k8s-pod/run": "etl-gardener-universal",
    "k8s-pod/pod-template-hash": "c896c94c"
  },
  "logName": "projects/mlab-oti/logs/stderr",
  "receiveTimestamp": "2022-10-03T13:10:17.580238386Z"
}

According to bigquery quotes queries are limited to 100 concurrent interactive jobs.

Using the mlab-oti.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT metadata table, we can estimate the frequency of jobs per minute to see if there are time periods that exceed this quota.

While, the rate is below 100 for project measurement-lab, there are regularly periods above that (>200) within mlab-oti. Those numbers greater than 100 may be the result of non-overlapping queries within the same minute, a limitation of this technique for estimation, or that the quota is flexibly enforced.

-- For every metadata row about queries between 10-02 and 10-04, count the number of concurrent interactive
-- query jobs in the same minute. i.e. estimate the total number of running queries per minute.

WITH minutes AS (
  SELECT *, TIMESTAMP_DIFF(end_time, start_time, MINUTE) as runtime_minutes
  FROM `mlab-oti.region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
  WHERE creation_time BETWEEN TIMESTAMP("2022-10-02") AND TIMESTAMP("2022-10-04")
  AND job_type = 'QUERY'
 -- AND state != 'RUNNING'
  AND priority = 'INTERACTIVE'
), generate_minutes AS (
  SELECT TIMESTAMP_TRUNC(TIMESTAMP_ADD(start_time, INTERVAL minute_offset MINUTE), MINUTE) as start, 1 as count
  FROM minutes, UNNEST(GENERATE_ARRAY(0, runtime_minutes)) AS minute_offset
)

SELECT start, sum(count) as total
FROM generate_minutes
GROUP BY start
ORDER BY start

Regardless, it appears that gardener operations use INTERACTIVE queries that compete for priority with user-queries. See: https://github.com/m-lab/etl-gardener/blob/689bcda01c5167a435258fd32d37df1c56e18b75/cloud/bq/ops.go#L104-L121

But, it should be possible to set the query priority to q.Priority = bigquery.BatchPriority, as seen in this example:

https://cloud.google.com/bigquery/docs/running-queries#batch

stephen-soltesz commented 1 year ago

It is not known how batch queries will impact gardener. The documentation suggests that only 10 concurrent batch jobs can run, but the number of queued can be higher. The phrasing is a little ambiguous. From: https://cloud.google.com/bigquery/docs/running-queries#batch

Batch queries don't count towards your concurrent rate limit, which can make it easier to start many queries at once. However, your project can run up to 10 concurrent batch queries.

stephen-soltesz commented 1 year ago

https://github.com/m-lab/etl-gardener/releases/tag/prod-v2.16.1