However, the function introduced in the PR is only called in test code. This means that when auto-indexing fails repeatedly for a certain repo, we keep creating jobs anyway, unless someone manually modifies the codeintel_autoindexing_exceptions table. For example, we recently had an incident where a customer had 100K+ auto-indexing jobs and the queue kept growing. This caused high CPU usage:
Our alert was triggered due to consistently high load in Cloud SQL. There is no immediate impact to customers, but:
1.5M unprocessed jobs in the executor queue
several very expensive queries (24s+) (from caller: internal/codeintel/autoindexing/internal/store.) that seem to be halting the Postgres database. This is affecting other components in the deployment, e.g., worker is unable to process any jobs while the database is at peak load.
Sub-parts:
[ ] Add columns for indexer (= Docker image name) and root to codeintel_autoindexing_exceptions, so that one can block repos in a more fine-grained way.
Desired semantics: If either indexer or root is non-null, then instead of skipping auto-inference, continue inferring the jobs, but filter them based on the blocked values.
[ ] Set up a worker which periodically goes through the results of auto-indexing (I think this should involve the lsif_indexes table, but not 100% sure), groups the results by (repo, indexer, root), and adds an entry to the codeintel_autoindexing_exceptions table for groups that fail repeatedly.
IMPORTANT: Think through the case when someone wants to unblock a blocked repo - we don't want to end up in a loop where a manually unblocked repo gets auto-blocked again by the worker.
Q: Should blocking remove queued jobs?
[ ] Update the auto-indexing documentation to include a state transition diagram describing the lifecycle of auto-indexing jobs, including blocking.
[ ] Expose blocked-ness information through the GraphQL API.
[ ] Expose blocked-ness in the site-admin UI when applicable.
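The first two sub-parts could look roughly like the sketch below. It is a minimal in-memory model, not the actual store code: the `Job` and `ExceptionRow` shapes, the `manually_unblocked` flag, the failure threshold, and the function names are all assumptions made for illustration. A row with both indexer and root null blocks the whole repo; a row with either set only filters matching inferred jobs. The `manually_unblocked` flag is one possible way to avoid the re-blocking loop: the worker treats such a row as a marker and does not propose a new exception for anything it covers.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Job:
    """An inferred auto-indexing job (hypothetical shape)."""
    repo: str
    indexer: str  # Docker image name
    root: str


@dataclass(frozen=True)
class ExceptionRow:
    """Hypothetical row shape for codeintel_autoindexing_exceptions."""
    repo: str
    indexer: Optional[str] = None  # None means "any indexer"
    root: Optional[str] = None     # None means "any root"
    manually_unblocked: bool = False  # assumed flag, not part of the issue


def matches(row: ExceptionRow, job: Job) -> bool:
    """True if the row covers the job; null columns act as wildcards."""
    return (
        row.repo == job.repo
        and row.indexer in (None, job.indexer)
        and row.root in (None, job.root)
    )


def filter_jobs(inferred: List[Job], rows: List[ExceptionRow]) -> List[Job]:
    """Drop inferred jobs covered by an active (not manually unblocked) row."""
    active = [r for r in rows if not r.manually_unblocked]
    return [j for j in inferred if not any(matches(r, j) for r in active)]


def propose_exceptions(
    failed: List[Job], rows: List[ExceptionRow], threshold: int = 3
) -> List[ExceptionRow]:
    """Group failures by (repo, indexer, root) and propose new exception rows.

    Groups already covered by any existing row are skipped. That includes
    rows marked manually_unblocked, so the worker never re-blocks them.
    """
    proposed = []
    for job, count in Counter(failed).items():
        if count < threshold:
            continue
        if any(matches(r, job) for r in rows):
            continue
        proposed.append(ExceptionRow(job.repo, job.indexer, job.root))
    return proposed
```

In this model, unblocking a repo means flipping `manually_unblocked` on the row rather than deleting it, so the periodic worker still sees a record of the decision. Whether blocking should also remove already-queued jobs is left open here, as in the question above.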
For context: we have a table, codeintel_autoindexing_exceptions, which can be manually modified to exclude certain repos from auto-indexing.