sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com
Other
10.11k stars 1.29k forks source link

Migrator CONCURRENT INDEX CREATION transaction stalls #60684

Open DaedalusG opened 8 months ago

DaedalusG commented 8 months ago

In the 5.2 release Sourcegraph internal deployments failed to upgrade on the following migration during a standard/rolling upgrade:

CREATE INDEX CONCURRENTLY IF NOT EXISTS
lsif_indexes_dequeue_order_idx ON lsif_indexes
USING btree ((enqueuer_user_id > 0) DESC, queued_at DESC, id)
WHERE (state = 'queued' OR state = 'errored');
failed to run migration for schema "frontend": dirty database: schema "frontend" marked the following migrations as failed: 1691759644
The target schema is marked as dirty and no other migration operation is seen running on this schema. The last migration operation over this schema has failed (or, at least, the migrator instance issuing that migration has died). Please contact support@sourcegraph.com for further assistance.

Generally its observed that queries failing in this way are CREATE INDEX CONCURRENTLY operations

datid            | 16470
datname          | pgsql
pid              | 2506705
usesysid         | 76960253
usename          | src-44ea3b88aac67add@src-ee05956e2546c5a0665c.iam
application_name | worker
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2023-10-05 20:28:42.392971+00
xact_start       | 2023-10-05 22:05:49.433722+00
query_start      | 2023-10-05 22:05:49.442534+00
state_change     | 2023-10-05 22:05:49.442556+00
wait_event_type  | Client
wait_event       | ClientRead
state            | idle in transaction
backend_xid      |
backend_xmin     | 487455070
query            | -- query hash: 68333039                                                                                                                                                                                                                                 +
                 | -- query length: 290 (2 args)                                                                                                                                                                                                                           +
                 | -- caller: internal/codeintel/uploads/internal/background/commitgraph.(*commitGraphUpdater).lockAndUpdateUploadsVisibleToCommits                                                                                                                        +
                 | -- source: internal/codeintel/uploads/internal/background/commitgraph/job_commitgraph.go:81                                                                                                                                                             +
                 |                                                                                                                                                                                                                                                         +
                 | SELECT pg_try_advisory_xact_lock($1, $2)                                                                                                                                                                                                                +
                 |  /*uber-trace-id='174b3e355d598113c43375246b4be244%3Adca81c383e4de391%3A0%3A1',ot-tracer-traceid='c43375246b4be244',ot-tracer-spanid='dca81c383e4de391',ot-tracer-sampled='true',traceparent='00-174b3e355d598113c43375246b4be244-dca81c383e4de391-01'*/
backend_type     | client backend
SELECT *
FROM pg_stat_progress_create_index;
-[ RECORD 1 ]------+--------------------------
pid                | 2527444
datid              | 16470
datname            | pgsql
relid              | 17520
index_relid        | 128267311
command            | CREATE INDEX CONCURRENTLY
phase              | waiting for old snapshots
lockers_total      | 3
lockers_done       | 0
current_locker_pid | 2506705
blocks_total       | 24352
blocks_done        | 24351
tuples_total       | 0
tuples_done        | 0
partitions_total   | 0
partitions_done    | 0

The issue appears to be migrator entering into a state in which a migration takes a table lock and then gets stuck in ctx, not executing an index creation or releasing the lock. The workaround is to manually terminate the transaction holding the lock.

select pid,
       usename,
       pg_blocking_pids(pid) as blocked_by,
       query as blocked_query
from pg_stat_activity
where cardinality(pg_blocking_pids(pid)) > 0;
SELECT pg_terminate_backend(blocked_by_pid);

This issue so far has only been consistently encountered in Sourcegraph deployments with extensive use of scip and auto-indexing features

Expected and Actual behavior:

The database connection shouldn't hang leaving migrator essentially stuck, and subject to potential timeouts. Migrator should have a way of detecting hanging transactions and auto resolving.

If you would like immediate help on this, please email support@sourcegraph.com (you can still create the issue, but there are no SLAs on issues like there are for support requests).

DaedalusG commented 8 months ago

For Srcerers: https://sourcegraph.slack.com/archives/C05DWT4ANHH/p1707875760272809

michaellzc commented 8 months ago

it happened again on Cloud today:

DaedalusG commented 8 months ago

^^ as noted above

  pid   |                      usename                      | blocked_by |                         blocked_query
--------+---------------------------------------------------+------------+----------------------------------------------------------------
 297341 | src-a673939d002d7566@src-69349f75ded403c7eeae.iam | {292817}   | -- query hash: 2664355207                                     +
        |                                                   |            | -- query length: 142 (0 args)                                 +
        |                                                   |            | -- (could not infer source)                                   +
        |                                                   |            | CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_repo_gitlab_topics+
        |                                                   |            | ON repo                                                       +
        |                                                   |            | USING GIN((metadata->'topics'))                               +
        |                                                   |            | WHERE external_service_type = 'gitlab';

CleanShot 2024-02-22 at 15 03 23

Here again we have a long running query blocking the application of the up migrations to be applied during startup

DaedalusG commented 6 months ago

Internal Lin to another occurrence: https://sourcegraph.slack.com/archives/C0411GPJQTB/p1713201791701729

michaellzc commented 6 months ago

Another one (same migration): https://sourcegraph.slack.com/archives/C0411GPJQTB/p1713210415282729