prisma / prisma

Next-generation ORM for Node.js & TypeScript | PostgreSQL, MySQL, MariaDB, SQL Server, SQLite, MongoDB and CockroachDB
https://www.prisma.io
Apache License 2.0
39.66k stars 1.55k forks

Advisory locks not released on failed migrations #12999

Open Runi-c opened 2 years ago

Runi-c commented 2 years ago

Bug description

I'm developing against an Amazon RDS PostgreSQL instance.

I've noticed that every time a migration fails for some expected reason (e.g. I need to edit the migration to handle a required column being added), I'm unable to run subsequent migrations due to advisory lock timeouts. Each attempt opens a new connection, and the connections stay open for an unspecified period (10-15 minutes?). Development can't continue until the locks clear, short of taking some other action such as restarting the database instance.

[screenshot: pg_stat_activity showing a new connection waiting on the advisory lock for each migration attempt]

You can see in the above image that each attempt at migrating opens a new connection waiting on the advisory lock. I was unable to complete the migration to resume development during that period. I made several more attempts just to confirm the issue and demonstrate the problem.

How to reproduce

  1. Set up an Amazon RDS PostgreSQL instance (small ones are free), though this problem should be reproducible in other environments with PostgreSQL as well.
  2. Initialize with some basic schema.
  3. Insert some data.
  4. Make a change to the schema that requires manual intervention (e.g. add required column).
  5. Run `prisma migrate dev` and have it fail.
  6. Run `prisma migrate dev --create-only` as suggested; it should time out waiting on the advisory lock.
  7. All subsequent migrations will fail. The advisory lock is held for an extended period, and each command adds a new open connection waiting on a lock. Development cannot resume until the database server finally drops the connections.
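While stuck in this state, the queued connections from each attempt can be observed directly (a minimal sketch; assumes a user with permission to read `pg_stat_activity`):

```sql
-- List connections currently waiting on an advisory lock;
-- each failed `prisma migrate` attempt should show up here
SELECT pid, application_name, wait_event_type, wait_event, query_start, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
  AND wait_event = 'advisory';
```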

Expected behavior

Prisma Migrate should explicitly release the advisory lock when a migration fails, to prevent this interruption to the workflow.
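For reference, PostgreSQL session-level advisory locks can be released explicitly. A sketch of what such a cleanup could look like, using the lock key Prisma is known to use (see the engine source linked later in this thread); note that session-level locks are also released automatically when the holding session disconnects, which is why the problem only persists while stale connections stay open:

```sql
-- Release the advisory lock Prisma migrate holds on this session
SELECT pg_advisory_unlock(72707369);

-- Or release every session-level advisory lock held by this session
SELECT pg_advisory_unlock_all();
```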

Prisma information

N/A

Environment & setup

Prisma Version

prisma                  : 3.12.0
@prisma/client          : 3.12.0
Current platform        : windows
Query Engine (Node-API) : libquery-engine 22b822189f46ef0dc5c5b503368d1bee01213980 (at node_modules\@prisma\engines\query_engine-windows.dll.node)
Migration Engine        : migration-engine-cli 22b822189f46ef0dc5c5b503368d1bee01213980 (at node_modules\@prisma\engines\migration-engine-windows.exe)
Introspection Engine    : introspection-core 22b822189f46ef0dc5c5b503368d1bee01213980 (at node_modules\@prisma\engines\introspection-engine-windows.exe)
Format Binary           : prisma-fmt 22b822189f46ef0dc5c5b503368d1bee01213980 (at node_modules\@prisma\engines\prisma-fmt-windows.exe)
Default Engines Hash    : 22b822189f46ef0dc5c5b503368d1bee01213980
Studio                  : 0.459.0
shawnjones253 commented 1 year ago

this happened to me when a task running a long migration was killed by ECS for failing health checks -- any new deployment then timed out trying to acquire the Postgres advisory lock: `Timed out trying to acquire a postgres advisory lock (SELECT pg_advisory_lock(72707369)). Elapsed: 10000ms. See https://pris.ly/d/migrate-advisory-locking for details.`

Jolg42 commented 1 year ago

@shawnjones253 and others:

Our docs on the advisory locking mechanism (this also applies to `migrate dev`):

> This safeguard ensures that multiple commands cannot run at the same time - for example, if you merge two pull requests in quick succession. Advisory locking has a 10 second timeout (not configurable), and uses the default advisory locking mechanism available in the underlying provider.

https://www.prisma.io/docs/concepts/components/prisma-migrate/migrate-development-production#advisory-locking

Usually a timeout happens because you already have migrate running on that database, and you started migrate again on the same database. To protect your database from concurrent execution, we use a lock.
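The blocking behavior is easy to reproduce locally with two psql sessions (a sketch using the same lock key):

```sql
-- Session 1: take the lock, as `prisma migrate` does
SELECT pg_advisory_lock(72707369);

-- Session 2: this call now blocks until session 1 releases the lock
-- or disconnects; `prisma migrate` gives up after its 10 second timeout
SELECT pg_advisory_lock(72707369);

-- Session 1: release, unblocking session 2
SELECT pg_advisory_unlock(72707369);
```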

Jolg42 commented 1 year ago

Some recent related work on advisory locking might help, but it comes with a warning: in https://github.com/prisma/prisma-engines/pull/4171 I added a new environment variable to disable advisory locking, which was needed for people using a Percona XtraDB Cluster or MariaDB Galera Cluster.

⚠️ In this case here, only use this environment variable if you are sure that you won't start migrate concurrently on your database. You must ensure that only 1 instance of migrate can run at any time when using migrate dev/deploy/resolve.

It can be used like (docs PR):

PRISMA_SCHEMA_DISABLE_ADVISORY_LOCK=1

Note: This is included in 5.3.0-dev.13 if you want to try it out. It will be released in the official 5.3.0 version on Tuesday, September 12th.

shawnjones253 commented 1 year ago

@Jolg42

I believe I understand better what was happening now:

We run migrate when our container starts, like: `npm migrate && npm serve`.

Because the first container to try the migration was killed for taking too long to respond to health checks, I assumed the DB calls started by the migration were also aborted and that subsequent containers would be able to attempt the migration. Upon further investigation I noticed the statement was still running in Postgres by querying pg_stat_activity. So the session from the first container was still alive, and that's why the advisory lock hadn't been released? It had been running for about 30m when I killed it, which released the lock and aborted the big update statement.
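A sketch of how to distinguish the session that actually holds the lock from the ones queued behind it (`granted` is true only for the holder):

```sql
-- Find the session that HOLDS the Prisma advisory lock,
-- as opposed to the sessions waiting on it
SELECT psa.pid, psa.state, psa.query_start, psa.query
FROM pg_locks pl
JOIN pg_stat_activity psa ON pl.pid = psa.pid
WHERE pl.locktype = 'advisory'
  AND pl.objid = 72707369
  AND pl.granted;
```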

malkhuzayyim commented 1 year ago

@Roryl-c is restarting the RDS instance the only way to unlock it again after a failed migration? That'd be quite frustrating, especially in a production CI workflow, if a migration fails for whatever reason.

Did you resort to disabling advisory locking going forward?

peterHakio commented 9 months ago

Same problem here. What was your solution?

brookmg commented 5 months ago

I am facing a similar issue on a managed instance from DigitalOcean. Is there any way to fix this without restarting the database, since it's a production instance?

Jolg42 commented 5 months ago

So if you want to have more data on the lock, you could try this query:

SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa
    ON pl.pid = psa.pid WHERE objid = 72707369;
Example from a local db for me (after manually running `SELECT pg_advisory_lock(72707369);`), abridged to the interesting columns:

locktype | objid    | mode          | granted | pid   | application_name | client_addr  | state  | backend_type
advisory | 72707369 | ExclusiveLock | t       | 13716 | TablePlus        | 192.168.65.1 | active | client backend

Docs: https://www.postgresql.org/docs/9.4/view-pg-locks.html

You can see in the code that 72707369 is what we use here https://github.com/prisma/prisma-engines/blob/9ee8c029573a3ce47bd5f103da9feaf8b6f62c53/schema-engine/connectors/sql-schema-connector/src/flavour/postgres.rs#L121-L126

-- This would get a lock (interesting to try out locally)
-- SELECT pg_advisory_lock(72707369);

-- Then to release the lock, see example from
-- https://stackoverflow.com/a/55197736/1345244

-- Get the pid
SELECT pid FROM pg_locks WHERE objid = 72707369;

-- Cancel (takes a while)
SELECT pg_cancel_backend(__pid_from_previous_query__);

-- Terminate
SELECT pg_terminate_backend(__pid_from_previous_query__);
Lipemenezes commented 2 months ago

This has been happening with the DigitalOcean App Platform since February 2024. When new deploys come in, the application sometimes gets stuck after failing to acquire an advisory lock within the 10s timeout.

I suspect it's due to multiple instances of the application trying to acquire a lock concurrently, but I'm not quite sure how to confirm that.

I've been working around it by removing the locks and fixing the migrations, so I don't have to restart the database.

When that happens, what I have to do in order to fix it is this:

First, I check if there are advisory locks stuck:

SELECT *
FROM pg_stat_activity
WHERE query ILIKE '%72707369%'
  AND state = 'active'
  AND wait_event_type = 'Lock'
  AND wait_event = 'advisory';

If there are any, I kill all prisma advisory locks:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE query ILIKE '%72707369%'
  AND state = 'active'
  AND wait_event_type = 'Lock'
  AND wait_event = 'advisory';

Then I look at the `_prisma_migrations` table to see which migration is marked as failed. In this step, I need to manually check whether the migration was actually executed, because I've seen occasions where the migration succeeded but was marked as failed in the database. For instance, imagine the migration is `202408270000_add_column_y` and its content is:

ALTER TABLE table_x ADD COLUMN "column_y" VARCHAR;

At that point, I'll run `select column_y from table_x` to confirm whether the column exists. If it doesn't, I'll delete the row for migration `202408270000_add_column_y` from `_prisma_migrations`.
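The manual check and cleanup described above can be sketched as follows (`table_x`/`column_y` are the example names from this comment; `migration_name` is the relevant column in `_prisma_migrations`):

```sql
-- Did the failed migration actually apply? Check the catalog instead
-- of relying only on the state recorded in _prisma_migrations.
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'table_x'
  AND column_name = 'column_y';

-- If the column does NOT exist, remove the failed record so the
-- migration can be retried
DELETE FROM _prisma_migrations
WHERE migration_name = '202408270000_add_column_y';
```

Alternatively, `prisma migrate resolve --rolled-back <name>` is the CLI's supported way to mark a failed migration as rolled back so it can be reapplied.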

The last step is to make sure the migration code is not still running. In this case, I'll check for any running queries doing an ALTER TABLE, because that's what the migration was doing.

SELECT *
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event_type = 'Lock'
  AND query ILIKE '%alter%';

If I find the migration code stuck there, I will kill that query.

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event_type = 'Lock'
  AND query ILIKE '%alter%';

After that, you should be able to run the migrations again.

I think that if you remove concurrency from the equation it should work fine, so maybe try moving the migration step into your CI/CD pipeline, as the docs recommend.