Runi-c opened this issue 2 years ago
This happened to me when a task running a long migration was killed by ECS for failing health checks -- any new deployments then timed out trying to acquire a postgres advisory lock (SELECT pg_advisory_lock(72707369)). Elapsed: 10000ms. See https://pris.ly/d/migrate-advisory-locking for details.
@shawnjones253 and others: did you have another migrate running on that database at that moment?
Our docs on our advisory locking mechanism (also applies to migrate dev):
This safeguard ensures that multiple commands cannot run at the same time - for example, if you merge two pull requests in quick succession. Advisory locking has a 10 second timeout (not configurable), and uses the default advisory locking mechanism available in the underlying provider https://www.prisma.io/docs/concepts/components/prisma-migrate/migrate-development-production#advisory-locking
Usually a timeout happens because you already have migrate running on that database, and you started migrate again on the same database. To protect your database from concurrent execution, we use a lock.
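To see the mechanism in action, you can reproduce the blocking behaviour by hand in two separate sessions (a sketch for a local, throwaway database; Prisma normally issues the pg_advisory_lock call itself):
-- Session 1: take the advisory lock Prisma uses
SELECT pg_advisory_lock(72707369);
-- Session 2: this call now blocks; Prisma would give up after its 10 second timeout
SELECT pg_advisory_lock(72707369);
-- Session 1: release the lock so session 2 can acquire it
SELECT pg_advisory_unlock(72707369);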
Related recent work on advisory locking that might help, but comes with a warning: in https://github.com/prisma/prisma-engines/pull/4171 I added a new environment variable to disable advisory locking, which was needed for people using a Percona XtraDB Cluster or MariaDB Galera Cluster.
⚠️ In this case, only use this environment variable if you are sure that you won't start migrate concurrently on your database. You must ensure that only one instance of migrate can run at any time when using migrate dev/deploy/resolve.
It can be used like this (docs PR):
PRISMA_SCHEMA_DISABLE_ADVISORY_LOCK=1
Note: This is included in 5.3.0-dev.13 if you want to try it out. It will be released in the official 5.3.0 version on Tuesday, September 12th.
@Jolg42
I believe I understand better what was happening now:
We run migrate when our container starts, like: npm migrate && npm serve.
Because the first container to try the migration was killed for taking too long to respond to health checks, I assumed the db calls started by the migration were also aborted and that subsequent containers should have been able to attempt to run the migration. Upon further investigation I noticed the statement was still running in postgres by querying pg_stat_activity. So I think the session from the first container was still alive, and that's why the advisory lock hadn't been released? It had been running for about 30 minutes when I killed it; that released the lock and aborted the big update statement.
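For reference, the kind of check described above can be done with a query along these lines (a sketch; it simply lists active statements and how long they have been running):
-- List active backends and their current statement runtime,
-- to spot an old migration session that is still holding the lock
SELECT pid, state, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start;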
@Roryl-c, is restarting the RDS instance the only way to unlock it again after a failed migration? That'd be quite frustrating, especially in a production CI workflow, if any migration fails for whatever reason.
Did you resort to disabling advisory locking going forward?
Same problem here, what has been your solution?
I am facing a similar issue on a managed instance from DigitalOcean. Is there any way to fix this without restarting the database, since it's a production instance?
So if you want to have more data on the lock, you could try this query:
SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa
ON pl.pid = psa.pid WHERE objid = 72707369;
Example from a local db for me (after manually running SELECT pg_advisory_lock(72707369);), trimmed to the most relevant columns (null columns omitted):
locktype | objid | pid | mode | granted | usename | application_name | state | backend_type
---|---|---|---|---|---|---|---|---
advisory | 72707369 | 13716 | ExclusiveLock | t | prisma | TablePlus | active | client backend
Docs: https://www.postgresql.org/docs/9.4/view-pg-locks.html
You can see in the code that 72707369 is what we use here: https://github.com/prisma/prisma-engines/blob/9ee8c029573a3ce47bd5f103da9feaf8b6f62c53/schema-engine/connectors/sql-schema-connector/src/flavour/postgres.rs#L121-L126
-- This would get a lock (interesting to try out locally)
-- SELECT pg_advisory_lock(72707369);
-- Then to release the lock, see example from
-- https://stackoverflow.com/a/55197736/1345244
-- Get the pid
SELECT pid FROM pg_locks WHERE objid = 72707369;
-- Cancel (takes a while)
SELECT pg_cancel_backend(__pid_from_previous_query__);
-- Terminate
SELECT pg_terminate_backend(__pid_from_previous_query__);
This has been happening with the DigitalOcean App Platform since February 2024. Randomly, when new deploys come in, it sometimes locks the application by failing to acquire an advisory lock after 10s.
I suspect it's due to multiple instances of the application trying to acquire a lock concurrently, but I'm not quite sure how to confirm that.
I've been working around it by removing the locks and fixing the migrations, so I don't have to restart the database.
When that happens, what I have to do in order to fix it is this:
First, I check if there are advisory locks stuck:
SELECT
*
FROM pg_stat_activity
where
query ilike '%72707369%'
and state = 'active'
and wait_event_type = 'Lock'
and wait_event = 'advisory';
If there are any, I kill all prisma advisory locks:
SELECT
pg_terminate_backend(pid)
FROM pg_stat_activity
where
query ilike '%72707369%'
and state = 'active'
and wait_event_type = 'Lock'
and wait_event = 'advisory';
Then I look at the _prisma_migrations table to see which migration is marked as failed.
In this step, I need to manually check if the migration was executed or not - because I've seen occasions where the migration was successful but marked as failed in the database.
For instance, imagine this is the migration:
202408270000_add_column_y
and the content is
ALTER TABLE table_x ADD COLUMN "column_y" VARCHAR;
At that point, I'll try to run select column_y from table_x to confirm whether the column exists or not.
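An equivalent check against the catalog (a sketch, using the hypothetical table_x / column_y names from the example above):
-- Returns a row only if the column added by the example migration exists
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'table_x'
  AND column_name = 'column_y';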
If the column doesn't exist, I'll delete the row for the migration 202408270000_add_column_y from _prisma_migrations.
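For reference, the lookup and cleanup described above might look roughly like this (a sketch, assuming the standard _prisma_migrations columns such as migration_name, started_at, finished_at and logs):
-- List migrations that never finished (failed migrations have no finished_at)
SELECT migration_name, started_at, finished_at, logs
FROM _prisma_migrations
WHERE finished_at IS NULL
ORDER BY started_at DESC;

-- If the migration did NOT actually apply, remove its failed record so it can be retried
DELETE FROM _prisma_migrations
WHERE migration_name = '202408270000_add_column_y'
  AND finished_at IS NULL;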
The last step is to make sure that the migration code is not stuck running. In this case, I'll check if there are any queries running that are doing an ALTER TABLE, because that is what the migration was doing.
SELECT
*
FROM pg_stat_activity
where
state = 'active'
and wait_event_type = 'Lock'
and query like '%alter%';
If I find the migration code stuck there, I will kill that query.
SELECT
pg_terminate_backend(pid)
FROM pg_stat_activity
where
state = 'active'
and wait_event_type = 'Lock'
and query like '%alter%';
After that, you should be able to try to run the migrations again.
I think that if you remove concurrency from the equation it should work fine, so maybe try to move that to your CI/CD pipeline, as recommended in the docs.
Bug description
I'm developing against an Amazon RDS PostgreSQL instance.
I've noticed that every time I try a migration that fails for some expected reason (e.g. I need to edit the migration to handle a required column being added or something), I'm unable to run subsequent migrations due to advisory lock timeouts. Each attempt opens a new connection, and the connections stay open for an unspecified time period (10-15 minutes?). Development can't continue properly for this time period without taking some other action such as restarting the database instance to resolve the issue.
You can see in the above image that each attempt at migrating opens a new connection waiting on the advisory lock. I was unable to complete the migration to resume development during that period. I made several more attempts just to confirm the issue and demonstrate the problem.
How to reproduce
1. Run prisma migrate dev, have it fail.
2. Run prisma migrate dev --create-only as suggested; it should time out waiting on the advisory lock.
Expected behavior
Prisma Migrate should explicitly drop the locks to prevent this interruption to the workflow.
Prisma information
N/A
Environment & setup
Prisma Version