Closed joshbwlng closed 1 year ago
JF RDS logs for 2021-05-29: postgres_2021-05-29.zip
As a follow up here, we found that import event actions would be triggered, then run async when we receive a webhook and create the corresponding external event. This effectively means there is no throttling on webhooks. A PR to address this issue has been opened on the worker module here https://github.com/product-os/jellyfish-worker/pull/825
Ran some local tests to try and rule out the idea of the jellyfish-core
v3 bump as being a possible cause. Tests admittedly are not exhaustive, but no obvious issues were found.
jellyfish=# select count(*) from pg_locks;
count
-------
1024
(1 row)
jellyfish=# select count(*) from pg_locks;
count
-------
874
(1 row)
jellyfish=# SELECT
COALESCE(blockingl.relation::regclass::text,blockingl.locktype) as locked_item,
now() - blockeda.query_start AS waiting_duration, blockeda.pid AS blocked_pid,
blockeda.query as blocked_query, blockedl.mode as blocked_mode,
blockinga.pid AS blocking_pid, blockinga.query as blocking_query,
blockingl.mode as blocking_mode
FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(
( (blockingl.transactionid=blockedl.transactionid) OR
(blockingl.relation=blockedl.relation AND blockingl.locktype=blockedl.locktype)
) AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
AND blockinga.datid = blockeda.datid
WHERE NOT blockedl.granted
AND blockinga.datname = current_database();
locked_item | waiting_duration | blocked_pid | blocked_query | blocked_mode | blocking_pid | blocking_query | blocking_mode
-------------+------------------+-------------+---------------+--------------+--------------+----------------+---------------
(0 rows)
jellyfish=# select count(*) from pg_locks;
count
-------
4
(1 row)
jellyfish=# select count(*) from pg_locks;
count
-------
148
(1 row)
jellyfish=# select count(*) from pg_locks;
count
-------
2
(1 row)
jellyfish=# SELECT
COALESCE(blockingl.relation::regclass::text,blockingl.locktype) as locked_item,
now() - blockeda.query_start AS waiting_duration, blockeda.pid AS blocked_pid,
blockeda.query as blocked_query, blockedl.mode as blocked_mode,
blockinga.pid AS blocking_pid, blockinga.query as blocking_query,
blockingl.mode as blocking_mode
FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(
( (blockingl.transactionid=blockedl.transactionid) OR
(blockingl.relation=blockedl.relation AND blockingl.locktype=blockedl.locktype)
) AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
AND blockinga.datid = blockeda.datid
WHERE NOT blockedl.granted
AND blockinga.datname = current_database();
locked_item | waiting_duration | blocked_pid | blocked_query | blocked_mode | blocking_pid | blocking_query | blocking_mode
-------------+------------------+-------------+---------------+--------------+--------------+----------------+---------------
(0 rows)
jellyfish=# select count(*) from pg_locks;
count
-------
733
(1 row)
jellyfish=# select count(*) from pg_locks;
count
-------
732
(1 row)
jellyfish=# select count(*) from pg_locks;
count
-------
1024
(1 row)
jellyfish=# SELECT
COALESCE(blockingl.relation::regclass::text,blockingl.locktype) as locked_item,
now() - blockeda.query_start AS waiting_duration, blockeda.pid AS blocked_pid,
blockeda.query as blocked_query, blockedl.mode as blocked_mode,
blockinga.pid AS blocking_pid, blockinga.query as blocking_query,
blockingl.mode as blocking_mode
FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(
( (blockingl.transactionid=blockedl.transactionid) OR
(blockingl.relation=blockedl.relation AND blockingl.locktype=blockedl.locktype)
) AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
AND blockinga.datid = blockeda.datid
WHERE NOT blockedl.granted
AND blockinga.datname = current_database();
locked_item | waiting_duration | blocked_pid | blocked_query | blocked_mode | blocking_pid | blocking_query | blocking_mode
-------------+------------------+-------------+---------------+--------------+--------------+----------------+---------------
(0 rows)
jellyfish=# select count(*) from pg_locks;
count
-------
1028
(1 row)
jellyfish=# select count(*) from pg_locks;
count
-------
1170
(1 row)
jellyfish=# select count(*) from pg_locks;
count
-------
1028
(1 row)
jellyfish=# SELECT
COALESCE(blockingl.relation::regclass::text,blockingl.locktype) as locked_item,
now() - blockeda.query_start AS waiting_duration, blockeda.pid AS blocked_pid,
blockeda.query as blocked_query, blockedl.mode as blocked_mode,
blockinga.pid AS blocking_pid, blockinga.query as blocking_query,
blockingl.mode as blocking_mode
FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(
( (blockingl.transactionid=blockedl.transactionid) OR
(blockingl.relation=blockedl.relation AND blockingl.locktype=blockedl.locktype)
) AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
AND blockinga.datid = blockeda.datid
WHERE NOT blockedl.granted
AND blockinga.datname = current_database();
locked_item | waiting_duration | blocked_pid | blocked_query | blocked_mode | blocking_pid | blocking_query | blocking_mode
-------------+------------------+-------------+---------------+--------------+--------------+----------------+---------------
(0 rows)
jellyfish=# select count(*) from pg_locks;
count
-------
1027
(1 row)
jellyfish=# select count(*) from pg_locks;
count
-------
1091
(1 row)
jellyfish=# SELECT
COALESCE(blockingl.relation::regclass::text,blockingl.locktype) as locked_item,
now() - blockeda.query_start AS waiting_duration, blockeda.pid AS blocked_pid,
blockeda.query as blocked_query, blockedl.mode as blocked_mode,
blockinga.pid AS blocking_pid, blockinga.query as blocking_query,
blockingl.mode as blocking_mode
FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(
( (blockingl.transactionid=blockedl.transactionid) OR
(blockingl.relation=blockedl.relation AND blockingl.locktype=blockedl.locktype)
) AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
AND blockinga.datid = blockeda.datid
WHERE NOT blockedl.granted
AND blockinga.datname = current_database();
locked_item | waiting_duration | blocked_pid | blocked_query | blocked_mode | blocking_pid | blocking_query | blocking_mode
-------------+------------------+-------------+---------------+--------------+--------------+----------------+---------------
(0 rows)
jellyfish=# select count(*) from pg_locks;
count
-------
2
(1 row)
jellyfish=# select count(*) from pg_locks;
count
-------
143
(1 row)
jellyfish=# select count(*) from pg_locks;
count
-------
143
(1 row)
jellyfish=# select count(*) from pg_locks;
count
-------
2
(1 row)
jellyfish=# select count(*) from pg_locks;
count
-------
1020
(1 row)
jellyfish=# select count(*) from pg_locks;
count
-------
878
(1 row)
jellyfish=# select count(*) from pg_locks;
count
-------
878
(1 row)
jellyfish=# SELECT
COALESCE(blockingl.relation::regclass::text,blockingl.locktype) as locked_item,
now() - blockeda.query_start AS waiting_duration, blockeda.pid AS blocked_pid,
blockeda.query as blocked_query, blockedl.mode as blocked_mode,
blockinga.pid AS blocking_pid, blockinga.query as blocking_query,
blockingl.mode as blocking_mode
FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(
( (blockingl.transactionid=blockedl.transactionid) OR
(blockingl.relation=blockedl.relation AND blockingl.locktype=blockedl.locktype)
) AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
AND blockinga.datid = blockeda.datid
WHERE NOT blockedl.granted
AND blockinga.datname = current_database();
locked_item | waiting_duration | blocked_pid | blocked_query | blocked_mode | blocking_pid | blocking_query | blocking_mode
-------------+------------------+-------------+---------------+--------------+--------------+----------------+---------------
(0 rows)
jellyfish=# select count(*) from pg_locks;
count
-------
586
(1 row)
jellyfish=# select count(*) from pg_locks;
count
-------
1024
(1 row)
jellyfish=# select count(*) from pg_locks;
count
-------
797
(1 row)
jellyfish=# SELECT
COALESCE(blockingl.relation::regclass::text,blockingl.locktype) as locked_item,
now() - blockeda.query_start AS waiting_duration, blockeda.pid AS blocked_pid,
blockeda.query as blocked_query, blockedl.mode as blocked_mode,
blockinga.pid AS blocking_pid, blockinga.query as blocking_query,
blockingl.mode as blocking_mode
FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(
( (blockingl.transactionid=blockedl.transactionid) OR
(blockingl.relation=blockedl.relation AND blockingl.locktype=blockedl.locktype)
) AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
AND blockinga.datid = blockeda.datid
WHERE NOT blockedl.granted
AND blockinga.datname = current_database();
locked_item | waiting_duration | blocked_pid | blocked_query | blocked_mode | blocking_pid | blocking_query | blocking_mode
-------------+------------------+-------------+---------------+--------------+--------------+----------------+---------------
(0 rows)
By flooding a local JF instance with multiple simultaneous card create requests (multiple user sessions) and update requests (update same card with multiple user sessions), I was able to get >4000 AccessShareLock
locks, but the system never crashed or timed out.
jellyfish=# select count(*) from pg_locks where mode='AccessShareLock'; select count(*) from pg_locks where mode='ExclusiveLock';
count
-------
4351
(1 row)
count
-------
31
(1 row)
jellyfish=# select count(*) from pg_locks where mode='AccessShareLock'; select count(*) from pg_locks where mode='ExclusiveLock';
count
-------
4206
(1 row)
count
-------
32
(1 row)
Flowdock thread: https://www.flowdock.com/app/rulemotion/resin-devops/threads/KL32MOpuH6cCQLFvSeEvfc3EC44 Jellyfish thread: https://jel.ly.fish/64dac13e-2dc6-4676-ac95-96018f516c71
34.8.0
to34.7.20
in case the upgrade tojellyfish-core
v3 was related. This is unlikely as this PR was merged roughly 16 hours before the outage occurred, have since upgraded back and have not experienced any issues.action-server
instances after downgrading to34.7.20
to allow the system to work through the queue without overloading the database.action-server
instances back to 12.01:30:00
and01:45:00
. It seems that the vast majority of these requests were triggered by events in Jellyfish repositories by renovate, balena-ci, and versionbot bots: https://docs.google.com/spreadsheets/d/1XPZQeyJ-McU5f8EOsWkngNAyJyAUxhlsekvX_kYoTFI/edit?usp=sharingexternal-event
SQL SELECT output used in the sheet above: output.zipNext steps