product-os / jellyfish

The Jellyfish Project
https://jel.ly.fish/
GNU Affero General Public License v3.0
15 stars 3 forks source link

2021-05-29 Outage Notes #6534

Closed joshbwlng closed 1 year ago

joshbwlng commented 3 years ago

Flowdock thread: https://www.flowdock.com/app/rulemotion/resin-devops/threads/KL32MOpuH6cCQLFvSeEvfc3EC44 Jellyfish thread: https://jel.ly.fish/64dac13e-2dc6-4676-ac95-96018f516c71

Next steps

joshbwlng commented 3 years ago

pg-stat-activity.zip

joshbwlng commented 3 years ago

graphile-worker-jobs.zip

joshbwlng commented 3 years ago

JF RDS logs for 2021-05-29: postgres_2021-05-29.zip

LucianBuzzo commented 3 years ago

As a follow up here, we found that import event actions would be triggered, then run async when we receive a webhook and create the corresponding external event. This effectively means there is no throttling on webhooks. A PR to address this issue has been opened on the worker module here https://github.com/product-os/jellyfish-worker/pull/825

joshbwlng commented 3 years ago

Ran some local tests to try and rule out the idea of the jellyfish-core v3 bump as being a possible cause. Tests admittedly are not exhaustive, but no obvious issues were found.

Version 34.10.13 (post core v3 bump)

Create and update cards on loop

jellyfish=# select count(*) from pg_locks;
 count
-------
  1024
(1 row)

jellyfish=# select count(*) from pg_locks;
 count
-------
   874
(1 row)

jellyfish=# SELECT
  COALESCE(blockingl.relation::regclass::text,blockingl.locktype) as locked_item,
  now() - blockeda.query_start AS waiting_duration, blockeda.pid AS blocked_pid,
  blockeda.query as blocked_query, blockedl.mode as blocked_mode,
  blockinga.pid AS blocking_pid, blockinga.query as blocking_query,
  blockingl.mode as blocking_mode
FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(
  ( (blockingl.transactionid=blockedl.transactionid) OR
  (blockingl.relation=blockedl.relation AND blockingl.locktype=blockedl.locktype)
  ) AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
  AND blockinga.datid = blockeda.datid
WHERE NOT blockedl.granted
AND blockinga.datname = current_database();
 locked_item | waiting_duration | blocked_pid | blocked_query | blocked_mode | blocking_pid | blocking_query | blocking_mode
-------------+------------------+-------------+---------------+--------------+--------------+----------------+---------------
(0 rows)

Get card by slug on loop

jellyfish=# select count(*) from pg_locks;
 count
-------
     4
(1 row)

jellyfish=# select count(*) from pg_locks;
 count
-------
   148
(1 row)

jellyfish=# select count(*) from pg_locks;
 count
-------
     2
(1 row)

jellyfish=# SELECT
  COALESCE(blockingl.relation::regclass::text,blockingl.locktype) as locked_item,
  now() - blockeda.query_start AS waiting_duration, blockeda.pid AS blocked_pid,
  blockeda.query as blocked_query, blockedl.mode as blocked_mode,
  blockinga.pid AS blocking_pid, blockinga.query as blocking_query,
  blockingl.mode as blocking_mode
FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(
  ( (blockingl.transactionid=blockedl.transactionid) OR
  (blockingl.relation=blockedl.relation AND blockingl.locktype=blockedl.locktype)
  ) AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
  AND blockinga.datid = blockeda.datid
WHERE NOT blockedl.granted
AND blockinga.datname = current_database();
 locked_item | waiting_duration | blocked_pid | blocked_query | blocked_mode | blocking_pid | blocking_query | blocking_mode
-------------+------------------+-------------+---------------+--------------+--------------+----------------+---------------
(0 rows)

Combination of the two above

jellyfish=# select count(*) from pg_locks;
 count
-------
   733
(1 row)

jellyfish=# select count(*) from pg_locks;
 count
-------
   732
(1 row)

jellyfish=# select count(*) from pg_locks;
 count
-------
  1024
(1 row)

jellyfish=# SELECT
  COALESCE(blockingl.relation::regclass::text,blockingl.locktype) as locked_item,
  now() - blockeda.query_start AS waiting_duration, blockeda.pid AS blocked_pid,
  blockeda.query as blocked_query, blockedl.mode as blocked_mode,
  blockinga.pid AS blocking_pid, blockinga.query as blocking_query,
  blockingl.mode as blocking_mode
FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(
  ( (blockingl.transactionid=blockedl.transactionid) OR
  (blockingl.relation=blockedl.relation AND blockingl.locktype=blockedl.locktype)
  ) AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
  AND blockinga.datid = blockeda.datid
WHERE NOT blockedl.granted
AND blockinga.datname = current_database();
 locked_item | waiting_duration | blocked_pid | blocked_query | blocked_mode | blocking_pid | blocking_query | blocking_mode
-------------+------------------+-------------+---------------+--------------+--------------+----------------+---------------
(0 rows)

Update single card on loop with two user sessions simultaneously

jellyfish=# select count(*) from pg_locks;
 count
-------
  1028
(1 row)

jellyfish=# select count(*) from pg_locks;
 count
-------
  1170
(1 row)

jellyfish=# select count(*) from pg_locks;
 count
-------
  1028
(1 row)

jellyfish=# SELECT
  COALESCE(blockingl.relation::regclass::text,blockingl.locktype) as locked_item,
  now() - blockeda.query_start AS waiting_duration, blockeda.pid AS blocked_pid,
  blockeda.query as blocked_query, blockedl.mode as blocked_mode,
  blockinga.pid AS blocking_pid, blockinga.query as blocking_query,
  blockingl.mode as blocking_mode
FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(
  ( (blockingl.transactionid=blockedl.transactionid) OR
  (blockingl.relation=blockedl.relation AND blockingl.locktype=blockedl.locktype)
  ) AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
  AND blockinga.datid = blockeda.datid
WHERE NOT blockedl.granted
AND blockinga.datname = current_database();
 locked_item | waiting_duration | blocked_pid | blocked_query | blocked_mode | blocking_pid | blocking_query | blocking_mode
-------------+------------------+-------------+---------------+--------------+--------------+----------------+---------------
(0 rows)

34.7.20 (Pre core v3 bump)

Create and update cards on loop

jellyfish=# select count(*) from pg_locks;
 count
-------
  1027
(1 row)

jellyfish=# select count(*) from pg_locks;
 count
-------
  1091
(1 row)

jellyfish=# SELECT
  COALESCE(blockingl.relation::regclass::text,blockingl.locktype) as locked_item,
  now() - blockeda.query_start AS waiting_duration, blockeda.pid AS blocked_pid,
  blockeda.query as blocked_query, blockedl.mode as blocked_mode,
  blockinga.pid AS blocking_pid, blockinga.query as blocking_query,
  blockingl.mode as blocking_mode
FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(
  ( (blockingl.transactionid=blockedl.transactionid) OR
  (blockingl.relation=blockedl.relation AND blockingl.locktype=blockedl.locktype)
  ) AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
  AND blockinga.datid = blockeda.datid
WHERE NOT blockedl.granted
AND blockinga.datname = current_database();
 locked_item | waiting_duration | blocked_pid | blocked_query | blocked_mode | blocking_pid | blocking_query | blocking_mode
-------------+------------------+-------------+---------------+--------------+--------------+----------------+---------------
(0 rows)

Get card by slug on loop

jellyfish=# select count(*) from pg_locks;
 count
-------
     2
(1 row)

jellyfish=# select count(*) from pg_locks;
 count
-------
   143
(1 row)

jellyfish=# select count(*) from pg_locks;
 count
-------
   143
(1 row)

jellyfish=# select count(*) from pg_locks;
 count
-------
     2
(1 row)

Combination of the two above

jellyfish=# select count(*) from pg_locks;
 count
-------
  1020
(1 row)

jellyfish=# select count(*) from pg_locks;
 count
-------
   878
(1 row)

jellyfish=# select count(*) from pg_locks;
 count
-------
   878
(1 row)

jellyfish=# SELECT
  COALESCE(blockingl.relation::regclass::text,blockingl.locktype) as locked_item,
  now() - blockeda.query_start AS waiting_duration, blockeda.pid AS blocked_pid,
  blockeda.query as blocked_query, blockedl.mode as blocked_mode,
  blockinga.pid AS blocking_pid, blockinga.query as blocking_query,
  blockingl.mode as blocking_mode
FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(
  ( (blockingl.transactionid=blockedl.transactionid) OR
  (blockingl.relation=blockedl.relation AND blockingl.locktype=blockedl.locktype)
  ) AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
  AND blockinga.datid = blockeda.datid
WHERE NOT blockedl.granted
AND blockinga.datname = current_database();
 locked_item | waiting_duration | blocked_pid | blocked_query | blocked_mode | blocking_pid | blocking_query | blocking_mode
-------------+------------------+-------------+---------------+--------------+--------------+----------------+---------------
(0 rows)

Update single card on loop with two user sessions simultaneously

jellyfish=# select count(*) from pg_locks;
 count
-------
   586
(1 row)

jellyfish=# select count(*) from pg_locks;
 count
-------
  1024
(1 row)

jellyfish=# select count(*) from pg_locks;
 count
-------
   797
(1 row)

jellyfish=# SELECT
  COALESCE(blockingl.relation::regclass::text,blockingl.locktype) as locked_item,
  now() - blockeda.query_start AS waiting_duration, blockeda.pid AS blocked_pid,
  blockeda.query as blocked_query, blockedl.mode as blocked_mode,
  blockinga.pid AS blocking_pid, blockinga.query as blocking_query,
  blockingl.mode as blocking_mode
FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(
  ( (blockingl.transactionid=blockedl.transactionid) OR
  (blockingl.relation=blockedl.relation AND blockingl.locktype=blockedl.locktype)
  ) AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
  AND blockinga.datid = blockeda.datid
WHERE NOT blockedl.granted
AND blockinga.datname = current_database();
 locked_item | waiting_duration | blocked_pid | blocked_query | blocked_mode | blocking_pid | blocking_query | blocking_mode
-------------+------------------+-------------+---------------+--------------+--------------+----------------+---------------
(0 rows)
joshbwlng commented 3 years ago

By flooding a local JF instance with multiple simultaneous card create requests (multiple user sessions) and update requests (update same card with multiple user sessions), I was able to get >4000 AccessShareLock locks, but the system never crashed or timed out.

jellyfish=# select count(*) from pg_locks where mode='AccessShareLock'; select count(*) from pg_locks where mode='ExclusiveLock';
 count
-------
  4351
(1 row)

 count
-------
    31
(1 row)

jellyfish=# select count(*) from pg_locks where mode='AccessShareLock'; select count(*) from pg_locks where mode='ExclusiveLock';
 count
-------
  4206
(1 row)

 count
-------
    32
(1 row)