Open zynyz opened 3 years ago
Seems like you might have lost the pgcrypto extension or re-installed it, etc. Please run the SQL here and try again: https://github.com/hasura/graphql-engine/issues/4009#issuecomment-715795849
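The gist of that SQL is recreating the extension and restoring the UUID default. A sketch, assuming the missing default on the id column was pgcrypto's gen_random_uuid() as in the stock hdb_catalog schema -- the statements in the linked comment are the authoritative version:

```sql
-- Recreate the extension, then restore the column default
-- (assumes the id default was pgcrypto's gen_random_uuid()).
CREATE EXTENSION IF NOT EXISTS pgcrypto;
ALTER TABLE hdb_catalog.event_invocation_logs
  ALTER COLUMN id SET DEFAULT gen_random_uuid();
```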
Thanks for getting back to me so quickly. It's a nice change from other sites I could mention.
That did happen at one point, and I've fixed it (thank you for pointing that out). What I'm concerned about, though, is the side effect of the missing column default.
The initial insert into the event log fails, but the rest of the system continues on, so we wind up with queries that never complete. This results in events that never get processed and marked as delivered (and never get retried), but even worse, the blocked queries use up all the available connections and Postgres needs to be restarted.
This issue is easy to avoid, but the problem is that the cause is obscure and by the time you realize there's a problem, your database is borked.
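One way to gauge how big the backlog has gotten is to count the events that will never drain -- a sketch, using the column names that appear in the event_log dequeue query further down:

```sql
-- Events never marked delivered or errored that are still
-- eligible for pickup (stock hdb_catalog.event_log columns).
SELECT count(*)
FROM hdb_catalog.event_log
WHERE delivered = 'f'
  AND error = 'f'
  AND archived = 'f';
```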
I am not sure why all the PG connections are maxed out because of this. When an event finishes post-processing, one connection is used to mark it as delivered; that would fail here because the pgcrypto extension is missing (marking an event delivered inserts a new row into the hdb_catalog.event_invocation_logs table, which uses pgcrypto). These queries shouldn't be blocked (it's a single-row update and a single-row insert) -- they would just fail. I think the maxed-out PG connections are more likely because you have many events in the pending state, all trying to do post-processing simultaneously.
I'll admit, I'm not that familiar with what Postgres is doing behind the scenes. This was my interpretation of the data I had, so I may be completely off-base about this.
I removed the default value for hdb_catalog.event_invocation_logs.id to recreate the problem. The insert into event_invocation_logs fails, leaving lots of these behind in pg_stat_activity:
-[ RECORD 14 ]+------------------------------------------------------------------------------------------------
backend_start | 2021-01-29 20:55:50.332948+00
xact_start | [NULL]
query_start | 2021-01-29 20:58:05.857545+00
state_change | 2021-01-29 20:58:05.857644+00
wait_event | ClientRead
state | idle
query |
| update hdb_catalog.hdb_action_log set status = 'processing'
| where
| id in (
| select id from hdb_catalog.hdb_action_log
| where status = 'created'
| for update skip locked limit 10
| )
| returning
| id, action_name, request_headers::json, session_variables::json, input_payload::json
|
-[ RECORD 15 ]+------------------------------------------------------------------------------------------------
backend_start | 2021-01-29 20:48:13.482425+00
xact_start | [NULL]
query_start | 2021-01-29 20:58:05.905672+00
state_change | 2021-01-29 20:58:05.90588+00
wait_event | ClientRead
state | idle
query |
| UPDATE hdb_catalog.event_log
| SET locked = 't'
| WHERE id IN ( SELECT l.id
| FROM hdb_catalog.event_log l
| WHERE l.delivered = 'f' and l.error = 'f' and l.locked = 'f'
| and (l.next_retry_at is NULL or l.next_retry_at <= now())
| and l.archived = 'f'
| ORDER BY created_at
| LIMIT $1
| FOR UPDATE SKIP LOCKED )
| RETURNING id, schema_name, table_name, trigger_name, payload::json, tries, created_at
|
I don't know if FOR UPDATE SKIP LOCKED makes any difference or not.
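For what it's worth, my understanding is that SKIP LOCKED makes the SELECT pass over rows another transaction has locked instead of waiting on them, so two pollers running the same dequeue should each grab a disjoint batch rather than block each other -- roughly:

```sql
-- Two workers running this concurrently get disjoint batches:
-- rows locked by the other worker's FOR UPDATE are skipped, not waited on.
BEGIN;
SELECT id
FROM hdb_catalog.event_log
WHERE locked = 'f'
ORDER BY created_at
LIMIT 10
FOR UPDATE SKIP LOCKED;
-- ...mark the batch as locked, then COMMIT to release the row locks.
COMMIT;
```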
I haven't been able to recreate the no-available-connections issue yet, so that may have been coincidence.
Either way, there are a lot of events in the GUI that are marked as not delivered (even though they were) and never get an entry in the 'processed' or 'invocation' logs.
As a side note, there are 6 processes listening for schema updates, with consecutive PIDs. Is this the standard behavior, or is there something that we need to tune?
pid | backend_start | xact_start | query_start | state_change | wait_event | state | query
--------+-------------------------------+------------+-------------------------------+-------------------------------+------------+-------+-------------------------------
215598 | 2021-01-29 01:20:53.065087+00 | [NULL] | 2021-01-29 01:20:53.888259+00 | 2021-01-29 01:20:53.888317+00 | ClientRead | idle | LISTEN hasura_schema_update;
215599 | 2021-01-29 01:20:53.146257+00 | [NULL] | 2021-01-29 01:20:54.073265+00 | 2021-01-29 01:20:54.073618+00 | ClientRead | idle | LISTEN hasura_schema_update;
215600 | 2021-01-29 01:20:53.461244+00 | [NULL] | 2021-01-29 01:20:54.227973+00 | 2021-01-29 01:20:54.228052+00 | ClientRead | idle | LISTEN hasura_schema_update;
260409 | 2021-01-29 16:59:05.82241+00 | [NULL] | 2021-01-29 16:59:06.728206+00 | 2021-01-29 16:59:06.728351+00 | ClientRead | idle | LISTEN hasura_schema_update;
260410 | 2021-01-29 16:59:05.919267+00 | [NULL] | 2021-01-29 16:59:07.096014+00 | 2021-01-29 16:59:07.096129+00 | ClientRead | idle | LISTEN hasura_schema_update;
260411 | 2021-01-29 16:59:05.937349+00 | [NULL] | 2021-01-29 16:59:07.091174+00 | 2021-01-29 16:59:07.091267+00 | ClientRead | idle | LISTEN hasura_schema_update;
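In case it helps, this is how I'm counting those listeners (a sketch against pg_stat_activity; the exact-match on the query text is an assumption):

```sql
SELECT count(*)
FROM pg_stat_activity
WHERE query = 'LISTEN hasura_schema_update;';
```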
I'm playing with insert/update events and seeing some unexpected side effects.

The endpoint is a simple gcloud function that logs its input and exits. That part is working fine, but tables in hdb_catalog aren't getting updated properly. The problem seems to stem from #4009. event_id does not get generated, so the insert fails and no hdb_catalog.hdb_action_log record gets inserted. The logic that handles updating the event status doesn't happen since there is nothing in hdb_catalog.hdb_action_log to update.

The end result (aside from events never being processed or delivered) is a whole lot of blocked queries that never complete, and all of the available connection slots are used up.
INSERT INTO hdb_catalog.event_invocation_logs (event_id, status, request, response) fails:

select count(*) from hdb_catalog.event_invocation_logs  ==> 0 rows
select count(*) from hdb_catalog.hdb_action_log         ==> 0 rows

Checking for waiting queries:

select pid,usename,application_name,query_start,backend_start,wait_event,query from pg_stat_activity where datname = 'db_name' order by query_start desc;

returns lots of queries like these:
which, since hdb_catalog.hdb_action_log has no rows, will never do anything. Soon enough, no one can connect to the db:

FATAL: remaining connection slots are reserved for non-replication superuser connections
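You can watch the slots fill up with something like this (run it while slots remain; max_connections is the server-wide cap):

```sql
-- Current connection usage vs. the configured cap.
SELECT count(*) AS in_use,
       current_setting('max_connections') AS max_connections
FROM pg_stat_activity;
```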
This is Hasura v1.3.3 and Postgres 11.9 on Google Cloud.