hasura / graphql-engine

Blazing fast, instant realtime GraphQL APIs on your DB with fine grained access control, also trigger webhooks on database events.
https://hasura.io
Apache License 2.0

Event triggers: Failed inserts into "hdb_catalog.event_invocation_logs" leave blocked queries in "pg_stat_activity" #6485

Open zynyz opened 3 years ago

zynyz commented 3 years ago

I'm playing with insert/update events and seeing some unexpected side effects.

The endpoint is a simple gcloud function that logs its input and exits. That part is working fine, but tables in hdb_catalog aren't getting updated properly.

The problem seems to stem from #4009: the id for the log row never gets generated, so the insert fails and no hdb_catalog.hdb_action_log record gets inserted. The logic that handles updating the event status never runs, since there is nothing in hdb_catalog.hdb_action_log to update.

The end result (aside from events never being processed or delivered) is a whole lot of blocked queries that never complete, and all of the available connection slots are used up.

INSERT INTO hdb_catalog.event_invocation_logs (event_id, status, request, response) fails:

ERROR:  null value in column "id" violates not-null constraint 
DETAIL:  Failing row contains (null, 63f5f204-00a8-471f-be09-4b49e2f4741f, 200, {"payload":{...}")
INSERT INTO hdb_catalog.event_invocation_logs (event_id, status, request, response) VALUES ($1, $2, $3, $4)
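
For anyone hitting this, the missing default can be confirmed straight from the catalog (a diagnostic sketch, using only the schema, table, and column names visible in the error above):

    -- Show whether hdb_catalog.event_invocation_logs.id still has a default value
    select column_name, column_default, is_nullable
      from information_schema.columns
     where table_schema = 'hdb_catalog'
       and table_name   = 'event_invocation_logs'
       and column_name  = 'id';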

select count(*) from hdb_catalog.event_invocation_logs  ==>  0 rows
select count(*) from hdb_catalog.hdb_action_log         ==>  0 rows

Checking for waiting queries:

select pid,usename,application_name,query_start,backend_start,wait_event,query from pg_stat_activity where datname = 'db_name' order by query_start desc;

returns lots of queries like these:

      update hdb_catalog.hdb_action_log set status = 'processing'
        where
          id in (
            select id from hdb_catalog.hdb_action_log
            where status = 'created'
            for update skip locked limit 10
          )
        returning
          id, action_name, request_headers::json, session_variables::json, input_payload::json

which, since hdb_catalog.hdb_action_log has no rows, will never do anything.

Soon enough, no one can connect to the db: FATAL: remaining connection slots are reserved for non-replication superuser connections
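
A rough way to see what is actually holding the slots before they run out (a diagnostic sketch against pg_stat_activity):

    -- Count connections per application and state to see what is using up the slots
    select application_name, state, count(*)
      from pg_stat_activity
     group by application_name, state
     order by count(*) desc;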

This is Hasura v1.3.3 and Postgres 11.9 on Google Cloud.

tirumaraiselvan commented 3 years ago

Seems like you might have lost the pgcrypto extension or re-installed it, etc.

Pls run the SQL here and try again: https://github.com/hasura/graphql-engine/issues/4009#issuecomment-715795849
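
In case the link goes stale: the usual repair for a lost pgcrypto extension is along these lines (a sketch only; it assumes the missing id default should be gen_random_uuid() from pgcrypto, so defer to the SQL in the linked comment):

    -- Make sure pgcrypto (which provides gen_random_uuid) is installed
    CREATE EXTENSION IF NOT EXISTS pgcrypto;
    -- Restore the id default, assuming it was gen_random_uuid()
    ALTER TABLE hdb_catalog.event_invocation_logs
      ALTER COLUMN id SET DEFAULT gen_random_uuid();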

zynyz commented 3 years ago

Thanks for getting back to me so quickly. It's a nice change from other sites I could mention.

That did happen at one point, and I've fixed it (thank you for pointing that out). What I'm concerned about, though, is the side effect of the missing column default.

The initial insert into the event log fails, but the rest of the system continues on, so we wind up with queries that never complete. This results in events that never get processed and marked as delivered (and never get retried), but even worse, the blocked queries use up all the available connections and Postgres needs to be restarted.

This issue is easy to avoid once you know about it, but the cause is obscure, and by the time you realize something is wrong, your database is borked.

tirumaraiselvan commented 3 years ago

> The initial insert into the event log fails, but the rest of the system continues on, so we wind up with queries that never complete. This results in events that never get processed and marked as delivered (and never get retried), but even worse, the blocked queries use up all the available connections and Postgres needs to be restarted.

I am not sure why all the PG connections are maxed out because of this. It could be a case where you have lots of events pending: after each one is processed, a connection is used to mark it as delivered. That step would fail without the pgcrypto extension, because it inserts a new row into the hdb_catalog.event_invocation_logs table, whose id default uses pgcrypto. These queries shouldn't be blocked (it's a single-row update and a single-row insert); they would just fail. So I think the maxed-out PG connections are more likely because you have many pending events which are all trying to do post-processing simultaneously.
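
One way to check that hypothesis is to count how many events are still awaiting delivery (a sketch against hdb_catalog.event_log, using the same flag columns as the fetch query shown further down):

    -- Count events that have not yet been delivered, errored, or archived
    select count(*)
      from hdb_catalog.event_log
     where delivered = 'f'
       and error = 'f'
       and archived = 'f';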

zynyz commented 3 years ago

I'll admit, I'm not that familiar with what Postgres is doing behind the scenes. This was my interpretation of the data I had, so I may be completely off-base about this.

I removed the default value for hdb_catalog.event_invocation_logs.id to recreate the problem (roughly as in the sketch after the pg_stat_activity records below). The insert into event_invocation_logs fails, leaving lots of these behind in pg_stat_activity:

-[ RECORD 14 ]+------------------------------------------------------------------------------------------------
backend_start | 2021-01-29 20:55:50.332948+00
xact_start    | [NULL]
query_start   | 2021-01-29 20:58:05.857545+00
state_change  | 2021-01-29 20:58:05.857644+00
wait_event    | ClientRead
state         | idle
query         |                                                                                                
              |         update hdb_catalog.hdb_action_log set status = 'processing'                            
              |         where                                                                                  
              |           id in (                                                                              
              |             select id from hdb_catalog.hdb_action_log                                          
              |             where status = 'created'                                                           
              |             for update skip locked limit 10                                                    
              |           )                                                                                    
              |         returning                                                                              
              |           id, action_name, request_headers::json, session_variables::json, input_payload::json 
              |
-[ RECORD 15 ]+------------------------------------------------------------------------------------------------
backend_start | 2021-01-29 20:48:13.482425+00
xact_start    | [NULL]
query_start   | 2021-01-29 20:58:05.905672+00
state_change  | 2021-01-29 20:58:05.90588+00
wait_event    | ClientRead
state         | idle
query         |                                                                                                
              |       UPDATE hdb_catalog.event_log                                                             
              |       SET locked = 't'                                                                         
              |       WHERE id IN ( SELECT l.id                                                                
              |                     FROM hdb_catalog.event_log l                                               
              |                     WHERE l.delivered = 'f' and l.error = 'f' and l.locked = 'f'               
              |                           and (l.next_retry_at is NULL or l.next_retry_at <= now())            
              |                           and l.archived = 'f'                                                 
              |                     ORDER BY created_at                                                        
              |                     LIMIT $1                                                                   
              |                     FOR UPDATE SKIP LOCKED )                                                   
              |       RETURNING id, schema_name, table_name, trigger_name, payload::json, tries, created_at    
              |
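
For reference, removing the default to provoke the failure above amounts to a single statement (a sketch):

    -- Drop the id default so inserts into event_invocation_logs start failing
    ALTER TABLE hdb_catalog.event_invocation_logs
      ALTER COLUMN id DROP DEFAULT;

Restoring it is covered by the SQL linked from #4009 above.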

I don't know if FOR UPDATE SKIP LOCKED makes any difference or not.

I haven't been able to recreate the no-available-connections issue yet, so that may have been coincidence.

Either way, there are a lot of events in the GUI that are marked as not delivered (even though they were) and never get an entry in the 'processed' or 'invocation' logs.

As a side note, there are 6 processes listening for schema updates, with consecutive PIDs. Is this the standard behavior, or is there something that we need to tune?

  pid   |         backend_start         | xact_start |          query_start          |         state_change          | wait_event | state |             query
--------+-------------------------------+------------+-------------------------------+-------------------------------+------------+-------+-------------------------------
 215598 | 2021-01-29 01:20:53.065087+00 | [NULL]     | 2021-01-29 01:20:53.888259+00 | 2021-01-29 01:20:53.888317+00 | ClientRead | idle  | LISTEN  hasura_schema_update;
 215599 | 2021-01-29 01:20:53.146257+00 | [NULL]     | 2021-01-29 01:20:54.073265+00 | 2021-01-29 01:20:54.073618+00 | ClientRead | idle  | LISTEN  hasura_schema_update;
 215600 | 2021-01-29 01:20:53.461244+00 | [NULL]     | 2021-01-29 01:20:54.227973+00 | 2021-01-29 01:20:54.228052+00 | ClientRead | idle  | LISTEN  hasura_schema_update;
 260409 | 2021-01-29 16:59:05.82241+00  | [NULL]     | 2021-01-29 16:59:06.728206+00 | 2021-01-29 16:59:06.728351+00 | ClientRead | idle  | LISTEN  hasura_schema_update;
 260410 | 2021-01-29 16:59:05.919267+00 | [NULL]     | 2021-01-29 16:59:07.096014+00 | 2021-01-29 16:59:07.096129+00 | ClientRead | idle  | LISTEN  hasura_schema_update;
 260411 | 2021-01-29 16:59:05.937349+00 | [NULL]     | 2021-01-29 16:59:07.091174+00 | 2021-01-29 16:59:07.091267+00 | ClientRead | idle  | LISTEN  hasura_schema_update;
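
If it helps anyone comparing numbers, the listener backends above can be pulled straight out of pg_stat_activity (a diagnostic sketch):

    -- List backends whose last statement was the Hasura schema-update LISTEN
    select pid, backend_start, state
      from pg_stat_activity
     where query ilike 'LISTEN%hasura_schema_update%'
     order by pid;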