supabase / pg_net

A PostgreSQL extension that enables asynchronous (non-blocking) HTTP/HTTPS requests with SQL
https://supabase.github.io/pg_net
Apache License 2.0

cleanup of net._http_response table #97

Open vickkhera opened 1 year ago

vickkhera commented 1 year ago

Improve documentation

Link

README.md in the repo

Describe the problem

It is unclear whether one is supposed to clean up the net._http_response table or if there is a task that does so. I'm assuming not, because every use case would have a different requirement for retention of that data. The docs clearly state that the net.http_request_queue is self-cleaning.

Describe the improvement

State whether that table should be periodically cleaned by the user.

Additional context

none.

TheOtherBrian1 commented 1 year ago

Hello, @vickkhera,

By default, whenever the extension runs, it automatically deletes all rows that were created 6 or more hours ago. This design was specifically tailored for Supabase's webhook feature, as it doesn't require long-term data retention. However, if you need to adjust the delete frequency, you can modify the ttl (time to live) interval variable in worker.c.
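
For reference, the effect of that cleanup is roughly equivalent to the delete below (a sketch only, not the literal worker code; it assumes the response table's created timestamp column):

-- sketch: approximate effect of the built-in TTL cleanup performed by the background worker
DELETE FROM net._http_response
WHERE created < now() - interval '6 hours';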

It's important to note that the table is unlogged, meaning that data won't be preserved in case PostgreSQL crashes or shuts down. This decision was made to improve execution speeds. If you have specific requirements for data preservation, you can modify the table to be logged. However, do consider that it may have a more significant impact on your instance's overall performance, so make sure it's necessary for your project before making the change.

vickkhera commented 1 year ago

Thanks. That's helpful. I think I'd prefer to change my code to meet these assumptions rather than change the extension.

However, I am noticing an issue that is causing some trouble for me. Every time I restart the DB, the net.http_request_queue_id_seq seems to reset.

[Vicks-M1-Air]% psql postgresql://postgres:postgres@localhost:54322/postgres
Timing is on.
Expanded display is used automatically.
psql (15.3 (Homebrew), server 15.1 (Ubuntu 15.1-1.pgdg20.04+1))
Type "help" for help.

postgres=> select net.http_post('http://www.example.com/foo');
 http_post
-----------
         3
(1 row)

Time: 10.908 ms
postgres=> select net.http_post('http://www.example.com/foo');
 http_post
-----------
         4
(1 row)

Time: 2.758 ms
postgres=>
\q
[Vicks-M1-Air]% npx supabase stop
Postgres database saved to volume: supabase_db_XX
Postgres config saved to volume: supabase_config_XX
Storage directory saved to volume: supabase_storage_XX
Stopped supabase local development setup.
[Vicks-M1-Air]% npx supabase start
Started supabase local development setup.

         API URL: http://localhost:54321
     GraphQL URL: http://localhost:54321/graphql/v1
          DB URL: postgresql://postgres:postgres@localhost:54322/postgres
      Studio URL: http://localhost:54323
    Inbucket URL: http://localhost:54324
   [ ... keys removed ... ]
[Vicks-M1-Air]% psql postgresql://postgres:postgres@localhost:54322/postgres
Timing is on.
Expanded display is used automatically.
psql (15.3 (Homebrew), server 15.1 (Ubuntu 15.1-1.pgdg20.04+1))
Type "help" for help.

postgres=> select net.http_post('http://www.example.com/foo');
 http_post
-----------
         1
(1 row)

Time: 12.237 ms
postgres=> select net.http_post('http://www.example.com/foo');
 http_post
-----------
         2
(1 row)

Time: 3.298 ms
postgres=>
\q

That seems deliberate -- there's nothing about an unlogged table itself that should cause its sequence to reset on server restart.

Why this is a problem for me: I keep track of async jobs I submit to an API via pg_net. Every so often (obviously at an interval of less than 6 hours, based on your comment), I check whether net._http_response has a success response for each async job. I use the ID returned from the post to check, as there's no other way to link the request to the response. If the response is not a success, I can re-submit my request via the API.

This breaks down if the sequence is ever reset.

I originally tried to use an ON INSERT trigger on the response table to delete my async job tracking record. This would let me just look for any records there that are more than X minutes old and retry them. I wasn't able to get a trigger to work properly on that table, though.

What is the reason the sequence gets reset on every restart? Can it be changed to not do so?

vickkhera commented 1 year ago

What is the reason the sequence gets reset on every restart? Can it be changed to not do so?

I did a bunch more reading, and it seems this is indeed how unlogged tables tend to work in Postgres. The documentation is horribly lacking on this; I will ask them to add clarifications. Specifically, it says the table may (in practice, likely will) be truncated on restart. However, a manual truncate of a table doesn't reset its sequence, whereas it does here. They also document that this happens only on unclean shutdowns, but others' experience seems to show otherwise.

I will figure out a way to track which jobs succeeded in submitting via the REST API some other way, since this ID is not durable. :shrug:

vickkhera commented 1 year ago

@TheOtherBrian1

I'm re-evaluating my strategy to use a trigger on net._http_response to clean up my async job tracking queue. Is that a safe thing to do? What are the potential consequences of creating a trigger on that table? My concerns are with upgrades to the extension if my trigger will potentially conflict.

TheOtherBrian1 commented 1 year ago

@vickkhera, please consider the following advice with a grain of salt. While I have contributed to the extension by writing most of the documentation and fixing a few bugs, I am not the maintainer nor the primary expert on it. Trust your judgment and testing above all else.

Unlogged tables in PostgreSQL are similar to regular tables but skip writing data to the write-ahead log (WAL) for better performance. If you create a trigger on an unlogged table, it should behave like a trigger on any other table:

  1. It becomes part of the whole transaction, so if the trigger fails, the entire transaction will be rolled back.
  2. It might make the transaction slower and more resource-intensive.

Using triggers often offers an acceptable tradeoff, but if possible, making the table logged might be easier:

ALTER TABLE <tablename> SET LOGGED;

If you still prefer using triggers, I recommend adding an exception logger to your trigger, especially during early testing. In Supabase, you can log errors using the "RAISE LOG" keywords. This helps quickly identify and resolve any problems, like permission issues. Alternatively, you can explore using PG_CRON, supported by Supabase, to intermittently monitor the table instead of using a trigger for every response added.
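
As a minimal sketch of that suggestion -- the function and trigger names, the async_jobs table, and its request_id column are all hypothetical -- the trigger can catch its own errors and surface them with RAISE LOG, so the response row is still inserted even if the cleanup step fails:

-- sketch: AFTER INSERT trigger on net._http_response that logs its own failures
CREATE OR REPLACE FUNCTION handle_http_response() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  -- hypothetical cleanup: drop the tracking row once its request succeeded
  DELETE FROM async_jobs
   WHERE request_id = NEW.id AND NEW.status_code = 200;
  RETURN NULL; -- return value is ignored for AFTER triggers
EXCEPTION WHEN OTHERS THEN
  -- log the problem (e.g. a permission error) and let the insert proceed
  RAISE LOG 'handle_http_response failed for response %: %', NEW.id, SQLERRM;
  RETURN NULL;
END;
$$;

CREATE TRIGGER on_http_response
AFTER INSERT ON net._http_response
FOR EACH ROW EXECUTE FUNCTION handle_http_response();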

Regarding concerns about code changes, the current code is stable. If the extension meets your needs, you don't have to upgrade unless newer PostgreSQL versions become incompatible with the current extension library. Moreover, the potential future features I suspect may be implemented are relatively minor and should not conflict with your setup:

potential features/fixes:

  1. The "PENDING" status code should be valid but is not yet implemented in the code. This needs improvement.
  2. The tests for the code behave inconsistently due to a faulty testing API that times out before completing the tests.
  3. The POST request only supports JSON bodies, but it might be expanded to include other data types in the future.
  4. There may be plans to add PUT requests as an option.

Apart from fixing the "PENDING" bug, these changes should not significantly impact how you manage the extension. As far as I know, no one has immediate plans to resolve these issues, except for the 2nd issue which I'm currently working on.

vickkhera commented 1 year ago

With the current Supabase permission scheme, it is not possible to alter the table to logged. On localhost development, I can use the supabase_admin user, but not on cloud-hosted Supabase.

My biggest concern with the trigger approach is that the postgres user does not have permission to drop or alter the trigger after it is created. I'm guessing that's related to ownership. It confuses me that I can add the trigger but not delete it.

On localhost development, the result table is cleared 100% of the time after supabase stop, no matter how long the DB has been running. I found a discussion thread on Hacker News referencing some deep Postgres internals which suggests the data sometimes never actually hits the disk. Not sure if that's related.

TheOtherBrian1 commented 1 year ago

@vickkhera

In regard to your original problem about the lacking documentation, I did another deep dive into the extension code and found the following function call in worker.c:

DefineCustomStringVariable(
    "pg_net.database_name",
    "database where the pg_net worker is connected",
    NULL,
    &database_name,
    "postgres",
    PGC_SIGHUP, 0,
    NULL, NULL, NULL);

DefineCustomStringVariable comes from guc.c in PostgreSQL (GUC stands for Grand Unified Configuration). Calls like this one register custom settings, which is what makes it possible to set variables such as ttl in the postgresql.conf file.

Looking further into the code, it seems that the extension creates 3 configurable variables:

  1. pg_net.batch_size (default: 200): An integer that limits the maximum number of rows the extension will process from net.http_request_queue during each read
  2. pg_net.ttl (default: 6 hours): An interval that defines the maximum time a row in net._http_response will live before being deleted
  3. pg_net.database_name (default: 'postgres'): A string that defines which database the extension is applied to

All these variables can be viewed with the following command:

SELECT * FROM pg_settings WHERE name LIKE 'pg_net%';

The postgresql.conf file can be located with the following SQL command:

SHOW config_file;

You can change the ttl variable by adding the following line to your postgresql.conf file:

pg_net.ttl = '<new ttl interval>'

After saving the file, you can execute SELECT pg_reload_conf() to apply the updated settings to your database. If the extension does not respond to the update, it may be necessary to restart your entire server.
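
Putting those steps together, a possible workflow looks roughly like this (the 12-hour value is only an example):

-- 1. locate the configuration file
SHOW config_file;

-- 2. edit postgresql.conf and add a line such as:
--      pg_net.ttl = '12 hours'

-- 3. reload the configuration without a full restart
SELECT pg_reload_conf();

-- 4. confirm the new value was picked up
SELECT name, setting FROM pg_settings WHERE name = 'pg_net.ttl';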

You were right to report this issue. The documentation should have included instructions on configuration.

TheOtherBrian1 commented 1 year ago

As for your issue with adding a trigger in the supabase dashboard, that probably cannot be solved without the help of a Supabase team member. In October of 2022, the team announced a complete ban on Superuser access from the dashboard:

You will no longer be able to: create, alter, or drop tables, views, functions, triggers, sequences, and other entities in Supabase managed schemas, including extensions, graphql, realtime, and supabase_functions.

Supabase managed schemas are used to support platform features for all projects. Entities in these schemas are owned by supabase_admin role to prevent users from accidentally overriding them and breaking platform features. Unless explicitly granted, non-superuser roles cannot manage entities in Supabase managed schemas after the migration.

If you think modifying these schemas is necessary for your project, please contact support@supabase.io to explain your use case. We will try our best to accommodate your use case using alternative suggestions.

As you can imagine, this change caused some issues, including the exact issue you're having right now, as outlined by a disgruntled user, srowe0091:

In the Dashboard I was testing out a trigger, and accidentally created it in the net._http_response table (because that is the default chosen table upon opening the side drawer.) It didn't show up in the trigger list page and when I realized where the trigger was, I tried to delete it in the SQL editor but was unable to do so because I get this error

ERROR: must be owner of relation _http_response

As far as I can tell, this user's specific issue was never resolved. However, other people in the thread discussed general workarounds that may be useful if you want to check them out. As said before, if you really need superuser access, the only people who can help are on the Supabase team.

While looking into this, I potentially uncovered another issue: it does not appear that dropped requests are entered into the net._http_response table. To clarify, requests with 400 or 500 responses are still inserted into the response table; if that's all you're interested in, I wrote an awkward workaround into the documentation. But if you really need to track requests that failed to execute at all, that may not be possible.

According to this issue report, dropped requests were initially inserted into the table when the extension was written.

The response table is private API and subject to breakage, so we should probably avoid that. But yes, [errors] end up in the response table.

I don't think we log connection errors right now, only for debugging purposes.

However, a later enhancement modified the code:

To increase throughput, we could make the net tables UNLOGGED, remove indexes and PKs from them(this would also solve https://github.com/supabase/pg_net/issues/44).

An alternative could be logging the failed http requests. Then the user would have to search the pg logs.

The plan was to give users the option to log error messages to either a logger or a file, but that plan was never implemented.

Yeah, that can be done by using a call to elog in the background worker.

Can we optionally log something based on the status code when the response is handled?

We should have a config to turn this on/off, maybe named like pg_net.log_http_errors (off by default).

The pg_net.log_http_errors variable that would be necessary to toggle logging on does not exist in worker.c. My guess is that the error loggers in the extension are remnants of early debugging, but they're not configured to write to any location; the errors are likely being discarded.

If you are willing to do the research to learn how to configure your instance to log the error messages, either to a CSV file or to Supabase's logger, please share how you achieved it.

Extension error logging in PostgreSQL is pretty esoteric. My background is too limited to offer any advice on it.

This is probably not the response you wanted, as I'm essentially saying your problem cannot be solved with the code as it is. Sorry for all the strife. It may make sense for you to raise an issue that solely focuses on error logging.

vickkhera commented 1 year ago

I really appreciate your response. I understand the design choices you mention, and they do make sense.

I have no issue creating the trigger on the table (I don't use the UI for anything I can do in code); the issue was deleting the trigger. In a thread on Discord it was pointed out that dropping the function with CASCADE actually bypasses the permission check on the trigger somehow. That's "safe" enough for me. This works on local dev; I haven't tried it on cloud Supabase yet. I suppose I should :)
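
Concretely, that workaround is just a cascade drop of the trigger function, which takes the dependent trigger with it -- something like the line below, using the function name from the code I share later in this thread:

-- dropping the trigger's function with CASCADE also drops the trigger,
-- even though dropping the trigger directly is not permitted
DROP FUNCTION notice_net_http_response() CASCADE;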

This is my use case, and I think it is worth considering as it lets me build up a reliable yet fast web hook:

I use a function to trigger my web hook which stores the request details in async_actions along with the request ID returned from the post function. My trigger on the result table just deletes the entry in async_actions table with the matching ID if the status is 200. This trigger is DEFERRABLE INITIALLY DEFERRED so it is non-blocking until the end of the transaction as well. I then have a sweeper running from cron that comes by every so often and resubmits any async actions that are older than a few minutes on the assumption that the hook failed because it wasn't removed by the trigger.

If adding the trigger is considered bad form, then I can just resort to polling the result table and doing my cleanup in batches. This opens the risk of re-running my hooks if the DB is shut down with un-reaped results. With the trigger method, that window is much much smaller. I'm sure other people have a need to notice that their web hooks actually succeeded and take action on that.

Writing to and parsing a log file is just asking for more fragility in the overall system so I'd prefer to avoid it.

TheOtherBrian1 commented 1 year ago

Hi @vickkhera, I wanted to ask whether your solution worked for your project or whether you decided to switch to polling.

vickkhera commented 1 year ago

I did end up sticking with my trigger approach. I shared this on the Supabase discord yesterday, but I'll share it here too in case someone else stumbles across this. The benefit of using the trigger is that my "in flight" list of actions remains tiny. In my experiments using localhost supabase, every time I restart the DB the http result table is wiped and the sequence counter reset, so there's no way to reconcile the finished work.

I have a table called async_action into which I insert things I need to send to my queue running service, Inngest. The goal is not to lose any tasks due to network communications error, so I have a durable local log that only needs to exist until the remote service has accepted the work.

I use a function to prepare the task and record it into this table to keep a durable log. The main lines from that function are these:

  net_http_id := net.http_post(inngest_url, inngest_data::JSONB);
  EXECUTE format('INSERT INTO async_task (action,context,net_http_id) VALUES (%L, %L::JSONB, %L) ON CONFLICT (action,context) DO NOTHING', job_type, inngest_data, net_http_id);

I could/should make the call to http_post() part of the insert trigger on that table, but due to evolution of the code I ended up doing it this way directly in the function. (hmmm... looking at this now I see a logic flaw where the on conflict will lose the new ID if there was already a pending task for the same request).

Next, I use a trigger function to notice the task has been acknowledged by the remote service, and remove the durable record as it is no longer needed.

CREATE FUNCTION notice_net_http_response() RETURNS trigger LANGUAGE plpgsql AS $$
 BEGIN
  -- NOTE: the http_response sequence resets on DB restart, so there is potential for old
  -- async jobs to have duplicated IDs. Hopefully in practice this shouldn't be a problem.
  DELETE FROM async_task WHERE net_http_id = NEW.id AND NEW.status_code = 200;
  RETURN NULL; -- this is an AFTER trigger
 END;
$$;

CREATE CONSTRAINT TRIGGER net_http_response_insert
 AFTER INSERT ON net._http_response
 DEFERRABLE INITIALLY DEFERRED
 FOR EACH ROW
 EXECUTE FUNCTION notice_net_http_response();

Finally, from cron every so often, I run the function resubmit_all_failed_async_tasks(); once every few minutes is sufficient for my needs (the scheduling call is sketched at the end of this comment). I still need to find a way to make that function run upon DB restart to reduce the risk of overlapping net_http_id's, but that's a problem for "tomorrow me". I'm still learning what the proper delay is in my select -- currently I use 1 minute.

-- resubmit an async task to Inngest, under assumption that the http post failed
CREATE FUNCTION resubmit_async_task(task_id INTEGER) RETURNS BIGINT LANGUAGE plpgsql AS $$
 DECLARE
  inngest_url TEXT := inngest_submit_url();
  new_id async_task.net_http_id%TYPE;
 BEGIN
  RAISE DEBUG 'Resubmit async task=%', task_id;
  UPDATE async_task SET net_http_id = net.http_post(inngest_url, context) WHERE async_task_id = task_id RETURNING net_http_id INTO new_id;
  RETURN new_id;
 END;
$$;

-- only allow this function to be called by the service role.
REVOKE ALL ON FUNCTION resubmit_async_task FROM PUBLIC,authenticated,anon;
GRANT EXECUTE ON FUNCTION resubmit_async_task TO service_role;

-- re-submit all async tasks that failed to submit to Inngest (have not yet been cleaned up). we allow
-- 1 minute for the call to be processed before it is eligible for re-submission.
--
-- this is a maintenance function, so only the service role can call it.
CREATE FUNCTION resubmit_all_failed_async_tasks() RETURNS INTEGER LANGUAGE plpgsql AS $$
 DECLARE
  resubmit_count INTEGER;
 BEGIN
  PERFORM resubmit_async_task(async_task_id) FROM async_task WHERE created < CURRENT_TIMESTAMP - INTERVAL '1 minute';
  GET DIAGNOSTICS resubmit_count = ROW_COUNT;
  RETURN resubmit_count;
 END;
$$;

-- only allow this function to be called by the service role.
REVOKE ALL ON FUNCTION resubmit_all_failed_async_tasks FROM PUBLIC,authenticated,anon;
GRANT EXECUTE ON FUNCTION resubmit_all_failed_async_tasks TO service_role;

Arguably, these could be combined into one function, but keeping them separate allows me to debug and resubmit things individually if I need to while developing code.
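
For completeness, the cron scheduling call mentioned above looks roughly like this with pg_cron (the five-minute interval is just an example; use whatever delay fits your needs):

-- sketch: schedule the sweeper with pg_cron
SELECT cron.schedule(
 'resubmit-failed-async-tasks',            -- job name
 '*/5 * * * *',                            -- every five minutes (example)
 $$ SELECT resubmit_all_failed_async_tasks(); $$
);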