supabase / pg_net

A PostgreSQL extension that enables asynchronous (non-blocking) HTTP/HTTPS requests with SQL
https://supabase.github.io/pg_net
Apache License 2.0

Webhooks fail when multiple inserts are done #86

Open devdras opened 1 year ago

devdras commented 1 year ago

Bug report

Describe the bug

I have a webhook configured to watch a table for inserts. On an insert, a POST request should be fired to an API hosted on Vercel. This works when a single insert operation is carried out. If multiple inserts are carried out in one call, e.g. .insert([{message: 1}, {message: 2}, {message: 3}]), then the webhooks aren't fired at all. If we instead carry out multiple inserts by calling the Supabase API multiple times, the webhook is fired for only some of the inserts.

To Reproduce

Steps to reproduce the behavior, please provide code snippets or a repository:

  1. Configure a webhook on a table to watch for insert operations and call an external API

  2. Use the Supabase JS library to perform multiple inserts like: let { data, error } = await supabase.from("example_table").insert([{message: 1}, {message: 2}, {message: 3}, {message: 4}, {message: 5}, {message: 6}, etc ]).select("message");

  3. Check logs for external API to confirm it didn't receive any requests

  4. See error

Something to note: I was inserting 32 rows at once.

Expected behavior

For each insert, I would expect the external API to be called. It would also help if there was a log in Supabase that showed what happens when the webhook is run.

danrasmuson commented 1 year ago

Thank you for posting this. I'm receiving a similar error.

Surprisingly, when I create 10 records at the same time it fires the function for each of the 10 records...

INSERT INTO public.profiles (id, first_name, last_name, stripe_customer_id, email)
SELECT u.id, NULL, NULL, NULL, u.email
FROM auth.users u
LEFT JOIN public.profiles p ON u.id = p.id
WHERE p.id IS NULL
LIMIT 10;

However, if I fire the function for 100 records...

INSERT INTO public.profiles (id, first_name, last_name, stripe_customer_id, email)
SELECT u.id, NULL, NULL, NULL, u.email
FROM auth.users u
LEFT JOIN public.profiles p ON u.id = p.id
WHERE p.id IS NULL
LIMIT 100;

It fails

devdras commented 1 year ago

Yup, basically the webhook behaviour is flaky. I haven't tried it recently. I wanted to use this to build an event-driven architecture but can't without it being reliable. I ended up using QStash as a replacement for the webhook. I would love Supabase to have some kind of full-featured event queue like SQS/EventBridge/Service Bus.

steve-chavez commented 1 year ago

We've tested pg_net sending 1K inserts per second, though that was against a local nginx server.

@devdras @danrasmuson Which webhook service are you using? Likely it's rate limiting the requests.

danrasmuson commented 1 year ago

We've tested pg_net sending 1K inserts per second, though that was against a local nginx server.

@devdras @danrasmuson Which webhook service are you using? Likely it's rate limiting the requests.

I'm using Vercel with serverless functions.

matt-aitken commented 10 months ago

We're experiencing the same problem. When inserting records in sequence, all the webhooks are received. But if we insert 30 rows in parallel, none of the webhooks send and the net._http_response table is full of "Timeout was reached" errors.

steve-chavez commented 9 months ago

@matt-aitken Are you using vercel serverless functions too?

lohrmann commented 9 months ago

@steve-chavez We're able to reliably reproduce it based on the number of new entries that get bulk added to the table. We've been in touch with your support team about this as well, but have been unable to resolve it.

Happy to prepare a screen-recording or do a walk through if helpful.

steve-chavez commented 9 months ago

The lambda function does not record any invocations if the number of records inserted is larger than 150

@lohrmann That might be related to lambdas rate limiting https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html#api-requests

We're able to reliably reproduce it based on the number of new entries that get bulk added to the table

By default there are 200 requests per iteration https://github.com/supabase/pg_net/blob/master/src/worker.c#L479-L483

I'll make this parameter user configurable.
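
For anyone landing here later: newer pg_net releases appear to expose this as a GUC (reportedly pg_net.batch_size, defaulting to 200). A hypothetical sketch, assuming that name; verify it exists for your pg_net version before relying on it:

-- Lower the number of queued requests the background worker sends per iteration.
-- The setting name is an assumption based on the pg_net docs; not available on 0.7.x.
ALTER ROLE postgres SET pg_net.batch_size TO 50;
-- A reconnect (or a restart of the pg_net background worker) may be needed
-- before the new value takes effect.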

steve-chavez commented 9 months ago

Curl has:

curl --rate 2/s -O https://example.com/[1-100].jpg

https://everything.curl.dev/usingcurl/transfers/request-rate#examples

I think that should be the final solution here.


Note: the above is not available on libcurl.

matt-aitken commented 9 months ago

Sorry for the slow response on this. No, we're receiving webhooks to AWS EC2 (a long-running server).

There's nothing in our server logs where we receive requests from Supabase (when inserting many rows), and the net._http_response table is full of "Timeout was reached" errors.

We (Trigger.dev) make it easy for our customers to subscribe to these Supabase database webhooks and quite a few people are hitting this problem now.

ahanusek commented 7 months ago

We're experiencing the same problem. When inserting records in sequence, all the webhooks are received. But if we insert 30 rows in parallel, none of the webhooks send and the net._http_response table is full of "Timeout was reached" errors.

I have the same issue. I'm using Google Cloud Functions v2 (Cloud Run). The table is full of "Timeout was reached" errors, but Google Cloud Monitoring doesn't show any logs.

reedshea commented 7 months ago

I'm running into this issue as well. I spun up a clean environment to reproduce, and did some testing. Supabase project qutskxfebtlgwyzsonvd. The setup is:

I ran a handful of tests (spreadsheet here, screenshot below). It looks like a pretty linear relationship between the number of requests (in my case, the number of users for which the cron job inserts items into the queue) and the webhook timeout required. By my math, the webhook needs about 33.33 ms per item to process. If only 20 ms per item is available within the configured webhook timeout, most or all of the events fail to be sent to the API Gateway.
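
To make that concrete, a back-of-the-envelope check (using the ~33 ms/item estimate above and the 2-second default timeout discussed later in the thread; a sketch, not a measurement):

-- 100 queued requests at ~33.33 ms each need ~3333 ms, which blows past a
-- 2000 ms budget, so most of the batch reports "Timeout was reached".
SELECT 100 * 33.33 AS ms_needed_for_100_items,
       2000        AS default_timeout_ms;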

Unlike the note from @lohrmann above, I did see some partial successes.

(Screenshot: test results, 2024-01-06)

steve-chavez commented 6 months ago

@reedshea Thanks for the detailed information.

@TheOtherBrian1 Shared a reproducible example: https://gist.github.com/TheOtherBrian1/e479426f0070d5f0879483cf59f85dfb

The setup is basically:

CREATE TABLE webhook_test (
    id SERIAL,
    sample text
);

CREATE OR REPLACE FUNCTION debugging_example()
RETURNS TRIGGER
AS $$
DECLARE
    -- Store payload data
    row_data_var JSONB;
    row_id INT;
BEGIN
    -- Retrieve row data and convert to JSON
    SELECT TO_JSONB(NEW.*) INTO row_data_var;
    -- Initiate HTTP POST request to URL
    SELECT
        net.http_post(
            url := 'https://postman-echo.com/post',
            -- Use row data as payload
            body := row_data_var
        ) INTO row_id;

    RAISE LOG 
        'LOGGING WEBHOOK |
        The Webhook was called after %s |
        ID = %s |
        Logging an entire row as JSON (%)', (SELECT NOW()), row_id, row_data_var;
    -- Return the row data to inspect
    RETURN NEW;
-- Handle exceptions here if needed
EXCEPTION
    WHEN others THEN
        RAISE EXCEPTION 'LOGGING WEBHOOK FAILED: An error occurred: %', SQLERRM;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER webhook
    AFTER INSERT ON webhook_test
    FOR EACH ROW
    EXECUTE FUNCTION debugging_example();

CREATE OR REPLACE FUNCTION insert_test_data(num_rows INT)
RETURNS VOID
AS $$
DECLARE
    i INT;
BEGIN
    FOR i IN 1..num_rows LOOP
        INSERT INTO webhook_test(sample) VALUES (i || 'Call');
    END LOOP;
END;
$$ LANGUAGE plpgsql;

Then execute a request:

INSERT INTO webhook_test (sample)
VALUES ('test value');

Then a batch of requests:

SELECT insert_test_data(100);

Locally, all requests succeed fast:

select count(*) from net._http_response where error_msg is null;
-[ RECORD 1 ]
count | 101

select count(*) from net._http_response where error_msg is not null;
-[ RECORD 1 ]
count | 0

But, for some reason, on Supabase cloud only the single request succeeds and the batched ones fail:

select count(*) from net._http_response where error_msg is null;
count
1
select count(*) from net._http_response where error_msg is not null;
count
100

Will debug this further.


I should add that Supabase is on 0.7.1; the latest version is 0.7.3. I've tried 0.7.1 locally and the result is the same.

steve-chavez commented 6 months ago

By increasing the default timeout (2 seconds) to 4 seconds, all the above requests succeed on Supabase cloud:

        net.http_post(
            url := 'https://postman-echo.com/post',
            -- Use row data as payload
            body := row_data_var,
            timeout_milliseconds := 4000
        ) INTO row_id;

SELECT insert_test_data(100);

select count(*) from net._http_response where error_msg is null;
count
100

Will increase the default to 5 seconds.

vickkhera commented 6 months ago

Will increase the default to 5 seconds.

Is that setting the timeout for the whole batch? It seems, based on @reedshea's graph, that the whole batch needs enough time per call. Or is he setting some other timeout?

vickkhera commented 6 months ago

I just ran into this with my staging environment. I updated a table, which caused 121 events to need to be published, all in one transaction (a single UPDATE statement). All 121 http_post requests fail with a timeout error after ~2 seconds. I bumped timeout_milliseconds to 4000, but that wasn't sufficient; I had to bump it to 6000 to allow all of them to succeed. I'm submitting task requests to Inngest, which responds really fast.

I think the right thing is to multiply the timeout by the number of requests per batch if it is not possible to make it actually be a per-request timeout. Passing the timeout on a per-request basis gives the developer the expectation that it applies to that one request, not the entire batch in which that request happens to be included by the net extension batching algorithm.

What happens if different requests get passed different timeouts? Which one is used for that batch?

TheOtherBrian1 commented 6 months ago

@vickkhera I wasn't actually aware of the cause of the bug - I only knew how to reproduce it - but you seem to have noticed that the timeout occurs when a batch (default 500) does not complete within the set timeout period. A later batch would actually succeed if its queue were smaller or the timeout were higher.

I assumed that the timeout setting was for an individual cURL request and not the entire batch. @steve-chavez, is this the intended behavior?

steve-chavez commented 5 months ago

The timeouts are per-request. Will add a test to confirm soon.

vickkhera commented 5 months ago

The timeouts are per-request. Will add a test to confirm soon.

There's no way my requests were taking 2s each. The whole batch timed out in ~2-3 seconds.

andrueandersoncs commented 5 months ago

We're running into this same issue. The net._http_response table shows the "Timeout was reached" error, yet the timed_out column remains null. This is happening with batches of 8 or fewer, and it seemed like after the initial failure, subsequent inserts into the table with the trigger failed even with a batch of 1.
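
For anyone debugging this, one quick check is to group the stored responses by error so a batch-wide failure stands out (a sketch using the columns referenced in this thread; adjust to your pg_net version):

-- How many requests succeeded vs. failed, and with which error.
SELECT error_msg, timed_out, count(*) AS requests
FROM net._http_response
GROUP BY error_msg, timed_out
ORDER BY count(*) DESC;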

ham-evans commented 4 months ago

I'm having this problem as well. Can reproduce it based on the number of new entries that get bulk added to the table. Around 10 work, more than that get a significant % of errors

All using pg_net and Supabase webhooks. Would love to know if there's some solution. Using the Lambda to then load SQS. Thinking of skipping this step now...

waylon999 commented 4 months ago

I'm having this problem as well. Can reproduce it based on the number of new entries that get bulk added to the table. Around 10 work, more than that get a significant % of errors

  • The webhook trigger is calling an AWS Lambda function (via Gateway).
  • The lambda function does not record any invocations if the number of records inserted
  • It's not the case that only some of the invocations don't come through
  • It's not the case that the invocation reaches the Lambda function and the webhook times out before the function responds
  • Other times it doesn't time out and just returns a 500 with "Internal Server Error"
  • In supabase, the _http_response table logs everything

All using pg_net and Supabase webhooks. Would love to know if there's some solution. Using the Lambda to then load SQS. Thinking of skipping this step now...

I am doing something similar, except API Gateway populates SQS which then calls a lambda function. Worked fine on small data sets but when I had to make over 100 calls I saw similar behavior as to what you (and others) describe. I set the timeout to 5 seconds and it is working now for me, although that feels very fragile. Who knows if it will work if I have a big burst of calls I need to make. The UI says 5000 ms is the max timeout, but I believe you can set it higher than that if you just create the function via SQL editor, though I'm not sure if that has any side effects.

ham-evans commented 4 months ago

I'm having this problem as well. Can reproduce it based on the number of new entries that get bulk added to the table. Around 10 work, more than that get a significant % of errors

  • The webhook trigger is calling an AWS Lambda function (via Gateway).
  • The lambda function does not record any invocations if the number of records inserted
  • It's not the case that only some of the invocations don't come through
  • It's not the case that the invocation reaches the Lambda function and the webhook times out before the function responds
  • Other times it doesn't time out and just returns a 500 with "Internal Server Error"
  • In supabase, the _http_response table logs everything

All using pg_net and Supabase webhooks. Would love to know if there's some solution. Using the Lambda to then load SQS. Thinking of skipping this step now...

I am doing something similar, except API Gateway populates SQS which then calls a lambda function. Worked fine on small data sets but when I had to make over 100 calls I saw similar behavior as to what you (and others) describe. I set the timeout to 5 seconds and it is working now for me, although that feels very fragile. Who knows if it will work if I have a big burst of calls I need to make. The UI says 5000 ms is the max timeout, but I believe you can set it higher than that if you just create the function via SQL editor, though I'm not sure if that has any side effects.

Yeah, I have it set to 5000 ms and I'm still having the problem. Agreed, it seems fragile regardless. Would love to know if doing it via the SQL Editor solves it.

Otherwise I'm going to skip the webhook altogether, as I can't risk it in production.

TheOtherBrian1 commented 4 months ago

@waylon999 You can set the timeout to greater than 5000ms. That's a limitation imposed by the UI. You can write the webhooks directly in pg_net or use the supabase_functions.http_request helper function.
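
For example, a hand-written webhook trigger with a longer timeout might look like the sketch below. The endpoint and table are placeholders, and the argument order (url, method, headers, params, timeout in milliseconds) is how dashboard-generated webhooks appear to be defined; compare against a trigger created by the UI in your own project before using it:

-- Equivalent of a dashboard webhook, but with a 10 s timeout instead of the UI's 5 s cap.
CREATE TRIGGER my_table_webhook
    AFTER INSERT ON public.my_table
    FOR EACH ROW
    EXECUTE FUNCTION supabase_functions.http_request(
        'https://example.com/api/webhook',      -- target endpoint (placeholder)
        'POST',                                 -- HTTP method
        '{"Content-Type":"application/json"}',  -- headers
        '{}',                                   -- params
        '10000'                                 -- timeout in milliseconds
    );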

ahanusek commented 4 months ago

This bug has been open for a year, and it's really frustrating that issues like this are not properly addressed by the Supabase team. They are adding other advanced features like branching, but the core remains unstable, and I don't see a significant effort to fix that. I migrated a relatively small project from Firestore/Firebase to Supabase because I believed in their marketing, and now I really regret this decision. Currently, I'm paying more for significantly worse quality.

But yes, Supabase is open-sourced. I guess I'll tell my clients - our app doesn't work properly, but at least you can read the source code.

ham-evans commented 4 months ago

@waylon999 You can set the timeout to greater than 5000ms. That's a limitation imposed by the UI. You can write the webhooks directly in pg_net or use the supabase_functions.http_request helper function.

Does this solve the problem, though? It seems fragile regardless of the timeout (especially since it's unclear whether the timeout needs to be 5 seconds or 30).

joaopegoraro commented 3 months ago

Any new updates on this?

robertbodley commented 2 months ago

Any new updates on this?

kevinlooESO commented 1 month ago

pg_net has been super inconsistent. It also tends to hang sporadically. I tried to build an event-driven app but decided to give up and rely on a cron job instead.

matiastucci commented 1 month ago

Is there any ETA on this? It's blocking me and I would need to do some workaround. Thanks!

ExplosiveJam commented 1 month ago

Experiencing the same issue. I increased the timeout to 30 s and all requests ended up with either "SSL connect error" or "Error in the HTTP2 framing layer". My table stores the raw and parsed response from OpenAI. I'm going to try rewriting the trigger with plain net.http_post instead of supabase_functions.http_request; the payload is probably too big, and I can shrink it by constructing it manually (I don't need the OLD record). Another idea is to write a statement-level trigger, so there is one request per batch update.
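
For reference, a minimal sketch of that statement-level trigger idea, reusing the webhook_test table from the reproduction above (the endpoint and names are hypothetical). A bulk insert then produces a single request whose body aggregates all inserted rows:

CREATE OR REPLACE FUNCTION notify_batch()
RETURNS TRIGGER
AS $$
BEGIN
    -- One request per statement; the transition table new_rows holds every inserted row.
    PERFORM net.http_post(
        url := 'https://example.com/webhook',  -- placeholder endpoint
        body := (SELECT coalesce(jsonb_agg(to_jsonb(n.*)), '[]'::jsonb) FROM new_rows n),
        timeout_milliseconds := 10000
    );
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER webhook_batch
    AFTER INSERT ON webhook_test
    REFERENCING NEW TABLE AS new_rows
    FOR EACH STATEMENT
    EXECUTE FUNCTION notify_batch();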

uncvrd commented 3 weeks ago

I'm unfortunately dealing with this as well. For me it's when there are 100 concurrent events (like bulk deleting 100 rows). I'm curious what alternatives you all are using for listening to postgres events and performing async actions? I saw someone mention QStash but not sure how that plugs in to the event flow since you would still need to make a pg_net request to your redis backend based on the event...right?

EDIT:

Has anyone implemented this as an alternative?

https://supabase.com/docs/guides/realtime/subscribing-to-database-changes

Theoretically I could just subscribe to the same tables in my backend and therefore only need to use 1 (or however many servers I deploy) persistent realtime connection for the same features? I suppose the issue would be concurrency since my two backend servers would receive the same event notification....

devdras commented 3 weeks ago

I'm unfortunately dealing with this as well. For me it's when there are 100 concurrent events (like bulk deleting 100 rows). I'm curious what alternatives you all are using for listening to postgres events and performing async actions? I saw someone mention QStash but not sure how that plugs in to the event flow since you would still need to make a pg_net request to your redis backend based on the event...right?

EDIT:

Has anyone implemented this as an alternative?

https://supabase.com/docs/guides/realtime/subscribing-to-database-changes

Theoretically I could just subscribe to the same tables in my backend and therefore only need to use 1 (or however many servers I deploy) persistent realtime connection for the same features? I suppose the issue would be concurrency since my two backend servers would receive the same event notification....

I just used a queue. I've used Upstash, BullMQ and Inngest. All are good. Whenever you update/insert the piece of data that you wanted to trigger, send an event to your queue. Your queue then does the thing you want to do. Basically my solution was to not use Supabase Webhooks. Imo they're useless without some kind of queue functionality backing them up.

uncvrd commented 3 weeks ago

@devdras Thanks for the quick response! I guess the issue that I'm having with that approach is that I need to listen for events from records that are deleted based on cascade events (say you have deleted an entity that has a fkey to a different table with onDelete CASCADE). Having to keep track of when a secondary entity is also removed when removing the original entity quickly becomes really messy.

That's why listening to the database row level events has been super powerful. A single source of truth that sends updates when a row is removed (whether directly or indirectly)

My guess is that I would need to use a KV store like Upstash or NATS to create a "lock" on an entity (based on the table and id), where you can only write to the queue if that key does not exist in the KV store. Maybe with the key's TTL set to a couple of minutes so it's removed from memory.

Maybe that's what you're doing and I'm only now understanding it, or not, haha.

Sorry to slightly derail the goal of this ticket, but I hope it helps people arriving at the same issue.

devdras commented 3 weeks ago

@uncvrd I'd probably use a postgres function for that 🙂

vickkhera commented 3 weeks ago

I just used a queue. I've used Upstash, BullMQ and Inngest. All are good. Whenever you update/insert the piece of data that you wanted to trigger, send an event to your queue. Your queue then does the thing you want to do. Basically my solution was to not use Supabase Webhooks. Imo they're useless without some kind of queue functionality backing them up.

My problem comes from reprocessing my internal queue which feeds Inngest... If I notice a request did not get accepted by Inngest, I retry it, but this is done with a single SQL statement, more or less SELECT resubmit_to_innngest(data) FROM queue_table WHERE .... When there are too many, pg_net seems to just drop them, so I have to retry yet again!

At some point pg_net has to be more reliable so I don't have to break this up into submitting no more than some small number at a time.

uncvrd commented 3 weeks ago

@devdras Hmm, but then we're back to using pg_net within the Postgres function so I can make calls to external APIs (for example, I need to delete Typesense search documents, which are not part of the Postgres database, when certain rows are deleted directly or indirectly).

Anyway, I don't want to derail the ticket any further, but I'll keep searching for a solution.

GaryAustin1 commented 2 weeks ago

I was just testing using pg_net to call the Realtime broadcast endpoint. A bulk insert of 20 rows into a table works; 40 rows does not, with a timeout for every request. Simple SQL:

insert into cats (count)
select * from generate_series (2300,2350)

The trigger function calls the Realtime broadcast endpoint.

There is a second issue: it appears pg_net does not guarantee ordering. I'm assuming it sends the HTTP requests asynchronously (versus just being async back to the SQL caller). Because of this, rapidly sent HTTP requests can arrive out of order at their destination.