supabase / pg_net

A PostgreSQL extension that enables asynchronous (non-blocking) HTTP/HTTPS requests with SQL
https://supabase.github.io/pg_net
Apache License 2.0

Add retries #110

Open steve-chavez opened 11 months ago

steve-chavez commented 11 months ago

Hey there! I'm looking for a feature that doesn't seem to be present in Supabase regarding webhooks. Sometimes my API gets backed up or goes down for a few seconds, and webhooks may fail. Some functions rely on webhooks to work and provide accurate data to customers, so, like Stripe does, I would like an automated retry system.

If a webhook fails, it would automatically retry after a delay, say 30 minutes. If that fails, it retries an hour later, and so on, up to perhaps 5 retries total. That way, no webhook is missed.

Originally posted by @Wamy-Dev in https://github.com/supabase/supabase/discussions/17664

steve-chavez commented 11 months ago

This is already possible with some work, as demonstrated in https://github.com/supabase/pg_net#retrying-failed-requests, but it would be better to have retries built in.

The curl CLI supports retries (https://everything.curl.dev/usingcurl/downloads/retry), but libcurl doesn't provide this functionality, so we would need to implement our own.
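For reference, a minimal sketch of the README's pattern. The webhook_requests bookkeeping table and its columns are hypothetical (net._http_response does not store the original URL or body, so you have to keep them yourself), and the exact response columns (status_code, error_msg, timed_out) should be checked against your pg_net version:

```sql
-- Hypothetical bookkeeping table: stores the data needed to re-send a request.
create table if not exists webhook_requests (
  id         bigint generated always as identity primary key,
  request_id bigint,                  -- id returned by net.http_post
  url        text  not null,
  body       jsonb not null default '{}',
  attempts   int   not null default 1
);

-- Re-fire requests whose responses failed, capped at 5 attempts.
with failed as (
  select w.id, w.url, w.body
  from webhook_requests w
  join net._http_response r on r.id = w.request_id
  where (r.error_msg is not null or r.status_code >= 500 or r.timed_out)
    and w.attempts < 5
)
update webhook_requests w
set request_id = net.http_post(url := f.url, body := f.body),
    attempts   = w.attempts + 1
from failed f
where w.id = f.id;
```

Run periodically (e.g. from pg_cron), this approximates retries, but it's exactly the kind of boilerplate that built-in support would remove.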

TheOtherBrian1 commented 11 months ago

In Issue 62, you mentioned the following:

> An alternative [to error recovery] could be logging the failed http requests. Then the user would have to search the pg logs.

Originally, the plan was to provide users with the option to log error messages to a designated logger for handling:

> ... We [can] optionally log something based on the status code when the response is handled?
>
> We should have a config to turn this on/off, maybe named like pg_net.log_http_errors (off by default). (Issue 49)

However, this feature has not been implemented. The config variable for enabling logging is absent in the worker.c codebase. Nonetheless, there may be potential benefits in expanding upon this concept.
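If the GUC existed, enabling it would presumably look like any other extension setting. This is purely hypothetical usage, since the variable was never implemented:

```sql
-- Hypothetical: pg_net.log_http_errors was proposed in issue 49 but never implemented.
alter system set pg_net.log_http_errors = 'on';
select pg_reload_conf();
```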

Rather than recording request-failure messages in a log file, the config variable could instruct the extension to automatically re-insert all dropped requests into the net.http_request_queue table. Unlike the existing solution, this approach would require adding just a single variable and replacing one of the error logs with the C equivalent of an INSERT statement.

It is important to emphasize that this suggestion would only cover requests that failed due to an extension-level error. Requests that return 4xx or 5xx HTTP status codes are already logged in net._http_response and should be managed by users: such failures may stem from human error in the request's endpoint formatting, and blindly retrying them could loop forever. For users who want to retry those specific failed requests, the existing README suggestion provides a safer mechanism, albeit with more overhead.
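For completeness, those response-level failures can already be inspected with a plain query; the column names below match the documented net._http_response schema but may vary across pg_net versions:

```sql
-- Recent failures recorded by pg_net: HTTP error statuses, timeouts,
-- and transport-level errors surfaced in error_msg.
select id, status_code, timed_out, error_msg, created
from net._http_response
where status_code >= 400
   or timed_out
   or error_msg is not null
order by created desc
limit 20;
```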

steve-chavez commented 11 months ago

@TheOtherBrian1 Great points.

> Requests that return 4xx or 5xx HTTP status codes are already logged in net._http_response and should be managed by users: such failures may stem from human error in the request's endpoint formatting, and blindly retrying them could loop forever.

True. I think the only safe way to auto-retry would be by following the Retry-After header; however, I don't think all services implement it.
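As a sketch, honoring the header would mean reading it from the stored response before scheduling a retry. This assumes net._http_response.headers is jsonb with lowercased keys, and it only handles the delta-seconds form of Retry-After, not the HTTP-date form:

```sql
-- Throttled/unavailable responses that advertise a numeric retry delay.
select id,
       (headers ->> 'retry-after')::int as retry_after_seconds
from net._http_response
where status_code in (429, 503)
  and headers ->> 'retry-after' ~ '^\d+$';
```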

> We should have a config to turn this on/off, maybe named like pg_net.log_http_errors (off by default). -https://github.com/supabase/pg_net/issues/49#issuecomment-948844080
>
> However, this feature has not been implemented.

Agree, that would be a higher priority than this idea.

> Rather than recording request-failure messages in a log file, the config variable could instruct the extension to automatically re-insert all dropped requests into the net.http_request_queue table.
>
> It is important to emphasize that this suggestion would only cover requests that failed due to an extension-level error.

Hm, do you have examples of extension-related errors? I don't recall any. IIUC, it's an internal error rather than, say, a 400 Bad Request.

TheOtherBrian1 commented 11 months ago

> Hm, do you have examples of extension-related errors?

I haven't experienced any yet, but I've thought about the potential for them to happen.

steve-chavez commented 11 months ago

> Such failures may stem from human error in the request's endpoint formatting, and blindly retrying them could loop forever.

> we run the retries on an exponential scale that tries with increasing periods, which ensures resilience for up to 24 hours in case of connectivity downtime due to updates, issues, etc.

^ From https://github.com/orgs/supabase/discussions/17664#discussioncomment-7208243

We could cap the retries as suggested above; that way we avoid endless loops.
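For illustration, a capped schedule that starts at 30 minutes and doubles each attempt is easy to express; the starting delay and cap below are illustrative, not anything pg_net implements today:

```sql
-- Delays: 30 min, 1 h, 2 h, 4 h, 8 h (15.5 h cumulative over five attempts).
select attempt,
       interval '30 minutes' * power(2, attempt - 1) as delay
from generate_series(1, 5) as attempt;
```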

lookevink commented 4 weeks ago

Currently using Supabase to fire off webhooks to Vercel endpoints.

Vercel has been running into issues with Undici when there's CPU contention, making retries with exponential backoff almost mandatory for event-driven applications.

Wasn't quite sure how to implement the official retry solution: https://github.com/supabase/pg_net#retrying-failed-requests

Instead I used QStash; so far it's working like a charm.