gocodebox / lifterlms-rest

LifterLMS REST API Feature Plugin
6 stars 7 forks source link

Setup webhook automatic retries #302

Open thomasplevy opened 1 year ago

thomasplevy commented 1 year ago

Currently our webhooks are set to track failed attempts and automatically disable after 5 failed delivery attempts:

https://github.com/gocodebox/lifterlms-rest/blob/9215df26a9a589d5a5488f019339dd83c1b3c15d/includes/abstracts/class-llms-rest-webhook-data.php#L289-L321

This would require 5 separate deliveries for 5 events though. EG: if it's triggered during student enrollment, the webhook will disable after the 5th student fails enrollment.

This is problematic if an external application relies on webhook deliveries to maintain data integrity in the external application.

We should automatically reschedule a failed webhook delivery on a delay (5-15 minutes? Possibly it should track individual event failures and increase the delay: failure 1, wait 5 minutes, resend; failure 2 wait 30 minutes and resend; etc...)

After failure, we can check whether the webhook should be disabled and, if not, we can reschedule it using the same args passed into the deliver() method: https://github.com/gocodebox/lifterlms-rest/blob/6bb9b04e2d7702fda708cb9b171c166b2006a099/includes/models/class-llms-rest-webhook.php#L39

In order to best track failures for the specific event we could probably just add an extra argument to the args array:

$args['retry_count'] ?? 0;
++$args['retry_count;
kirkroyster commented 1 year ago

Our concerns would be around network and/or server issues on our end. Network would be the primary, server would normally just be a scheduled restart or an OS update. Network could be a power outage which could be an extended outage which could take hours to restore << which suggests that we consider cloud hosting, understood. >> If it could extend out to 12 or 24 hours max for the retry, we should be good in almost any case. Simplistic store and forward, I know. But a significant impact on pseudo-guaranteed delivery and associated data integrity. Interested in your thoughts regarding what the practical maximum would be.

thomasplevy commented 1 year ago

@kirkroyster

Interested in your thoughts regarding what the practical maximum would be.

That's really going to depend on the requirements of your application. I think a "safe" method for the default with a simplistic option to reconfigure those defaults on a per-application need using hooks and filters is the best bet.

LifterLMS can't solve every potential need or requirement so doing something generally safe/appropriate as the default with the ability to customize when the defaults don't make sense is our approach to just about everything.

My initial thoughts are to follow a pattern similar to our retry rules for ecomm subscriptions in the core: https://github.com/gocodebox/lifterlms/blob/d3505995cfefe9ccda36155dc7a79d182aba5ec2/includes/models/model.llms.order.php#L998-L1019

Probably doing a shorter initial retry though basing off the assumption that momentary network issues could be the initial cause for the failure.

Attempt Retry 1 (+15 minutes) Retry 2 (+60 minutes) Retry 3 (+6 hours) Retry 4 (+12 hours)

After the 4th retry (the 5th actual attempt) the webhook would get disabled.

Maybe this is too short for a default. I'd need more feedback, I'm just talking right now, not committing code.

Of course this could be customized with a filter.