keygen-sh / keygen-api

Keygen is a fair source software licensing and distribution API built with Ruby on Rails. For developers, by developers.
https://keygen.sh

Slow webhook event workers should not block other workers #771

Open ezekg opened 6 months ago

ezekg commented 6 months ago

We had an incident where a customer was sending upwards of 400 requests per second for a couple of hours, resulting in a huge backlog of webhook events. That alone was fine, but their webhook endpoint, which was subscribed to all events, started to fail. All jobs slowed to a crawl while workers attempted to deliver hundreds of thousands of webhook events, each with a 15 second timeout. This caused a severe backup in job queuing, which ultimately led to our Redis instance running out of memory until we were able to vertically scale it and clear the backlog.

Webhook events should not block other jobs from running. Ideally, we'd rate limit webhook events per account, so that an account only ever delays its own events, not everyone's. To prevent the OOM issue, we should look into disabling webhook endpoints whose error rate exceeds a certain threshold, based on time and volume.
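To make the two ideas above concrete, here's a rough sketch in plain Ruby. It is not Keygen's implementation: the class names `AccountThrottle` and `EndpointHealth`, and all of the limits, are hypothetical, and state is kept in process memory for illustration only. A real version would presumably keep its counters in Redis or the database so that all workers share them.

```ruby
# Hypothetical sketch: per-account sliding-window throttle for webhook deliveries.
class AccountThrottle
  def initialize(limit:, period:)
    @limit = limit   # max deliveries per account within the period
    @period = period # window length in seconds
    @sent = Hash.new { |hash, key| hash[key] = [] }
  end

  # Returns true if the account may deliver now. On false, the caller would
  # reschedule the job, so only that account's events are delayed.
  def allow?(account_id, now: Time.now.to_f)
    times = @sent[account_id]
    times.shift while times.any? && times.first <= now - @period
    return false if times.size >= @limit

    times << now
    true
  end
end

# Hypothetical sketch: decides when an endpoint should be disabled, based on
# time (a rolling window) and volume (a minimum number of attempts).
class EndpointHealth
  def initialize(window:, min_volume:, max_error_rate:)
    @window = window
    @min_volume = min_volume
    @max_error_rate = max_error_rate
    @outcomes = [] # [timestamp, success] pairs, oldest first
  end

  def record(success, now: Time.now.to_f)
    @outcomes << [now, success]
    prune(now)
  end

  def disable?(now: Time.now.to_f)
    prune(now)
    return false if @outcomes.size < @min_volume

    failures = @outcomes.count { |_, ok| !ok }
    failures.fdiv(@outcomes.size) > @max_error_rate
  end

  private

  # Drop outcomes that have fallen out of the rolling window.
  def prune(now)
    cutoff = now - @window
    @outcomes.shift while @outcomes.any? && @outcomes.first[0] < cutoff
  end
end
```

The minimum volume keeps a handful of failures on a low-traffic endpoint from disabling it, while the rolling window lets an endpoint that has recovered stop counting against itself.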

ezekg commented 6 months ago

The main culprit here is the fact that we're storing webhook event payloads in the worker's argument list. We should store the payload in the database instead of in the args to reduce Redis memory usage.
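A minimal sketch of that change, assuming the job carries only a record ID and the worker loads the payload when it runs. Everything here is a stand-in written for illustration: `WebhookEventStore` plays the role of a database table, `JobQueue` plays the role of the Redis-backed queue, and `enqueue_webhook` and `perform_next` are hypothetical helpers, not Keygen's actual worker code.

```ruby
require "json"
require "securerandom"

# Stand-in for a database table (assumption: a Rails model in the real app).
class WebhookEventStore
  def initialize
    @rows = {}
  end

  def create(endpoint_url:, payload:)
    id = SecureRandom.uuid
    @rows[id] = { endpoint_url: endpoint_url, payload: payload }
    id
  end

  def find(id)
    @rows.fetch(id)
  end
end

# Stand-in for the Redis-backed job queue: it only holds serialized args.
class JobQueue
  attr_reader :jobs

  def initialize
    @jobs = []
  end

  def push(*args)
    @jobs << JSON.generate(args)
  end

  def pop
    JSON.parse(@jobs.shift)
  end
end

STORE = WebhookEventStore.new
QUEUE = JobQueue.new

# Persist the payload first, then enqueue only its ID.
def enqueue_webhook(endpoint_url, payload)
  id = STORE.create(endpoint_url: endpoint_url, payload: payload)
  QUEUE.push(id)
  id
end

# Load the payload at execution time; a real worker would then POST
# event[:payload] to event[:endpoint_url].
def perform_next
  id = QUEUE.pop.first
  STORE.find(id)
end
```

With this shape, each queued job costs roughly the size of a UUID no matter how large the payload is, so a backlog of hundreds of thousands of events puts far less pressure on Redis memory.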