readthedocs / readthedocs.org

The source code that powers readthedocs.org
https://readthedocs.org/
MIT License
7.99k stars 3.58k forks source link

Build: disable webhooks after N failed builds in a row #9690

Open humitos opened 1 year ago

humitos commented 1 year ago

Analyzing the data we have, I've found that we have a bunch of projects that have been triggering builds where all of them have failed. During months. This consumes resources in our side, but also degrade the UX for valid projects since they have to wait for a build that we already know it's gonna fail, before their project's build is taken.

So, to reduce projects in this scenario, we are thinking about "auto-disable webhooks on projects that have N failed builds in a row" and communicate their maintainers. To re-enable the webhook, they would have to go to the project's admin page and re-activate/re-configure/re-add the webhook.

This is the metabase question for future reference: https://ethicalads.metabaseapp.com/question/309-project-with-all-failed-builds-in-the-last-3-months. It shows the amount of builds and minutes consumed per project.

agjohnson commented 1 year ago

Also, it would be great to have a maintainers site notification specific to this that goes out, so we can re-use this pattern when soft-disabling a project. A hard disable (setting Project.skip = True) can be reserved for abusive/spam projects, but that "notification" pattern is quite broken for normal users.

Or perhaps I'm describing two potential notifications -- a specific notification and a generic notification?

humitos commented 1 year ago

Yeah, I'm fine with better notifications. Maybe we want to make that part of the work on #9279 and #3399

We need to probably set two different thresholds here:

  1. N consecutive builds on branch/tags (lower, maybe 25) and disable the webhook completely
  2. N consecutive builds on PR builds (upper, maybe 50) and only disable PR building

I'm trying to say that "builds failing on PRs are more acceptable than failing on branch/tags and should be handled differently"

agjohnson commented 1 year ago

Yeah, great point on PR builds. Maybe we are strictly only concerned with failed active versions?

I agree we can rethink notifications in a larger PR. If we add something here it will be using the new notification pattern anyways. Some of the old patterns can eventually be culled, and just adding a new mechanism for disabling projects here would be enough to start that process.

humitos commented 1 year ago

@agjohnson

Maybe we are strictly only concerned with failed active versions?

It makes sense to me, yeah.

humitos commented 1 year ago

I'd like to think about prioritizing this feature if possible because this will help us a lot to know "how many active/valid non-spam projects do we have in our platform" in an easier way. This data will be useful when migrating/deprecating things as we are doing with "building without a config file" since we will know "how many of the active/valid projects have already migrated".

Many non-spam projects will not migrate to the new configuration file just because they don't need it or have to. Mainly projects that are archived or where created just for testing/development/educational purposes, for example. I assume we have a lot of them.

The work required that I see here is:

agjohnson commented 1 year ago

Because taking automated actions on projects has been hard to get right -- spam banning, build retry -- maybe we should start fairly conservatively here. I think I'm comfortable saying that any project who has a default version that has 25 consecutive failed builds is probably not monitoring their project.

We also don't have to take any automated action yet, we could start with just a notification or drip campaign.

At least disabling the webhook is non-destructive though, I think that is probably the key for us. Deleting the webhook is more destructive and I'd want to make sure we're not making more work for us here.

In the end, this is mostly a cost saving measure for instance cost, but that is not a strong concern right now. I'd happily increase our instance costs and pay for more instances if it frees up our time or makes for less headaches for us.

humitos commented 6 months ago

We also don't have to take any automated action yet, we could start with just a notification or drip campaign.

We can trigger a notification and attach it to the Project using the new system now 💯

At least disabling the webhook is non-destructive though, I think that is probably the key for us. Deleting the webhook is more destructive and I'd want to make sure we're not making more work for us here.

Sounds good to me. I think adding Integration.enabled(bool, default=True) field here and expose it in the form would be enough to implement this feature. This allows us to mark it as enabled=False via code and let users to re-enable it via the WebUI using the form.

In the end, this is mostly a cost saving measure for instance cost, but that is not a strong concern right now. I'd happily increase our instance costs and pay for more instances if it frees up our time or makes for less headaches for us.

This is true. However, "disabling projects their authors are not monitoring/using" seems like a good way to remove them when getting metrics from our database --which is useful to make decisions based on data.