go-vela / community

Community Information for Vela (Target's official Pipeline Automation Framework)
https://go-vela.github.io/docs/
Apache License 2.0

duplicate hook numbers sending concurrent webhooks #997

Open jbrockopp opened 2 weeks ago

jbrockopp commented 2 weeks ago

Description

There is a known issue where rapid-fire webhook deliveries can cause the server to produce duplicate build numbers:

duplicate key value violates unique constraint "builds_repo_id_number_key"

To minimize the impact of this issue, we introduced some retry logic in the webhook workflow:

https://github.com/go-vela/community/issues/213#issuecomment-781655122

NOTE: The retry logic did not eliminate the behavior completely, but it did reduce how often it occurs.

However, recently we started noticing a slightly different flavor of this error specific to hooks:

{"error":"unable to create webhook OrgName/RepoName/BuildNumber: ERROR: duplicate key value violates unique constraint \"hooks_repo_id_number_key\" (SQLSTATE 23505)"}

We believe this is the same kind of underlying problem: concurrent webhooks for a repository are processed and inserted into the database with the same number. After some exploration, we didn't see any evidence that we have the same retry logic in place for the hooks table like the one we set up for the builds table:

https://github.com/go-vela/server/blob/main/api/webhook/post.go#L261-L271
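The suspected race can be sketched as follows. This is illustrative Go, not Vela's actual code: `simulateRace` deterministically interleaves the reads two concurrent handlers would perform, showing how both compute the next hook number as count+1 before either insert commits, so the second insert trips the `hooks_repo_id_number_key` unique constraint.

```go
package main

import "fmt"

// simulateRace models two concurrent webhook handlers that each read the
// current hook count for a repo and pick count+1 as the next hook number
// before either insert commits. Both pick the same number, so the second
// insert would violate the hooks_repo_id_number_key unique constraint.
func simulateRace(existing map[int]bool) (nextA, nextB int, conflict bool) {
	nextA = len(existing) + 1  // handler A reads count, picks the next number
	nextB = len(existing) + 1  // handler B reads the same count before A commits
	existing[nextA] = true     // A's insert succeeds
	conflict = existing[nextB] // B's insert hits a number that is now taken
	return nextA, nextB, conflict
}

func main() {
	hooks := map[int]bool{1: true, 2: true} // existing hook numbers for a repo
	a, b, conflict := simulateRace(hooks)
	if conflict {
		fmt.Printf("handlers picked %d and %d: duplicate key value violates unique constraint\n", a, b)
	}
}
```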

We should add retry logic for the hooks table that mirrors what we did for the builds table.
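A minimal sketch of that fix, assuming the same pattern as the builds-table retry logic (`createHook`, the map-backed "table", and the retry bound are all hypothetical stand-ins, not Vela's real database layer): on a duplicate-key error, bump the hook number and retry a bounded number of times instead of failing the webhook outright.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// createHook stands in for the database insert; it fails with a
// duplicate-key error when the hook number is already taken.
func createHook(taken map[int]bool, number int) error {
	if taken[number] {
		return errors.New(`duplicate key value violates unique constraint "hooks_repo_id_number_key"`)
	}
	taken[number] = true
	return nil
}

// createHookWithRetry retries the insert with the next number whenever the
// failure is the hooks_repo_id_number_key constraint, up to maxRetries times.
func createHookWithRetry(taken map[int]bool, number, maxRetries int) (int, error) {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = createHook(taken, number); err == nil {
			return number, nil
		}
		if !strings.Contains(err.Error(), "hooks_repo_id_number_key") {
			return 0, err // unrelated failure: don't retry
		}
		number++ // a concurrent delivery took this number; try the next one
	}
	return 0, fmt.Errorf("exhausted %d retries: %w", maxRetries, err)
}

func main() {
	taken := map[int]bool{1: true, 2: true, 3: true} // 3 was just taken by a concurrent delivery
	n, err := createHookWithRetry(taken, 3, 3)       // a stale read said 3 was next
	fmt.Println(n, err)
}
```

As with the builds-table fix, this narrows the window but does not remove it: two retrying handlers can still collide, which is why a queue-based model is discussed below in the thread.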

Value

Useful Information

  1. What is the output of vela --version?

v0.23.4

  2. What operating system is being used?

https://www.flatcar.org/

  3. Any other important details?

The reports we've received thus far have been few, so we're still gathering additional data. They consistently describe the same symptom: roughly 10% of the time, when a pull request is merged, the resulting push is not processed by Vela. After some exploration, we've identified that the affected repositories have automatic branch deletion configured:

https://docs.github.com/en/enterprise-server@3.13/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-the-automatic-deletion-of-branches

On the GitHub side, we can confirm that this behavior causes two push events to be fired at exactly the same time.

Also, these reports are from a more recent time window of "about the last month or so". Unfortunately, we can't verify exactly when this started because GitHub doesn't retain webhook deliveries that far back.

jbrockopp commented 2 weeks ago

@Lxx-c @NaiveShiki we've deleted both of your comments as they appear to be bot-driven activity rather than comments from real people.

However, if we misunderstood the intention of your comments, please feel free to contribute back to the discussion.

ecrupper commented 2 weeks ago

To add some additional context, I think the fact that this is a recent development points to this change as a potential culprit. Previously, delete pushes were promptly discarded and now we are creating a hook record for them.

I do wonder if this is yet another push (pun intended) toward a job queue model for handling webhook payloads... I know this is what GitHub recommends:

In order to respond in a timely manner, you may want to set up a queue to process webhook payloads asynchronously. Your server can respond when it receives the webhook, and then process the payload in the background without blocking future webhook deliveries.
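GitHub's recommendation can be sketched in Go as follows. This is a rough illustration, not Vela's design: the `payload` type, worker count, and queue size are assumptions. The webhook endpoint would only enqueue the payload and respond immediately (e.g. 202 Accepted), while a background worker drains the queue; serializing processing this way removes the window where two deliveries race on the next hook number.

```go
package main

import "fmt"

// payload is a placeholder for a parsed webhook delivery.
type payload struct{ event, repo string }

// drain processes queued payloads in order until the channel is closed,
// returning a record of what was processed.
func drain(queue <-chan payload) []string {
	var processed []string
	for p := range queue {
		processed = append(processed, p.event+" "+p.repo)
	}
	return processed
}

func main() {
	queue := make(chan payload, 100) // bounded buffer absorbs delivery bursts
	done := make(chan []string)
	go func() { done <- drain(queue) }() // single background worker

	// an HTTP handler would enqueue and return immediately, without blocking
	// future webhook deliveries on database work
	queue <- payload{"push", "OrgName/RepoName"}
	queue <- payload{"push", "OrgName/RepoName"} // second event from branch deletion
	close(queue)

	fmt.Println(<-done)
}
```

A production version would need per-repo ordering guarantees and durable storage for the queue, but even this shape decouples webhook acknowledgment from payload processing, which is the core of GitHub's advice.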

jbrockopp commented 1 week ago

@ecrupper thanks for that link and extra context!

We received additional feedback suggesting this behavior has been occurring for longer than we thought. From what we've gathered, repos that had already opted in to GitHub's automatic branch deletion feature have been experiencing this issue for quite some time ("months"). The "about a month or so" window from the most recent report was simply the first we had heard of it; other people chimed in with their own experiences afterward.

Given all of this, we agree that the change you shared appears to be the most likely culprit: it was introduced in the v0.23 release of the Vela server, and the timelines we've gathered from these reports match when we performed our internal upgrade to that version. We briefly discussed this internally and agree that it further elevates the need for a different method of processing webhooks; following GitHub's recommendation sounds like the right path forward.

In the meantime, we are planning to add retry logic that follows the same pattern as the build number fix from the past. We acknowledge this is a "bandaid fix", but our reasoning is that it's a lower-effort change that should work around the issue and ideally prevent it entirely. That should give us sufficient time to build a better, queue-based model for processing webhooks, as GitHub recommends.

jbrockopp commented 1 week ago

A PR has been added to mitigate the frequency of this issue:

However, we aren't closing this ticket because the error can still occur, similar to:

https://github.com/go-vela/community/issues/213