palantir / policy-bot

A GitHub App that enforces approval policies on pull requests
Apache License 2.0
755 stars 103 forks source link

Github hook error with some PRs update #135

Open clakech opened 4 years ago

clakech commented 4 years ago

Hi,

Policy bot is working fine since more than 6 months and we face an issue when we create pull requests (~200/250 files) from a release branch to master branch: "panic: runtime error: invalid memory address or nil pointer dereference"

The stack trace is really strange and we do not understand what is going on. We suspect an issue with the githubapp hook that send logs/metrics when we try to get datas from github membership config.

For the same hook message, sometimes it works fine, sometime it fails. (we can retry previous event hook form github UI and check roundtrip result)

Here is one stacktrace:

2019-11-12T15:12:21.654223183Z |DEBUG| rule evaluation resulted in pending:"0/1 approvals required" github_delivery_id=cc3f8100-055e-11ea-8437-c05dd1c52794 github_event_type=pull_request github_installation_id=1007339 github_pr_num=1183 github_repository_name=yann-platform github_repository_owner=yannou github_sha=985b2acc17e97f137119ff4a2382e5a04ffd5163 rid=bn5cnj2ara13u0ug6k50 rule="PMs must approve PR to master branch"
2019-11-12T15:12:21.654289960Z |INFO| Setting "policy-bot: master" status on 985b2acc17e97f137119ff4a2382e5a04ffd5163 to pending: 0/1 rules approved github_delivery_id=cc3f8100-055e-11ea-8437-c05dd1c52794 github_event_type=pull_request github_installation_id=1007339 github_pr_num=1183 github_repository_name=yann-platform github_repository_owner=yannou github_sha=985b2acc17e97f137119ff4a2382e5a04ffd5163 rid=bn5cnj2ara13u0ug6k50
2019-11-12T15:12:22.184688541Z |ERROR| Unhandled error while serving route error="panic: runtime error: invalid memory address or nil pointer dereference\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:522\nruntime.panicmem\n\t/usr/local/go/src/runtime/panic.go:82\nruntime.sigpanic\n\t/usr/local/go/src/runtime/signal_unix.go:390\ngithub.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp.cacheControl.func1.1\n\t/go/src/github.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp/client_creator.go:344\ngithub.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp.roundTripperFunc.RoundTrip\n\t/go/src/github.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp/middleware.go:151\ngithub.com/palantir/policy-bot/vendor/github.com/gregjones/httpcache.(*Transport).RoundTrip\n\t/go/src/github.com/palantir/policy-bot/vendor/github.com/gregjones/httpcache/httpcache.go:214\ngithub.com/palantir/policy-bot/vendor/github.com/bradleyfalzon/ghinstallation.(*Transport).RoundTrip\n\t/go/src/github.com/palantir/policy-bot/vendor/github.com/bradleyfalzon/ghinstallation/transport.go:94\ngithub.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp.ClientMetrics.func1.1\n\t/go/src/github.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp/middleware.go:64\ngithub.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp.roundTripperFunc.RoundTrip\n\t/go/src/github.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp/middleware.go:151\ngithub.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp.ClientLogging.func1.1\n\t/go/src/github.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp/middleware.go:121\ngithub.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp.roundTripperFunc.RoundTrip\n\t/go/src/github.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp/middleware.go:151\ngithub.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp.setInstallationID.func1.1\n\t/go/src/github.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp/client_creator.go:369\ngithub.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp.roundTripperFunc.RoundTrip\n\t/go/src/github.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp/middleware.go:151\nnet/http.send\n\t/usr/local/go/src/net/http/client.go:250\nnet/http.(*Client).send\n\t/usr/local/go/src/net/http/client.go:174\nnet/http.(*Client).do\n\t/usr/local/go/src/net/http/client.go:641\nnet/http.(*Client).Do\n\t/usr/local/go/src/net/http/client.go:509\ngithub.com/palantir/policy-bot/vendor/github.com/google/go-github/github.(*Client).Do\n\t/go/src/github.com/palantir/policy-bot/vendor/github.com/google/go-github/github/github.go:513\ngithub.com/palantir/policy-bot/vendor/github.com/google/go-github/github.(*RepositoriesService).CreateStatus\n\t/go/src/github.com/palantir/policy-bot/vendor/github.com/google/go-github/github/repos_statuses.go:81\ngithub.com/palantir/policy-bot/server/handler.(*Base).postGitHubRepoStatus\n\t/go/src/github.com/palantir/policy-bot/server/handler/base.go:115\ngithub.com/palantir/policy-bot/server/handler.(*Base).PostStatus\n\t/go/src/github.com/palantir/policy-bot/server/handler/base.go:98\ngithub.com/palantir/policy-bot/server/handler.(*Base).EvaluateFetchedConfig\n\t/go/src/github.com/palantir/policy-bot/server/handler/base.go:199\ngithub.com/palantir/policy-bot/server/handler.(*Base).Evaluate\n\t/go/src/github.com/palantir/policy-bot/server/handler/base.go:150\ngithub.com/palantir/policy-bot/server/handler.(*PullRequest).Handle\n\t/go/src/github.com/palantir/policy-bot/server/handler/pull_request.go:52\ngithub.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp.(*eventDispatcher).ServeHTTP\n\t/go/src/github.com/palantir/policy-bot/vendor/github.com/palantir/go-githubapp/githubapp/dispatcher.go:178\ngithub.com/palantir/policy-bot/vendor/goji%2eio.dispatch.ServeHTTP\n\t/go/src/github.com/palantir/policy-bot/vendor/goji.io/dispatch.go:17\ngithub.com/palantir/policy-bot/vendor/github.com/bluekeyes/hatpear.Recover.func1.1\n\t/go/src/github.com/palantir/policy-bot/vendor/github.com/bluekeyes/hatpear/hatpear.go:107\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:1995\ngithub.com/palantir/policy-bot/vendor/github.com/bluekeyes/hatpear.Catch.func1.1\n\t/go/src/github.com/palantir/policy-bot/vendor/github.com/bluekeyes/hatpear/hatpear.go:60\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:1995\ngithub.com/palantir/policy-bot/vendor/github.com/rs/zerolog/hlog.AccessHandler.func1.1\n\t/go/src/github.com/palantir/policy-bot/vendor/github.com/rs/zerolog/hlog/hlog.go:180" method=POST path=/api/github/hook rid=bn5cnj2ara13u0ug6k50
2019-11-12T15:12:22.185130326Z |INFO| http_request elapsed=9664.741379 method=POST path=/api/github/hook rid=bn5cnj2ara13u0ug6k50 size=33 status=500
2019-11-12T15:12:22.697482102Z |INFO| Received webhook event github_delivery_id=d352d410-055e-11ea-9e02-785d94577c26 github_event_type=status rid=bn5cnliara13u0ug6k60
2019-11-12T15:12:22.698192766Z |INFO| http_request elapsed=0.8035 method=POST path=/api/github/hook rid=bn5cnliara13u0ug6k60 size=0 status=200

We use latest version 1.12.3 of policy-bot deployed with docker/k8s on GCP with a traefik in between.

On fail we can found an error in github UI, in the policy bot app config advanced recent deliveries: "We couldn’t deliver this payload: Service Timeout"

Any clue would help here.

Thank you for

clakech commented 4 years ago

Ok. GitHub expect a response I less than 10 seconds https://developer.github.com/v3/guides/best-practices-for-integrators/

In our case, policy-bot is too slow here.

Any clue to avoid this? Async config to run hook later would help.

bluekeyes commented 4 years ago

Thanks for reporting this. I'm not sure about the 10 second timeout (which may be a different issue), but the initial error looks very similar to an issue on our Bulldozer application: https://github.com/palantir/bulldozer/issues/149.

In that case, I found and fixed a potential race condition (https://github.com/palantir/go-githubapp/pull/24) in the shared library used by both of these apps, but it seems like that was not the problem, and I haven't had a chance to debug further. Building a test binary with Go's race detector is probably the next step.

clakech commented 4 years ago

This 10 seconds requierement is documented in the github documentation:https://developer.github.com/v3/guides/best-practices-for-integrators/

GitHub expects that integrations respond within 10 seconds of receiving the webhook payload. If your service takes longer than that to complete, then GitHub terminates the connection and the payload is lost.

Thank you for the pointer to this bulldozer issue. (we are using bulldozer too) Maybe we have a cache issue here and we should not call github APIs so many times OR maybe we need to invalidate the cache when we change some github teams/config. Let's investigate deeper on this issue.

clakech commented 4 years ago

BTW, Github recommandation suggest to use another software architecture for this kind of bot:

Favor asynchronous work over synchronous
...
Since it's impossible to predict how fast your service will complete, you should do all of "the real work" in a background job. Resque (for Ruby), RQ (for Python), or RabbitMQ (for Java) are examples of libraries that can handle queuing and processing of background jobs.

Note that even with a background job running, GitHub still expects your server to respond within ten seconds. Your server needs to acknowledge that it received the payload by sending some sort of response. It's critical that your service performs any validations on a payload as soon as possible, so that you can accurately report whether your server will continue with the request or not.
clakech commented 4 years ago

We are confident with this timeout issue because we have a HTTP 499 error in our load-balancer (traefik) and that means we have a "Client Closed Request" from the Github side.

bluekeyes commented 4 years ago

I believe the originally reported panic is fixed in 1.12.4; unfortunately, the panic occurred when there was a failure communicating with GitHub, so requests that previously panicked will still probably fail.

Any issues due to processing times higher than GitHub's timeout are still present and will probably require a bigger architecture change to solve.