Closed BenTheElder closed 3 years ago
another example: https://github.com/kubernetes/kubernetes/pull/99609
2021-03-01 12:30:58.351 PST
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x90 pc=0x782120]

goroutine 2773 [running]:
regexp.(*Regexp).doExecute(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0035a85d6, 0x6, 0x0, 0x0, ...)
    GOROOT/src/regexp/exec.go:527 +0x560
regexp.(*Regexp).doMatch(...)
    GOROOT/src/regexp/exec.go:514
regexp.(*Regexp).MatchString(...)
    GOROOT/src/regexp/regexp.go:525
k8s.io/test-infra/prow/plugins/blockade.compileApplicableBlockades(0xc0035a8630, 0xa, 0xc0035a8620, 0xa, 0xc0035a85d6, 0x6, 0xc00214d260, 0xc001968400, 0x8, 0x9, ...)
    prow/plugins/blockade/blockade.go:221 +0xb5f
k8s.io/test-infra/prow/plugins/blockade.handle(0x7fa94aa1a628, 0xc0038dee10, 0xc00214d260, 0xc001968400, 0x8, 0x9, 0x22b8680, 0xc00214d500, 0x20b85f8, 0xc003d42fd8, ...)
    prow/plugins/blockade/blockade.go:172 +0x1d5
k8s.io/test-infra/prow/plugins/blockade.handlePullRequest(0x233d3c0, 0xc0038dee10, 0x231d1c0, 0xc002f88780, 0x23364c0, 0xc002f5c6e0, 0xc002f8e450, 0x22f9560, 0xc00000f0e8, 0xc002f88800, ...)
    prow/plugins/blockade/blockade.go:126 +0x105
k8s.io/test-infra/prow/hook.(*Server).handlePullRequestEvent.func1(0xc0015b15e0, 0xc00000e950, 0xc002ef2a00, 0xc00438b290, 0x8, 0x20b8608)
    prow/hook/events.go:202 +0x3c8
created by k8s.io/test-infra/prow/hook.(*Server).handlePullRequestEvent
    prow/hook/events.go:192 +0x612
we just had some PRs to blockade, looks like we introduced an NPE
https://github.com/organizations/kubernetes/settings/hooks/10485935 - hooks are being delivered
EDIT: sorry, this link probably isn't visible to most
https://github.com/kubernetes/test-infra/pull/21021 was pretty recent; we started using it in https://github.com/kubernetes/test-infra/pull/21082 15 hours ago
revert deployed
https://github.com/kubernetes/kubernetes/pull/99609#issuecomment-788270738 a /retest worked on a stuck PR
https://github.com/kubernetes/test-infra/pull/21093 - Ben has a PR open to fix this, but it may not make it into today's autobump PR
I think we should probably take another pass over this plugin before enabling this feature again, since I still haven't had a chance to trace back how we got to the NPE fully, but #21093 fixes gating on nil at the callsite where we NPEd at least.
@alvaroaleman also had a suggestion around ensuring hook recovers panics from plugins.
https://github.com/kubernetes/test-infra/issues/21098 for the latter
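A rough sketch of the panic-recovery idea, assuming a simplified handler signature: the handler type and withRecover are hypothetical names for illustration, not hook's real API. Since the trace shows the handler running in its own goroutine, the deferred recover has to run inside that goroutine, which wrapping the handler itself achieves.

```go
package main

import (
	"fmt"
	"regexp"
)

// handler is a hypothetical plugin handler signature, not Prow's real one.
type handler func(event string) error

// withRecover wraps h so that a panic inside it is converted into an error
// instead of crashing the whole process.
func withRecover(name string, h handler) handler {
	return func(event string) (err error) {
		defer func() {
			if r := recover(); r != nil {
				err = fmt.Errorf("plugin %q panicked: %v", name, r)
			}
		}()
		return h(event)
	}
}

func main() {
	var re *regexp.Regexp // nil, mimicking the situation in the trace
	buggy := func(event string) error {
		_ = re.MatchString(event) // nil pointer dereference, panics
		return nil
	}
	safe := withRecover("blockade", buggy)
	if err := safe("master"); err != nil {
		fmt.Println("recovered:", err)
	}
}
```

With a wrapper like this, one broken plugin would fail its own event handling and log an error, while the other plugins keep receiving webhooks.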
/retitle presubmits were not triggering for kubernetes/kubernetes
Pulling this out of Slack:
tl;dr I think we should set up a log-based metric in Stackdriver, set up Prometheus to ingest the metrics exported by Stackdriver, and keep alerting in Prow's monitoring stack
@alvaroaleman do y'all have something like this (or anything really) setup to detect panics in prow components?
I think this should be a follow-up issue, but I'm AFK
We don't have something specifically for panics, but we have a Slack alert for Prow pods crashlooping which I believe would have been triggered by this.
@alvaroaleman, can we have this upstreamed? Or can you share where the config is located? I'd be happy to do the legwork
It's here @chaodaiG : https://github.com/openshift/release/blob/ac1b4f17255011592a2fb104d121668fd6b85ef5/clusters/app.ci/prow-monitoring/mixins/_prometheus/prow_alerts.libsonnet#L9
That alert is a fairly standard thing but requires kube-state-metrics to be set up: https://github.com/kubernetes/kube-state-metrics
Looping back here: the Prometheus alert was set up in https://github.com/kubernetes/test-infra/pull/21394
What happened:
When a PR is pushed or opened in kubernetes/kubernetes, we're not seeing jobs trigger, just the automatic GitHub statuses for required jobs.
If you comment /test all manually, jobs are triggered and run as expected.
What you expected to happen:
Tests should start when PRs that do not need ok-to-test are opened / pushed.
How to reproduce it (as minimally and precisely as possible):
Push to or open a PR in github.com/kubernetes/kubernetes
Please provide links to example occurrences, if any:
https://github.com/kubernetes/kubernetes/pull/96968#issuecomment-788243206
Anything else we need to know?:
Seems to be happening to all new PRs in this repo at least.
/area prow