kubernetes / test-infra

Test infrastructure for the Kubernetes project.
Apache License 2.0

presubmits were not triggering for kubernetes/kubernetes #21090

Closed BenTheElder closed 3 years ago

BenTheElder commented 3 years ago

What happened:

When a PR is pushed or opened in kubernetes/kubernetes, we're not seeing jobs trigger, just the automatic GitHub statuses for required jobs like:

pull-kubernetes-conformance-kind-ga-only-parallel Expected — Waiting for status to be reported

If you comment /test all manually, jobs are triggered and run as expected.

What you expected to happen:

Tests should start when PRs that do not need ok-to-test are opened / pushed

How to reproduce it (as minimally and precisely as possible):

Push to or open a PR in github.com/kubernetes/kubernetes

Please provide links to example occurrences, if any:

https://github.com/kubernetes/kubernetes/pull/96968#issuecomment-788243206

Anything else we need to know?:

Seems to be happening to all new PRs in this repo at least.

/area prow

spiffxp commented 3 years ago

another example: https://github.com/kubernetes/kubernetes/pull/99609

BenTheElder commented 3 years ago

```
2021-03-01 12:30:58.351 PST panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x90 pc=0x782120]

goroutine 2773 [running]:
regexp.(*Regexp).doExecute(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0035a85d6, 0x6, 0x0, 0x0, ...)
	GOROOT/src/regexp/exec.go:527 +0x560
regexp.(*Regexp).doMatch(...)
	GOROOT/src/regexp/exec.go:514
regexp.(*Regexp).MatchString(...)
	GOROOT/src/regexp/regexp.go:525
k8s.io/test-infra/prow/plugins/blockade.compileApplicableBlockades(0xc0035a8630, 0xa, 0xc0035a8620, 0xa, 0xc0035a85d6, 0x6, 0xc00214d260, 0xc001968400, 0x8, 0x9, ...)
	prow/plugins/blockade/blockade.go:221 +0xb5f
k8s.io/test-infra/prow/plugins/blockade.handle(0x7fa94aa1a628, 0xc0038dee10, 0xc00214d260, 0xc001968400, 0x8, 0x9, 0x22b8680, 0xc00214d500, 0x20b85f8, 0xc003d42fd8, ...)
	prow/plugins/blockade/blockade.go:172 +0x1d5
k8s.io/test-infra/prow/plugins/blockade.handlePullRequest(0x233d3c0, 0xc0038dee10, 0x231d1c0, 0xc002f88780, 0x23364c0, 0xc002f5c6e0, 0xc002f8e450, 0x22f9560, 0xc00000f0e8, 0xc002f88800, ...)
	prow/plugins/blockade/blockade.go:126 +0x105
k8s.io/test-infra/prow/hook.(*Server).handlePullRequestEvent.func1(0xc0015b15e0, 0xc00000e950, 0xc002ef2a00, 0xc00438b290, 0x8, 0x20b8608)
	prow/hook/events.go:202 +0x3c8
created by k8s.io/test-infra/prow/hook.(*Server).handlePullRequestEvent
	prow/hook/events.go:192 +0x612
```

BenTheElder commented 3 years ago

We just had some PRs to blockade; looks like we introduced an NPE.

spiffxp commented 3 years ago

https://github.com/organizations/kubernetes/settings/hooks/10485935 - hooks are being delivered

EDIT: sorry, this link probably isn't visible to most

BenTheElder commented 3 years ago

https://github.com/kubernetes/test-infra/pull/21021 was pretty recent; we started using it in https://github.com/kubernetes/test-infra/pull/21082 15 hours ago.

spiffxp commented 3 years ago

revert https://github.com/kubernetes/test-infra/pull/21092

spiffxp commented 3 years ago

revert deployed

A /retest worked on a stuck PR: https://github.com/kubernetes/kubernetes/pull/99609#issuecomment-788270738

spiffxp commented 3 years ago

https://github.com/kubernetes/test-infra/pull/21093 - Ben has a PR open with a fix, but it may not make it into today's autobump PR.

BenTheElder commented 3 years ago

I think we should probably take another pass over this plugin before enabling this feature again, since I still haven't had a chance to fully trace how we got to the NPE, but #21093 at least gates on nil at the call site where we panicked.

@alvaroaleman also had a suggestion around ensuring hook recovers panics from plugins.
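A recovery wrapper along those lines could look like the following; handleEvent is a hypothetical sketch, not the actual prow/hook code:

```go
package main

import (
	"fmt"
	"log"
)

// handleEvent wraps a plugin handler so that a panic in one plugin is
// logged and contained rather than crashing the whole hook server.
// Hypothetical sketch of the suggested recovery pattern.
func handleEvent(name string, handler func() error) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("recovered from panic in plugin %q: %v", name, r)
		}
	}()
	if err := handler(); err != nil {
		log.Printf("plugin %q failed: %v", name, err)
	}
}

func main() {
	handleEvent("blockade", func() error {
		panic("nil pointer dereference") // simulated plugin panic
	})
	fmt.Println("hook server still running")
}
```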

BenTheElder commented 3 years ago

https://github.com/kubernetes/test-infra/issues/21098 for the latter

spiffxp commented 3 years ago

/retitle presubmits were not triggering for kubernetes/kubernetes

spiffxp commented 3 years ago

Pulling this out of Slack.

tl;dr: I think we should set up a log-based metric in Stackdriver, set up Prometheus to ingest the metrics Stackdriver exports, and keep alerting in prow's monitoring stack.

@alvaroaleman do y'all have something like this (or anything really) set up to detect panics in prow components?

I think this should be a followup issue, but I'm AFK.

alvaroaleman commented 3 years ago

> @alvaroaleman do y'all have something like this (or anything really) set up to detect panics in prow components?

We don't have something specifically for panics, but we have a Slack alert for Prow pods crashlooping which I believe would have been triggered by this.

chaodaiG commented 3 years ago

> We don't have something specifically for panics, but we have a Slack alert for Prow pods crashlooping which I believe would have been triggered by this.

@alvaroaleman, can we have this upstreamed? Or can you share where the config is located? I'd be happy to do the legwork.

alvaroaleman commented 3 years ago

It's here @chaodaiG : https://github.com/openshift/release/blob/ac1b4f17255011592a2fb104d121668fd6b85ef5/clusters/app.ci/prow-monitoring/mixins/_prometheus/prow_alerts.libsonnet#L9

That alert is a fairly standard thing but requires kube-state-metrics to be set up: https://github.com/kubernetes/kube-state-metrics
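For reference, a crashloop alert of that general shape could look roughly like the following Prometheus rule fragment. This is an illustrative sketch, not the linked config; the threshold and the kube_pod_container_status_restarts_total metric (exported by kube-state-metrics) are the key pieces:

```yaml
# Hedged sketch of a pod-crashloop alert; details are assumptions,
# only the metric name comes from kube-state-metrics.
groups:
- name: prow-crashloop
  rules:
  - alert: ProwPodCrashlooping
    # Fires when a container has restarted within the last 5 minutes.
    expr: rate(kube_pod_container_status_restarts_total[5m]) * 60 * 5 > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      message: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting.'
```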

chaodaiG commented 3 years ago

Looping back here: the Prometheus alert was set up in https://github.com/kubernetes/test-infra/pull/21394