openshift-pipelines / pipelines-as-code

Pipelines-as-Code for Tekton
https://pipelinesascode.com
Apache License 2.0

Pipelines as Code controller may crash on very high loads #1635

Closed · chmouel closed this 3 months ago

chmouel commented 3 months ago

While doing stress testing with over 200 pipelineruns, I saw this crash happening after a while (there were only 2 left to process):

cc @sm43

💡 08:30:49 issue_comment: pipelinerun recheck on chmouel/e2e-gapps#159 has been requested
panic: reflect: call of reflect.Value.Interface on zero Value
goroutine 13064 [running]:
reflect.valueInterface({0x0?, 0x0?, 0xc000a71d40?}, 0xc0?)
        /usr/lib/golang/src/reflect/value.go:1485 +0x10e
reflect.Value.Interface(...)
        /usr/lib/golang/src/reflect/value.go:1480
github.com/openshift-pipelines/pipelines-as-code/pkg/sort.findJSONPathResults(0x251dd3e?, {0x2850fd8?, 0x0?})
        /src/pkg/sort/runtime_sort.go:224 +0xe5
github.com/openshift-pipelines/pipelines-as-code/pkg/sort.(*RuntimeSort).Less(0xc0003c2180, 0x2852f18?, 0xc000be6280?)
        /src/pkg/sort/runtime_sort.go:191 +0x29f
sort.insertionSort({0x285ab30, 0xc0003c2180}, 0x0, 0x3)
        /usr/lib/golang/src/sort/zsortinterface.go:12 +0xb1
sort.pdqsort({0x285ab30, 0xc0003c2180}, 0x40?, 0x229e300?, 0xc000f89b01?)
        /usr/lib/golang/src/sort/zsortinterface.go:73 +0x2f7
sort.Sort({0x285ab30, 0xc0003c2180})
        /usr/lib/golang/src/sort/sort.go:48 +0x5d
github.com/openshift-pipelines/pipelines-as-code/pkg/sort.ByField({0x252cac8, 0x10}, {0xc0003c2140, 0x3, 0x4})
        /src/pkg/sort/runtime_sort.go:230 +0x10c
github.com/openshift-pipelines/pipelines-as-code/pkg/pipelineascode.(*ConcurrencyManager).GetExecutionOrder(0xc000d09a10)
        /src/pkg/pipelineascode/concurrency.go:52 +0x130
github.com/openshift-pipelines/pipelines-as-code/pkg/pipelineascode.(*PacRun).Run(0xc0003ef950, {0x28688f0?, 0xc00062ecc0})
        /src/pkg/pipelineascode/pipelineascode.go:107 +0x776
github.com/openshift-pipelines/pipelines-as-code/pkg/adapter.(*sinker).processEvent(0xc0006cb7c0, {0x28688f0, 0xc00062ecc0}, 0x74313c?)
        /src/pkg/adapter/sinker.go:58 +0x375
github.com/openshift-pipelines/pipelines-as-code/pkg/adapter.listener.handleEvent.func1.1()
        /src/pkg/adapter/adapter.go:187 +0x36
created by github.com/openshift-pipelines/pipelines-as-code/pkg/adapter.listener.handleEvent.func1
        /src/pkg/adapter/adapter.go:186 +0x955
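
For reference, the panic at the top of the trace is the standard error reflect raises when Interface() is called on a zero (invalid) Value, which lines up with the findJSONPathResults frame: a JSONPath lookup that found nothing for that pipelinerun. A minimal standalone Go repro of just that panic, nothing pipelines-as-code specific:

package main

import (
	"fmt"
	"reflect"
)

func main() {
	// A zero reflect.Value is what a field/JSONPath lookup yields when it
	// finds nothing at all.
	var zero reflect.Value
	fmt.Println("valid:", zero.IsValid()) // prints: valid: false

	// Calling Interface() on it panics with exactly the message seen above:
	//   panic: reflect: call of reflect.Value.Interface on zero Value
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	_ = zero.Interface()
}
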
sm43 commented 3 months ago

https://github.com/openshift-pipelines/pipelines-as-code/pull/1636

This probably won't fix the issue; I couldn't find a reason for the failure 😞 but updating to the latest anyway.

sm43 commented 3 months ago

We can update the code to not panic and return an error instead 🤔 I tried multiple tests but couldn't reproduce the issue @chmouel

chmouel commented 3 months ago

yeah it's a very weird one, I did try a few runs of stress testing with more than 600 PipelineRuns at a concurrency of 5 and a max-keep-run of 2 and could not reproduce it anymore either.

I remember that when I was trying to move the cleanup process to the controller a while back, I hit that same exact error,

but yeah, I think the easiest way is to check if the prun is nil before sorting, isn't it?

sm43 commented 3 months ago

but yeah, I think the easiest way is to check if the prun is nil before sorting, isn't it?

Yeah, will add a nil check.
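
For illustration only (this is not the actual pkg/sort code), a sketch of the kind of guard that avoids the panic: validate the reflect value / nil pointer before calling Interface() inside the sort comparison, and push keyless items to one end instead of crashing. The run and sortKey names below are made up for the example:

package main

import (
	"fmt"
	"reflect"
	"sort"
)

// run stands in for a PipelineRun-like object whose sort key may be missing.
type run struct {
	Name      string
	StartTime *string // nil when the run has not started yet
}

// sortKey extracts the key via reflection and reports ok=false instead of
// letting reflect panic on a zero Value or nil pointer.
func sortKey(r run) (string, bool) {
	v := reflect.ValueOf(r).FieldByName("StartTime")
	if !v.IsValid() || v.IsNil() { // Interface() on a zero Value would panic
		return "", false
	}
	return *v.Interface().(*string), true
}

func main() {
	runs := []run{
		{Name: "b", StartTime: strPtr("2023-06-02T10:00:00Z")},
		{Name: "a"}, // missing key: this is the case that used to panic
		{Name: "c", StartTime: strPtr("2023-06-01T10:00:00Z")},
	}
	sort.SliceStable(runs, func(i, j int) bool {
		ki, okI := sortKey(runs[i])
		kj, okJ := sortKey(runs[j])
		if !okI || !okJ {
			return !okI && okJ // keyless runs sort first instead of crashing
		}
		return ki < kj
	})
	for _, r := range runs {
		fmt.Println(r.Name) // prints: a, c, b
	}
}

func strPtr(s string) *string { return &s }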

sm43 commented 3 months ago

Can we do another load test? 👀

chmouel commented 3 months ago

I can try 🙃

will launch 1000 and see how it goes

chmouel commented 3 months ago

I did a lot of tests this morning, multiple runs:

2 repos with 3 pipelineruns each, with a max-keep-run of 2

one repo has a concurrency of 5 and the other a concurrency of 10

I ran 10 loops of 30 /retest on each pipelinerun (using a second controller on GHE to avoid rate limiting)

so 10 * (30 * 3) = 900 per repo, * 2 repos == 1800

Since we have concurrencies of 5 and 10, there are 15 pipelineruns running at any given time, and only 12 left as Succeeded at the end (max-keep-run of 2 * 3 pipelineruns * 2 repos).

Everything went to completion. The watcher gradually increases its memory usage, but I think it's the knative library and its informers doing that (it's worse on the pipeline reconciler).

The only issue I had was a name conflict in the secret generation, which uses a 4-character random suffix (roughly a 10^4 name space), so over 1800 runs there is about a 14% chance this occurs. We may want to increase the secret name length to avoid that...
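
Just to sketch the remedy mentioned above (this is not the actual pipelines-as-code secret-name generator, and the prefix and alphabet below are purely hypothetical): lengthening the random suffix grows the name space exponentially, e.g. 26^n for lowercase letters, so going from 4 to 8 characters makes a collision across a couple thousand runs negligible.

package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

const alphabet = "abcdefghijklmnopqrstuvwxyz"

// randomSuffix returns n random lowercase letters using crypto/rand.
// Purely illustrative: the real generator's alphabet and length may differ.
func randomSuffix(n int) (string, error) {
	out := make([]byte, n)
	for i := range out {
		idx, err := rand.Int(rand.Reader, big.NewInt(int64(len(alphabet))))
		if err != nil {
			return "", err
		}
		out[i] = alphabet[idx.Int64()]
	}
	return string(out), nil
}

func main() {
	suffix, err := randomSuffix(8) // 8 chars instead of 4 to shrink collision odds
	if err != nil {
		panic(err)
	}
	fmt.Println("pac-secret-" + suffix) // hypothetical name format
}
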

but other than that so far so good, no crash whatsoever...

little image to illustrate this comment:

[attached screenshot: image(1)]

I guess we can close it.