openshift-pipelines / pipeline-service

SaaS for Tekton Pipelines
Apache License 2.0

tektoncd-results update #968

Closed: gabemontero closed this 6 months ago

gabemontero commented 6 months ago

replaces https://github.com/openshift-pipelines/pipeline-service/pull/966, whose rebase somehow got into a bad state

gabemontero commented 6 months ago

I think I'm hitting some flakes in the 2 tests

Attempting a manual upgrade / downgrade from https://github.com/gabemontero/pipeline-service/tree/man-upgrade with dev_setup.sh
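
For reference, the manual exercise looks roughly like this (a sketch only; the branch is the one linked above, but the dev_setup.sh path and checkout flow are assumptions, not the verbatim commands):

```bash
# Rough shape of the manual upgrade / downgrade loop; the dev_setup.sh
# path and invocation are assumptions for illustration.
set -euo pipefail

git clone https://github.com/gabemontero/pipeline-service && cd pipeline-service

for i in 1 2 3; do
    echo "=== iteration $i: upgrade to the PR branch ==="
    git checkout man-upgrade
    ./developer/openshift/dev_setup.sh   # deploy the PR revision

    echo "=== iteration $i: downgrade to main ==="
    git checkout main
    ./developer/openshift/dev_setup.sh   # roll back to the baseline revision
done
```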

I've also pushed a fix to log gathering on test failures in this PR, though I forget if changes to the test scripts are picked up in PRs
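
For context, one common shape for "gather logs on test failure" is an EXIT trap that dumps pod logs when the script exits non-zero; this is a sketch of the pattern, not the actual change in this PR (the namespace and label selectors are assumptions):

```bash
# Sketch of a gather-logs-on-failure pattern via an EXIT trap; the
# namespace and label selectors below are assumptions for illustration.
gather_logs_on_failure() {
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "test failed with rc=$rc, collecting pipeline-service logs ..."
        kubectl -n tekton-results logs -l app.kubernetes.io/name=tekton-results-api --tail=-1 || true
        kubectl -n tekton-results logs -l app.kubernetes.io/name=tekton-results-watcher --tail=-1 || true
    fi
    exit "$rc"
}
trap gather_logs_on_failure EXIT
```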

gabemontero commented 6 months ago

baseline passed again; another flake in the upgrade test, but I think it's hcp / rosa related again @xinredhat

step-run-plnsvc-setup:

```
+ touch /workspace/workdir/source/destroy-cluster.txt
+ echo 'Execute dev_setup.sh script to set up pipeline-service ...'
Execute dev_setup.sh script to set up pipeline-service ...
+ for _ in {1..3}
+ kubectl -n default exec pod/ci-runner -- sh -c '/workspace/sidecar/bin/plnsvc_setup.sh https://github.com/openshift-pipelines/pipeline-service main'
Error from server: error dialing backend: EOF
+ echo 'Failed to execute dev_setup.sh script, retrying ...'
Failed to execute dev_setup.sh script, retrying ...
+ sleep 5
+ for _ in {1..3}
+ kubectl -n default exec pod/ci-runner -- sh -c '/workspace/sidecar/bin/plnsvc_setup.sh https://github.com/openshift-pipelines/pipeline-service main'
Error from server: error dialing backend: EOF
Failed to execute dev_setup.sh script, retrying ...
+ echo 'Failed to execute dev_setup.sh script, retrying ...'
+ sleep 5
+ for _ in {1..3}
+ kubectl -n default exec pod/ci-runner -- sh -c '/workspace/sidecar/bin/plnsvc_setup.sh https://github.com/openshift-pipelines/pipeline-service main'
Error from server: error dialing backend: EOF
Failed to execute dev_setup.sh script, retrying ...
+ echo 'Failed to execute dev_setup.sh script, retrying ...'
+ sleep 5
```
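
For readability, the loop producing that trace is roughly the following (reconstructed from the xtrace output above, not copied from the CI script). All three attempts died with "error dialing backend: EOF", which typically means the API server could not reach the kubelet for the exec — pointing at cluster infrastructure rather than the test itself:

```bash
# Reconstructed from the xtrace above: three attempts at running the setup
# script inside the ci-runner pod, 5s apart. The echo text mirrors the
# trace, even though the underlying command is plnsvc_setup.sh.
for _ in {1..3}; do
    if kubectl -n default exec pod/ci-runner -- sh -c \
        '/workspace/sidecar/bin/plnsvc_setup.sh https://github.com/openshift-pipelines/pipeline-service main'; then
        break
    fi
    echo 'Failed to execute dev_setup.sh script, retrying ...'
    sleep 5
done
```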
gabemontero commented 6 months ago

/retest

gabemontero commented 6 months ago

manual testing with repeated upgrades / downgrades looks good, including my diagnostic logs distinguishing canceled context vs. deadline exceeded
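
(For anyone reproducing: the two flavors can be told apart by grepping the watcher logs. "context canceled" and "context deadline exceeded" are the standard Go error texts; the deployment name below is an assumption.)

```bash
# Count each context-failure flavor in the watcher logs; the deployment
# name is an assumption for illustration.
kubectl -n tekton-results logs deployment/tekton-results-watcher --tail=-1 \
    | grep -c 'context canceled'
kubectl -n tekton-results logs deployment/tekton-results-watcher --tail=-1 \
    | grep -c 'context deadline exceeded'
```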

will give the upgrade test one more try, then merging

gabemontero commented 6 months ago

analyzing the latest failure ...

gabemontero commented 6 months ago

So I see this pattern of a CreateResult failing, followed by an UpdateLog call, in the upgrade test that I don't see either in my manual testing of an upgrade or in the install-from-scratch test here in CI:

{"level":"error","ts":1710358372.681495,"caller":"zap/options.go:212","msg":"finished unary call with code Unknown","grpc.auth_disabled":false,"grpc.start_time":"2024-03-13T19:32:51Z","system":"grpc","span.kind":"server","grpc.service":"tekton.results.v1alpha2.Results","grpc.method":"CreateResult","peer.address":"10.128.0.65:35622","grpc.user":"system:serviceaccount:tekton-results:tekton-results-watcher","grpc.issuer":"https://rh-oidc.s3.us-east-1.amazonaws.com/273tbj71skqksgqafoe5aotsuc44blp4","error":"ERROR: duplicate key value violates unique constraint \"results_by_name\" (SQLSTATE 23505)","grpc.code":"Unknown","grpc.time_duration_in_ms":999,"stacktrace":"github.com/grpc-ecosystem/go-grpc-middleware/logging/zap.DefaultMessageProducer\n\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/logging/zap/options.go:212\ngithub.com/grpc-ecosystem/go-grpc-middleware/logging/zap.UnaryServerInterceptor.func1\n\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/logging/zap/server_interceptors.go:39\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware/tags.UnaryServerInterceptor.func1\n\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/tags/interceptors.go:23\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34\ngithub.com/tektoncd/results/proto/v1alpha2/results_go_proto._Results_CreateResult_Handler\n\t/opt/app-root/src/proto/v1alpha2/results_go_proto/api_grpc.pb.go:258\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/opt/app-root/src/vendor/google.golang.org/grpc/server.go:1372\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/opt/app-root/src/vendor/google.golang.org/grpc/server.go:1783\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\t/opt/app-root/src/vendor/google.golang.org/grpc/server.go:1016"} {"level":"info","ts":1710358372.881164,"caller":"zap/options.go:212","msg":"finished unary call with code OK","grpc.auth_disabled":false,"grpc.start_time":"2024-03-13T19:32:52Z","system":"grpc","span.kind":"server","grpc.service":"tekton.results.v1alpha2.Results","grpc.method":"GetRecord","peer.address":"10.128.0.65:35622","grpc.user":"system:serviceaccount:tekton-results:tekton-results-watcher","grpc.issuer":"https://rh-oidc.s3.us-east-1.amazonaws.com/273tbj71skqksgqafoe5aotsuc44blp4","grpc.code":"OK","grpc.time_duration_in_ms":798} {"level":"info","ts":1710358373.0968993,"caller":"zap/options.go:212","msg":"finished streaming call with code OK","grpc.auth_disabled":false,"grpc.start_time":"2024-03-13T19:32:51Z","system":"grpc","span.kind":"server","grpc.service":"tekton.results.v1alpha2.Logs","grpc.method":"UpdateLog","peer.address":"10.128.0.65:35622","grpc.user":"system:serviceaccount:tekton-results:tekton-results-watcher","grpc.issuer":"https://rh-oidc.s3.us-east-1.amazonaws.com/273tbj71skqksgqafoe5aotsuc44blp4","grpc.code":"OK","grpc.time_duration_in_ms":1759}

And then in the test, fetches of records for the new PipelineRun work, but the log fetches in the script do not (see the sketch below).
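
The shape of that check is roughly the following (a sketch; the route host, token, namespace, and UIDs are placeholders, and the paths are the tekton-results v1alpha2 REST ones):

```bash
# Sketch of the record-vs-log fetch; every value in angle brackets is a
# placeholder. Record fetches succeed while the matching log fetch fails.
HOST='<tekton-results-api-route>'
TOKEN='<bearer-token>'
NS='<test-namespace>'
RESULT_UID='<result-uid>'
LOG_UID='<log-uid>'

# Record listing for the new PipelineRun's result: returns data.
curl -sk -H "Authorization: Bearer $TOKEN" \
    "https://$HOST/apis/results.tekton.dev/v1alpha2/parents/$NS/results/$RESULT_UID/records"

# Matching log fetch: this is the call that comes back empty / fails.
curl -sk -H "Authorization: Bearer $TOKEN" \
    "https://$HOST/apis/results.tekton.dev/v1alpha2/parents/$NS/results/$RESULT_UID/logs/$LOG_UID"
```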

Feels like an issue in the CI test, maybe PostgreSQL and S3 getting out of sync during the upgrades/downgrades.

It also does not help that the log name, i.e. 4627a656-6756-345c-a3c1-2be541fd5a26, is not mentioned in the UpdateLog call.
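
One quick correlation step is to grep the API server logs for that UUID (the deployment name here is an assumption):

```bash
# Check whether the log UUID from the failing fetch shows up anywhere in
# the results API server logs; the deployment name is an assumption.
kubectl -n tekton-results logs deployment/tekton-results-api --tail=-1 \
    | grep '4627a656-6756-345c-a3c1-2be541fd5a26' \
    || echo 'log name never mentioned server-side'
```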

Going to add a little more debug, then discuss with the team at standup.

gabemontero commented 6 months ago

well, and now it passed

merging