openshift / assisted-service

Apache License 2.0
109 stars 209 forks source link

Crash on cluster GET during installation #6550

Open almusil opened 2 months ago

almusil commented 2 months ago

I've got the following crash in the logs when I try to access the cluster page during installation:

10.45.224.87 - - [09/Jul/2024:14:01:35 +0000] "GET /api/assisted-install/v2/clusters/20e31b11-3aba-4457-9e44-7c5f9efa3a69 HTTP/1.1" 200 43515 "http://wsfd-advnetlab157.anl.eng.rdu2.dc.redhat.com:8080/clusters/20e31b11-3aba-4457-9e44-7c5f9efa3a69" "Mozilla/5.0 (X11; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0" "-"
 2024/07/09 14:01:35 http: panic serving 127.0.0.1:43844: write tcp 127.0.0.1:8090->127.0.0.1:43844: write: broken pipe

 net/http.(*conn).serve.func1()
         /usr/local/go/src/net/http/server.go:1854 +0xbf
 panic({0x4630ee0, 0xc00143e000})
         /usr/local/go/src/runtime/panic.go:890 +0x263
 github.com/openshift/assisted-service/restapi/operations/installer.(*V2GetClusterOK).WriteResponse(0xc001bae008, {0x52bdd90, 0xc00250cd80}, {0x52a5e00, 0x4d22ab8})
         /assisted-service/restapi/operations/installer/v2_get_cluster_responses.go:56 +0xa6
 github.com/go-openapi/runtime/middleware.(*Context).Respond(0xc002343110, {0x52bdd90?, 0xc00250cd80}, 0xc00087af00, {0xc000d9ffe0?, 0x1, 0x1}, 0xc00087ad00, {0x45bb0e0, 0xc001bae008})
         /assisted-service/vendor/github.com/go-openapi/runtime/middleware/context.go:510 +0x6b5
 github.com/openshift/assisted-service/restapi/operations/installer.(*V2GetCluster).ServeHTTP(0xc0016a3848, {0x52bdd90, 0xc00250cd80}, 0xc00087af00)
         /assisted-service/restapi/operations/installer/v2_get_cluster.go:67 +0x35e
 github.com/go-openapi/runtime/middleware.NewOperationExecutor.func1({0x52bdd90, 0xc00250cd80}, 0xc00087af00)
         /assisted-service/vendor/github.com/go-openapi/runtime/middleware/operation.go:28 +0x59
 net/http.HandlerFunc.ServeHTTP(0x0?, {0x52bdd90?, 0xc00250cd80?}, 0x17ab757?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/openshift/assisted-service/internal/metrics.Handler.func1.1()
         /assisted-service/internal/metrics/reporter.go:37 +0x31
 github.com/slok/go-http-metrics/middleware.Middleware.Measure({{{0x52bdd60, 0xc00070e000}, {0x4a64bc2, 0x12}, 0x0, 0x0, 0x0}}, {0x0, 0x0}, {0x52d0880, ...}, ...)
         /assisted-service/vendor/github.com/slok/go-http-metrics/middleware/middleware.go:117 +0x30e
 github.com/openshift/assisted-service/internal/metrics.Handler.func1({0x52cbef0?, 0xc000a9a540}, 0xc00087af00)
         /assisted-service/internal/metrics/reporter.go:36 +0x35f
 net/http.HandlerFunc.ServeHTTP(0x52cd7b0?, {0x52cbef0?, 0xc000a9a540?}, 0xc001b92550?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/openshift/assisted-service/pkg/context.ContextHandler.func1.1({0x52cbef0, 0xc000a9a540}, 0xc00087ae00)
         /assisted-service/pkg/context/param.go:95 +0xc8
 net/http.HandlerFunc.ServeHTTP(0xc00213c448?, {0x52cbef0?, 0xc000a9a540?}, 0x31475a3235305661?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/go-openapi/runtime/middleware.NewRouter.func1({0x52cbef0, 0xc000a9a540}, 0xc00087ac00)
         /assisted-service/vendor/github.com/go-openapi/runtime/middleware/router.go:77 +0x257
 net/http.HandlerFunc.ServeHTTP(0x0?, {0x52cbef0?, 0xc000a9a540?}, 0xc00213c688?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/go-openapi/runtime/middleware.Redoc.func1({0x52cbef0, 0xc000a9a540}, 0x7?)
         /assisted-service/vendor/github.com/go-openapi/runtime/middleware/redoc.go:72 +0x242
 net/http.HandlerFunc.ServeHTTP(0x43317c0?, {0x52cbef0?, 0xc000a9a540?}, 0x171ec6a?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/go-openapi/runtime/middleware.Spec.func1({0x52cbef0, 0xc000a9a540}, 0xc001dc61c0?)
         /assisted-service/vendor/github.com/go-openapi/runtime/middleware/spec.go:46 +0x18c
 net/http.HandlerFunc.ServeHTTP(0x0?, {0x52cbef0?, 0xc000a9a540?}, 0x4?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1({0x52cb9b0, 0xc001dc61c0}, 0xc00213c8f0?)
         /assisted-service/vendor/github.com/NYTimes/gziphandler/gzip.go:336 +0x24e
 net/http.HandlerFunc.ServeHTTP(0x10?, {0x52cb9b0?, 0xc001dc61c0?}, 0xc00213c8d0?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/openshift/assisted-service/pkg/app.WithMetricsResponderMiddleware.func1({0x52cb9b0?, 0xc001dc61c0?}, 0xc00012c0a0?)
         /assisted-service/pkg/app/middleware.go:34 +0xb0
 net/http.HandlerFunc.ServeHTTP(0x7f2bf3656108?, {0x52cb9b0?, 0xc001dc61c0?}, 0xc000b8cb70?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/openshift/assisted-service/pkg/app.WithHealthMiddleware.func1({0x52cb9b0?, 0xc001dc61c0?}, 0x5291801?)
         /assisted-service/pkg/app/middleware.go:57 +0x162
 net/http.HandlerFunc.ServeHTTP(0x52cd708?, {0x52cb9b0?, 0xc001dc61c0?}, 0x52918c0?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/openshift/assisted-service/pkg/requestid.handler.ServeHTTP({{0x52a8280?, 0xc00139b180?}}, {0x52cb9b0, 0xc001dc61c0}, 0xc00087ab00)
         /assisted-service/pkg/requestid/requestid.go:69 +0x1ad
 github.com/openshift/assisted-service/internal/spec.WithSpecMiddleware.func1({0x52cb9b0?, 0xc001dc61c0?}, 0xc00087ab00?)
         /assisted-service/internal/spec/spec.go:38 +0x9b
 net/http.HandlerFunc.ServeHTTP(0x0?, {0x52cb9b0?, 0xc001dc61c0?}, 0x17820ce?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 net/http.serverHandler.ServeHTTP({0xc000b8cab0?}, {0x52cb9b0, 0xc001dc61c0}, 0xc00087ab00)
         /usr/local/go/src/net/http/server.go:2936 +0x316
 net/http.(*conn).serve(0xc001caabd0, {0x52cd7b0, 0xc000ea7a10})
         /usr/local/go/src/net/http/server.go:1995 +0x612
 created by net/http.(*Server).Serve
         /usr/local/go/src/net/http/server.go:3089 +0x5ed

The container version of AI:

quay.io/edge-infrastructure/assisted-image-service  latest            ef3637bb83cf  13 hours ago   179 MB
quay.io/edge-infrastructure/assisted-service        latest            bf3148e62279  16 hours ago   956 MB
quay.io/edge-infrastructure/assisted-installer-ui   latest            9d95da3d91dd  30 hours ago   462 MB
rccrdpccl commented 2 months ago

Thanks for reporting this @almusil

Can you please provide the version of the service that crashed? Looks like it's the assisted-service, you can extract the version by podman inspect and paste here ImageDigest field How was the service deployed? Is this a podman deployment? Where is the output you have pasted above coming from? (what's the access log line, is it the UI component logs interleaved with assisted-service's logs?)

almusil commented 2 months ago

Thanks for reporting this @almusil

Can you please provide the version of the service that crashed? Looks like it's the assisted-service, you can extract the version by podman inspect and paste here ImageDigest field How was the service deployed?

ImageDigest for all the AI images, even though the log suggests that it is the assisted-service.

# podman inspect quay.io/edge-infrastructure/assisted-image-service -f "{{ .Digest }}"
sha256:3b6baaeef183b158872468e4027f7851acb8c6dd6f45d80479a86360ab8c2682

# podman inspect quay.io/edge-infrastructure/assisted-service -f "{{ .Digest }}"
sha256:b5e0616e829acef9592fd87936b2f1da709426b24b4273879baad13c1d894db1

# podman inspect quay.io/edge-infrastructure/assisted-installer-ui -f "{{ .Digest }}"
sha256:38f20bd8214ed7a414e813daf1531b5e98db55433fc423024ddea87adb6d4798

Is this a podman deployment?

Yes it is podman deployment.

Where is the output you have pasted above coming from?

This was observed in the journal. The AI is running as part of bigger project to automate openshift deployments.

(what's the access log line, is it the UI component logs interleaved with assisted-service's logs?)

Not sure if I understand this one, can you please elaborate?

rccrdpccl commented 2 months ago

Not sure if I understand this one, can you please elaborate?

I'm interested in those lines:

10.45.224.87 - - [09/Jul/2024:14:01:35 +0000] "GET /api/assisted-install/v2/clusters/20e31b11-3aba-4457-9e44-7c5f9efa3a69 HTTP/1.1" 200 43515 "http://wsfd-advnetlab157.anl.eng.rdu2.dc.redhat.com:8080/clusters/20e31b11-3aba-4457-9e44-7c5f9efa3a69" "Mozilla/5.0 (X11; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0" "-"
 2024/07/09 14:01:35 http: panic serving 127.0.0.1:43844: write tcp 127.0.0.1:8090->127.0.0.1:43844: write: broken pipe

The first line seems to be an access log, which should not have been emitted by assisted-service. The second line is a panic for unhandled broken connection, which might be assisted service. Can you confirm it's the assisted service emitting the panic log, and not other applications?

If that's the case, could you describe what is "accessing the cluster page"? Is it accessing assisted-service-ui at the cluster page? If so, could you also try with curl and the API if we get the same crash?

It seems that the service could be trying to write to a closed connection.

almusil commented 2 months ago

From the journal it is the assisted-installer-service: assisted-installer-service[2211655]: 2024/07/10 13:09:40 http: panic serving 127.0.0.1:42856: write tcp 127.0.0.1:8090->127.0.0.1:42856: write: broken pipe

I cannot pinpoint exact URL that causes this with CURL.

Screenshot from 2024-07-10 15-15-06

Screenshot from 2024-07-10 15-15-20

Clicking the ovnscale causes the failure, not sure how to better describe it.

rccrdpccl commented 1 month ago

the path is GET /api/assisted-install/v2/clusters/20e31b11-3aba-4457-9e44-7c5f9efa3a69 (as in the access logs), but you probably need authentication, would you be able to run that and see if it causes the same issue?

rccrdpccl commented 1 month ago

You can also inspect what the UI is doing with developer tools and replay the same request, but from curl instead

almusil commented 1 month ago

Tried both without any success unfortunately

rccrdpccl commented 1 month ago

WDYM "without any success"? Still crashing? Can you try to paste the curl's output?

almusil commented 1 month ago

The curl doesn't crash, I don't have the cluster anymore unfortunately

rccrdpccl commented 1 month ago

Do you have a minimal example/setup where we can reproduce this issue?