open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Supervisor hangs when OpAMP server backend is restarted #33799

Open acrmp opened 1 week ago

acrmp commented 1 week ago

Component(s)

cmd/opampsupervisor

What happened?

Description

The supervisor appears to hang while reconnecting to the OpAMP server backend after the server is restarted.

Steps to Reproduce

  1. Start the example OpAMP server
  2. Start the supervisor and see that it successfully starts
  3. See that the agent is visible in the example server UI
  4. Stop the example server (CTRL-C)
  5. See that the supervisor reports that the connection is closed and that it will retry connecting
  6. Start the example server again

Expected Result

Supervisor logs that it has reconnected to the server.

Actual Result

Collector version

7c573a9f

Environment information

Environment

OpenTelemetry Collector configuration

No response

Log output

2024-06-28T01:47:10.752Z        ERROR   supervisor/logger.go:26 Connection failed (dial tcp 127.0.0.1:4320: connect: connection refused), will retry.
github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.(*opAMPLogger).Errorf
        /home/pivotal/workspace/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor/logger.go:26
github.com/open-telemetry/opamp-go/client.(*wsClient).ensureConnected
        /home/pivotal/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.15.0/client/wsclient.go:207
github.com/open-telemetry/opamp-go/client.(*wsClient).runOneCycle
        /home/pivotal/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.15.0/client/wsclient.go:245
github.com/open-telemetry/opamp-go/client.(*wsClient).runUntilStopped
        /home/pivotal/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.15.0/client/wsclient.go:330
github.com/open-telemetry/opamp-go/client/internal.(*ClientCommon).StartConnectAndRun.func1
        /home/pivotal/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.15.0/client/internal/clientcommon.go:197

Additional context

It looks like the supervisor blocks in the OnConnectFunc callback while sending to the unbuffered connectedToOpAMPServer channel: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/7c573a9ffe543a33d3f1aae439e3540bac303f05/cmd/opampsupervisor/supervisor/supervisor.go#L390

$ killall -3 opampsupervisor
...
goroutine 28 gp=0xc000102fc0 m=nil [chan send]:
runtime.gopark(0xa82340?, 0xc000025380?, 0x0?, 0x70?, 0x55?)
        /usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc0002abad0 sp=0xc0002abab0 pc=0x4402ae
runtime.chansend(0xc00010e300, 0xc0002abba7, 0x1, 0xc0002abb90?)
        /usr/local/go/src/runtime/chan.go:259 +0x38d fp=0xc0002abb40 sp=0xc0002abad0 pc=0x40b3cd
runtime.chansend1(0x18?, 0xc000035f80?)
        /usr/local/go/src/runtime/chan.go:145 +0x17 fp=0xc0002abb70 sp=0xc0002abb40 pc=0x40b037
github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.(*Supervisor).startOpAMPClient.func1({0x98fda0?, 0x0?})
        /home/pivotal/workspace/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor/supervisor.go:390 +0x28 fp=0xc0002abbc0 sp=0xc0002abb70 pc=0x8ade48
github.com/open-telemetry/opamp-go/client/types.CallbacksStruct.OnConnect(...)
        /home/pivotal/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.15.0/client/types/callbacks.go:140
github.com/open-telemetry/opamp-go/client/types.(*CallbacksStruct).OnConnect(0xc00025c748?, {0xa86098?, 0xc00028c2d0?})
        <autogenerated>:1 +0x5e fp=0xc0002abc20 sp=0xc0002abbc0 pc=0x7f487e
github.com/open-telemetry/opamp-go/client.(*wsClient).tryConnectOnce(0xc00025c600, {0xa86098, 0xc00028c2d0})
        /home/pivotal/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.15.0/client/wsclient.go:178 +0x53f fp=0xc0002abce0 sp=0xc0002abc20 pc=0x89edff
github.com/open-telemetry/opamp-go/client.(*wsClient).ensureConnected(0xc00025c600, {0xa86098, 0xc00028c2d0})
        /home/pivotal/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.15.0/client/wsclient.go:201 +0x10c fp=0xc0002abd90 sp=0xc0002abce0 pc=0x89ef4c
github.com/open-telemetry/opamp-go/client.(*wsClient).runOneCycle(0xc00025c600, {0xa86098, 0xc00028c2d0})
        /home/pivotal/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.15.0/client/wsclient.go:245 +0x51 fp=0xc0002abf50 sp=0xc0002abd90 pc=0x89f1f1
github.com/open-telemetry/opamp-go/client.(*wsClient).runUntilStopped(0xc00025c600, {0xa86098, 0xc00028c2d0})
        /home/pivotal/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.15.0/client/wsclient.go:330 +0x33 fp=0xc0002abf78 sp=0xc0002abf50 pc=0x89fab3
github.com/open-telemetry/opamp-go/client.(*wsClient).runUntilStopped-fm({0xa86098?, 0xc00028c2d0?})
        <autogenerated>:1 +0x33 fp=0xc0002abfa0 sp=0xc0002abf78 pc=0x89fc73
github.com/open-telemetry/opamp-go/client/internal.(*ClientCommon).StartConnectAndRun.func1()
...
github-actions[bot] commented 1 week ago

Pinging code owners: