open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Remote Configuration Capability of Supervisor is not restarting my collector if the configuration of the collector is changed #32959

Open MSA0208 opened 3 months ago

MSA0208 commented 3 months ago

Component(s)

No response

Describe the issue you're reporting

Hi Team,

I currently have my OpAMP server connected to the OpAMP Supervisor, which runs my Collector executable, and it runs fine with the supervisor.yaml below:

server:
  endpoint: ws://127.0.0.1:4320/v1/opamp
agent:
  executable: /root/OTEL98/opentelemetry-collector-contrib-main/cmd/otelcontribcol/OutputBinaries/NGxConnector

args: --config /root/OTEL98/opentelemetry-collector-contrib-main/cmd/otelcontribcol/config.yaml

I then enabled the Supervisor's capability to accept remote configurations, i.e.:

server:
  endpoint: ws://127.0.0.1:4320/v1/opamp
capabilities:
  AcceptsRemoteConfig: true
agent:
  executable: /root/OTEL98/opentelemetry-collector-contrib-main/cmd/otelcontribcol/OutputBinaries/NGxConnector

args: --config /root/OTEL98/opentelemetry-collector-contrib-main/cmd/otelcontribcol/config.yaml

After making this change in my supervisor.yaml and starting the Supervisor to run my executable, everything still runs fine.

My actual problem is that when I change the config.yaml of the collector pipeline, the change is not reflected on the Supervisor or the agent side. Please help me get this remote config working.

I am using the latest otel-collector-main version, along with the latest opamp-go-main version, and also the extension from my collector-contrib-main version of the OTel code.

github-actions[bot] commented 3 months ago

Pinging code owners for cmd/opampsupervisor: @evan-bradley @atoulme @tigrannajaryan. See Adding Labels via Comments if you do not have permissions to add labels yourself.

evan-bradley commented 3 months ago

Hi, @MSA0208. When you say you are changing the config.yaml file, do you mean this one?

/root/OTEL98/opentelemetry-collector-contrib-main/cmd/otelcontribcol/config.yaml

The Supervisor will only restart the Collector when it receives new configuration from the OpAMP server; changes to files on disk will not restart the Collector. If you are changing the code in the OpAMP server, do you see any logs in the Supervisor about receiving new config?

evan-bradley commented 3 months ago

@MSA0208 I deleted your comment because I noticed there were some credentials in there. I would suggest you rotate the tokens and change the passwords used in your config.

evan-bradley commented 3 months ago

Hi @evan-bradley Thank you so much for the Reply.

I want to know what kind of changes, and to which file, will result in a restart of the Collector. Yes, I have added logs in the OpAMP server code and also in the supervisor code from https://github.com/open-telemetry/opamp-go/tree/main/internal/examples/supervisor. I added logs all through the methods and found that effective.yaml is the file that gets executed, along with the args passed. So I tried changing effective.yaml manually, but that still didn't work, and supervisor/bin folder has [...] I have placed the actual config.yaml required for the collector in the same folder for testing purposes and tried modifying it, but that didn't restart my collector.

Please let me know the exact steps to follow, and which dynamic changes will restart the collector.

Thanks for the details. The only file you should modify directly is the Supervisor's configuration file. When using the Supervisor, all Collector configuration updates should be made through the OpAMP server, which will send them to the Supervisor and restart the Collector with the new config. The effective.yaml file should not be directly edited, it's only intended to be created/updated by the Supervisor.
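The flow described above can be sketched as a toy model. This is a minimal illustration in Python, not the Supervisor's actual code: `ToySupervisor`, `on_remote_config`, and `restart_count` are hypothetical names. The only behavior taken from the thread is that a config arriving from the OpAMP server is written to effective.yaml and followed by a restart, while edits to files on disk trigger nothing.

```python
import tempfile
from pathlib import Path


class ToySupervisor:
    """Toy stand-in for the Supervisor (hypothetical, for illustration only)."""

    def __init__(self, workdir: str) -> None:
        self.effective_path = Path(workdir) / "effective.yaml"
        self.restart_count = 0

    def on_remote_config(self, config_yaml: str) -> None:
        # Config pushed by the OpAMP server: persist it, then restart.
        self.effective_path.write_text(config_yaml)
        self._restart_collector()

    def _restart_collector(self) -> None:
        # A real supervisor would stop and relaunch the Collector process here.
        self.restart_count += 1


def demo() -> tuple[int, int]:
    with tempfile.TemporaryDirectory() as workdir:
        sup = ToySupervisor(workdir)
        # Hand-editing a file on disk: nothing observes it, so no restart.
        sup.effective_path.write_text("receivers: {}\n")
        after_manual_edit = sup.restart_count
        # A config delivered through the server path does cause a restart.
        sup.on_remote_config("receivers: {}\nexporters: {}\n")
        return after_manual_edit, sup.restart_count
```

In this model the manual edit is also overwritten by the next remote config, which matches the advice not to edit effective.yaml by hand.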

MSA0208 commented 3 months ago

Hi @evan-bradley,

Thank you. Let me try this and get back to you. Sorry, I missed masking or removing my creds from the config.yaml.

tigrannajaryan commented 3 months ago

The effective.yaml file should not be directly edited, it's only intended to be created/updated by the Supervisor.

To avoid future user confusion should we prepend effective.yaml file with a comment telling that it is autogenerated, is not meant to be user-editable and will be overwritten by supervisor?

evan-bradley commented 3 months ago

I was thinking the same thing, we should clearly indicate which files are not intended to be modified by the user.

MSA0208 commented 3 months ago

Hi @evan-bradley,

I am using the otel contrib Supervisor, https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/cmd/opampsupervisor, instead of the Supervisor provided by opamp-go, because it has some additional remote configuration capabilities, as specified in the document, that I need for my use case. I followed the document exactly and the server is starting, but I get the issue below:

*configFlag supervisor.yaml
Config Loaded supervisor.yaml
2024-05-15T02:44:15.515-0700 DEBUG commander/commander.go:74 Starting agent {"agent": "/root/OTEL98/opentelemetry-collector-contrib-main/cmd/otelcontribcol/OutputBinaries/NGxConnector"}
2024-05-15T02:44:15.516-0700 DEBUG commander/commander.go:93 Agent process started {"pid": 60790}
2024-05-15T02:44:18.518-0700 ERROR opampsupervisor/main.go:26 could not get bootstrap info from the Collector: collector's OpAMP client never connected to the Supervisor
main.main
    /root/OTEL98/opentelemetry-collector-contrib-main/cmd/opampsupervisor/main.go:26
runtime.main
    /usr/local/go/src/runtime/proc.go:271

Can you please help me with what to configure to solve this issue?

MSA0208 commented 3 months ago

Hi @everyone,

Expecting the solution response!!

MSA0208 commented 3 months ago

I was able to solve the above issue.

The issue I am facing now is that the agent is not healthy: I started my Supervisor on a random port, and it starts my agent collector, but the agent reports that it is unable to connect to the Supervisor, with the message Connection Refused.

Below is the sample collector config.yaml used for the agent collector; I am using the same file as my bootstrap.yaml.

collector-config.yaml

extensions:
  opamp:
    instance_uid: 01HYAH3BNC06AFVGQT5ZYC0GEK
    server:
      ws:
        endpoint: ws://127.0.0.1:4322/v1/opamp
  health_check:
    endpoint: "localhost:4444"
    tls:
    #  ca_file: "/path/to/ca.crt"
    #  cert_file: "/path/to/cert.crt"
    #  key_file: "/path/to/key.key"
    path: "/health/status"
    check_collector_pipeline:
      enabled: true
      interval: "5m"
      exporter_failure_threshold: 5

Let me know what else could be causing the issue, or point me to the fix that resolved this agent health problem. Also, after receiving my remote config, the Supervisor is unable to restart my agent collector; I think this could be because of the connection issue.

opamp-extension/agent log:

2024-05-27T05:59:37.069-0700 error opampextension@v0.98.0/opamp_agent.go:72 Failed to connect to the OpAMP server {"kind": "extension", "name": "opamp", "error": "dial tcp 127.0.0.1:4322: connect: connection refused"}
github.com/open-telemetry/opentelemetry-collector-contrib/extension/opampextension.(*opampAgent).Start.func2
    github.com/open-telemetry/opentelemetry-collector-contrib/extension/opampextension@v0.98.0/opamp_agent.go:72
github.com/open-telemetry/opamp-go/client/types.CallbacksStruct.OnConnectFailed
    github.com/open-telemetry/opamp-go@v0.14.0/client/types/callbacks.go:149
github.com/open-telemetry/opamp-go/client.(*wsClient).tryConnectOnce
    github.com/open-telemetry/opamp-go@v0.14.0/client/wsclient.go:153
github.com/open-telemetry/opamp-go/client.(*wsClient).ensureConnected
    github.com/open-telemetry/opamp-go@v0.14.0/client/wsclient.go:217
github.com/open-telemetry/opamp-go/client.(*wsClient).runOneCycle
    github.com/open-telemetry/opamp-go@v0.14.0/client/wsclient.go:261
github.com/open-telemetry/opamp-go/client.(*wsClient).runUntilStopped
    github.com/open-telemetry/opamp-go@v0.14.0/client/wsclient.go:346
github.com/open-telemetry/opamp-go/client/internal.(*ClientCommon).StartConnectAndRun.func1
    github.com/open-telemetry/opamp-go@v0.14.0/client/internal/clientcommon.go:202
2024-05-27T05:59:37.069-0700 error opampextension@v0.98.0/logger.go:26

supervisor logs:

Response from HealthChecker: &{404 Not Found 404 HTTP/1.1 1 1 map[Content-Length:[19] Content-Type:[text/plain; charset=utf-8] Date:[Mon, 27 May 2024 12:51:43 GMT] X-Content-Type-Options:[nosniff]] 0xc000040120 19 [] false false map[] 0xc0002165a0 }
health check on %s returned %d http://localhost:4444/ 404
2024-05-27T05:51:43.834-0700 ERROR supervisor/supervisor.go:884 Agent is not healthy {"error": "health check on http://localhost:4444/ returned 404"}
github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.(*Supervisor).healthCheck
    /root/OTEL98/opentelemetry-collector-contrib-main/cmd/opampsupervisor/supervisor/supervisor.go:884
github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.(*Supervisor).runAgentProcess
    /root/OTEL98/opentelemetry-collector-contrib-main/cmd/opampsupervisor/supervisor/supervisor.go:955
github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.NewSupervisor.func1
    /root/OTEL98/opentelemetry-collector-contrib-main/cmd/opampsupervisor/supervisor/supervisor.go:207
Inside SetHealth from clientCommon.go!!!: start_time_unix_nano:1716810290622218705 last_error:"health check on http://localhost:4444/ returned 404"
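One possible reading of the logs above, offered as an assumption and not confirmed anywhere in the thread: the health_check extension is configured with path "/health/status", while the Supervisor log shows it probing http://localhost:4444/. An HTTP server that only answers on the configured path returns 404 for any other path. The sketch below reproduces that mismatch with Python's standard library; the handler is a hypothetical stand-in, not the extension's implementation.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

HEALTH_PATH = "/health/status"  # path value taken from the collector config above


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in: answers 200 only on HEALTH_PATH."""

    def do_GET(self):
        if self.path == HEALTH_PATH:
            status, body = 200, b"ok\n"
        else:
            status, body = 404, b"404 page not found\n"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def probe(port: int, path: str) -> int:
    """Issue a GET and return the HTTP status code."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()
        return resp.status
    finally:
        conn.close()


def demo() -> tuple[int, int]:
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    try:
        return probe(port, "/"), probe(port, HEALTH_PATH)
    finally:
        server.shutdown()
        server.server_close()
```

If this reading applies, the probe URL and the configured path would need to agree; which side should change is not something the thread settles.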

github-actions[bot] commented 1 month ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Asarew commented 4 weeks ago

@MSA0208 How did you solve the issue of:

could not get bootstrap info from the Collector: collector's OpAMP client never connected to the Supervisor

MSA0208 commented 4 weeks ago

Hi @Asarew,

The Collector was not starting with the nop components, so I provided the actual service configuration with the opamp extension configured.

cforce commented 4 weeks ago

"Collector was not getting started with nop, " Sounds like a feature, not a bug. Why should it start if there is nothing to do?

MSA0208 commented 4 weeks ago

Related issue: [cmd/opampsupervisor] Use nop components during bootstrapping #32554

Asarew commented 4 weeks ago

My issue was that I didn't build the nop receiver and exporter into the collector.

MSA0208 commented 4 weeks ago

Hope your issue is solved now

The issue I am facing now is this: I have configured a random port for the Supervisor and started my collector with the Supervisor and the OpAMP server. The server is able to communicate the remote changes to the Supervisor, but the Supervisor is not passing the remote config on to my actual collector.

Asarew commented 4 weeks ago

As far as I know, the Supervisor starts on a random port just for the bootstrap communication. After that there is no communication between the Supervisor and the Collector except for restarts.

MSA0208 commented 4 weeks ago

OK, so you mean that we can't send the remote config received from the OpAMP server to our collector client using the Supervisor?

If that is the case, how do we send the remote config received at the Supervisor to the collector?

Asarew commented 4 weeks ago

The Supervisor writes the configuration to disk and then restarts the Collector.

MSA0208 commented 4 weeks ago

Yes, that will be the effective.yaml file. But when I use that effective.yaml, I continuously observe restarts on the client side: my collector is always in a restarting phase.

Asarew commented 4 weeks ago

Hmm, maybe check the agent.log file. I'm afraid I don't have a specific answer to your issue 😢

MSA0208 commented 4 weeks ago

Thanks for pointing me to agent.log. I found the issue; I have yet to solve it, but I will :)

Asarew commented 4 weeks ago

@MSA0208 You're welcome, good luck 👍🏾

MSA0208 commented 4 weeks ago

I was able to solve the issue, and my OpAMP setup is now working fine for remote configurations.

I also tried removing a few things from the pipeline; I think that causes an error in the collector.

For example: I have my log and metric pipelines configured and I want to remove the metric pipeline, but the removal is not applied.

MSA0208 commented 3 weeks ago

@Asarew @cforce

Have you ever tried updating the existing config.yaml through OpAMP remote configuration? Does that work?

I ask because the API on the web console says "Additional configurations".

What sort of changes to the existing config.yaml will be applied: update, add, delete?

Asarew commented 3 weeks ago

I can let you know at the beginning of next week. I'm still developing the otel controller and haven't gotten to updates yet, just the initial config push.

MSA0208 commented 3 weeks ago

@Asarew Sure, thank you. By then I will try all the possible kinds of remote changes and observe the behaviour.

Asarew commented 3 weeks ago

It took me a while to fix the controller, but now I can push new changes from the controller down to the Supervisor, which in turn saves the config to disk and restarts the Collector. So for me everything seems to be working fine.

cforce commented 2 weeks ago

What was fixed? I don't see any attached pull requests

MSA0208 commented 1 week ago

@cforce There was nothing much to fix in OpAMP, so no pull request was required; we just had to check that the config.yaml has what the collector needs to run.

@Asarew Have you ever verified this with HTTPS using TLS certs? What kind of certs should be used here? Any idea which certs to use for the HTTPS communication between these 3 modules?

MSA0208 commented 2 days ago

I have used self-signed certs generated with OpenSSL, but I get the following error:

Failed to connect to the OpAMP server {"kind": "extension", "name": "opamp", "error": "tls: first record does not look like a TLS handshake"}
Connection failed (tls: first record does not look like a TLS handshake), will retry. {"kind": "extension", "name": "opamp", "client": "ws"}
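In Go, this error is typically produced when a TLS client receives bytes that are not a TLS record, for example when a wss:// or https:// client connects to a port that is serving plain ws:// or http://. Whether that is the cause here is an assumption; the thread does not confirm it. The sketch below reproduces the same class of failure with Python's standard library: a TLS client connects to a deliberately plaintext server and the handshake fails before any certificate is examined. The server and function names are hypothetical.

```python
import socket
import ssl
import threading


def plaintext_server(listener: socket.socket) -> None:
    """Accept one connection and answer in plain HTTP, never TLS."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(4096)  # consume the client's TLS ClientHello
        conn.sendall(b"HTTP/1.1 400 Bad Request\r\n\r\n")


def tls_handshake_fails_against_plaintext() -> bool:
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=plaintext_server, args=(listener,), daemon=True)
    worker.start()

    # Certificate checks are disabled on purpose: the failure shown here
    # happens earlier, at the record layer, regardless of which certs are used.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=5) as raw:
            with ctx.wrap_socket(raw, server_hostname="localhost"):
                return False  # handshake unexpectedly succeeded
    except ssl.SSLError:
        return True  # the reply was not a TLS record
    finally:
        worker.join(timeout=5)
        listener.close()
```

If the same mismatch is present in this setup, the certificates themselves would not be the first thing to check; the scheme (ws vs. wss) and TLS settings on each of the three endpoints would be.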