signalfx / splunk-otel-collector

Apache License 2.0

Agent stopped on upgrade but not restarted #795

Closed jmapro closed 2 years ago

jmapro commented 3 years ago

On upgrade, the collector agent is stopped and disabled by the preinstall.sh script, but the postinstall.sh script never restarts it.

So when we do automatic system upgrades we have to take manual action to restart the agent. I think the agent should be restarted on upgrade.

Tested OS:
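For reference, a minimal sketch of how the maintainer scripts could preserve the running state across an upgrade. This is illustrative only (the state-file path and function names are made up, not the actual splunk-otel-collector package scripts):

```shell
#!/bin/sh
# Hypothetical sketch, not the real maintainer scripts:
# preinst records whether the unit was active before the upgrade,
# postinst restarts it afterwards.
STATE="${TMPDIR:-/tmp}/splunk-otel-collector.was-active"

save_state() {              # would run from preinstall.sh
    if systemctl is-active --quiet splunk-otel-collector; then
        touch "$STATE"
    fi
}

restore_state() {           # would run from postinstall.sh
    if [ -f "$STATE" ]; then
        systemctl enable --now splunk-otel-collector || true
        rm -f "$STATE"
    fi
}
```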

jeffreyc-splunk commented 3 years ago

Thanks @jmapro. We'll take a look at how best to address this.

jmapro commented 2 years ago

Hi !

I just got an update to v0.43.0 and this bug seems to still be present. My service was stopped and disabled after the update.

$ systemctl status splunk-otel-collector
● splunk-otel-collector.service - Splunk OpenTelemetry Collector
     Loaded: loaded (/lib/systemd/system/splunk-otel-collector.service; disabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/splunk-otel-collector.service.d
             └─service-owner.conf
     Active: inactive (dead)

||/ Name                  Version      Architecture Description
+++-=====================-============-============-=================================
iU  splunk-otel-collector 0.43.0       amd64        Splunk OpenTelemetry Collector
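The `iU` flags in that output mean the package's desired state is Installed but its current state is only Unpacked: dpkg never ran the configure (postinst) step, which would explain why the unit stayed disabled. A guess at the recovery step (the command is only echoed here, not executed):

```shell
#!/bin/sh
# "i" = desired state Installed, "U" = current state Unpacked only.
# dpkg --configure -a finishes the deferred configure step for all
# packages left in that state.
CMD="dpkg --configure -a"
echo "recovery: sudo $CMD"
```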

jeffreyc-splunk commented 2 years ago

@jmapro Please see if there are any errors in the journald logs (sudo journalctl -u splunk-otel-collector), or when starting the collector manually (otelcol --config=<path to your config file>).

jmapro commented 2 years ago

Feb 07 02:02:27  systemd[1]: Stopping Splunk OpenTelemetry Collector...
Feb 07 02:02:27  otelcol[229900]: 2022-02-07T02:02:27.921Z        info        service/collector.go:166        Received signal from OS        {"signal": "terminated"}
Feb 07 02:02:27  otelcol[229900]: 2022-02-07T02:02:27.923Z        info        service/collector.go:255        Starting shutdown...
Feb 07 02:02:27  otelcol[229900]: 2022-02-07T02:02:27.926Z        info        healthcheck/handler.go:129        Health Check state change        {"kind": "extension", "name": "health_check", "st>
Feb 07 02:02:27  otelcol[229900]: 2022-02-07T02:02:27.931Z        info        service/service.go:121        Stopping receivers...
Feb 07 02:02:27  otelcol[229900]: 2022-02-07T02:02:27.941Z        info        prometheusexecreceiver@v0.41.0/receiver.go:252        Subprocess start delay        {"kind": "receiver", "name": "pr>
Feb 07 02:02:27  otelcol[229900]: 2022-02-07T02:02:27.953Z        info        service/service.go:126        Stopping processors...
Feb 07 02:02:27  otelcol[229900]: 2022-02-07T02:02:27.953Z        info        builder/pipelines_builder.go:73        Pipeline is shutting down...        {"name": "pipeline", "name": "metrics"}
Feb 07 02:02:27  otelcol[229900]: 2022-02-07T02:02:27.954Z        info        builder/pipelines_builder.go:77        Pipeline is shutdown.        {"name": "pipeline", "name": "metrics"}
Feb 07 02:02:27  otelcol[229900]: 2022-02-07T02:02:27.954Z        info        builder/pipelines_builder.go:73        Pipeline is shutting down...        {"name": "pipeline", "name": "metrics/int>
Feb 07 02:02:27  otelcol[229900]: 2022-02-07T02:02:27.954Z        info        builder/pipelines_builder.go:77        Pipeline is shutdown.        {"name": "pipeline", "name": "metrics/internal"}
Feb 07 02:02:27  otelcol[229900]: 2022-02-07T02:02:27.954Z        info        builder/pipelines_builder.go:73        Pipeline is shutting down...        {"name": "pipeline", "name": "metrics/squ>
Feb 07 02:02:27  otelcol[229900]: 2022-02-07T02:02:27.954Z        info        builder/pipelines_builder.go:77        Pipeline is shutdown.        {"name": "pipeline", "name": "metrics/squid"}
Feb 07 02:02:27  otelcol[229900]: 2022-02-07T02:02:27.954Z        info        service/service.go:131        Stopping exporters...
Feb 07 02:02:27  otelcol[229900]: 2022-02-07T02:02:27.954Z        info        service/service.go:136        Stopping extensions...
Feb 07 02:02:27  otelcol[229900]: 2022-02-07T02:02:27.954Z        info        service/collector.go:273        Shutdown complete.
Feb 07 02:02:27  systemd[1]: splunk-otel-collector.service: Succeeded.
Feb 07 02:02:27  systemd[1]: Stopped Splunk OpenTelemetry Collector.
-- Reboot --
Feb 07 09:20:58  systemd[1]: Started Splunk OpenTelemetry Collector.
Feb 07 09:20:59  otelcol[54160]: 2022/02/07 09:20:59 main.go:280: Set config to /etc/otel/collector/agent_config.yaml
Feb 07 09:20:59  otelcol[54160]: 2022/02/07 09:20:59 main.go:346: Set ballast to 168 MiB
Feb 07 09:20:59  otelcol[54160]: 2022/02/07 09:20:59 main.go:360: Set memory limit to 460 MiB
Feb 07 09:20:59  otelcol[54160]: 2022/02/07 09:20:59 remove_ballast_key.go:41: [WARNING] `ballast_size_mib` parameter in `memory_limiter` processor is deprecated. Please update the config accord>
Feb 07 09:20:59  otelcol[54160]: 2022/02/07 09:20:59 move_otlp_insecure.go:42: Unsupported key found: exporters::otlp::insecure. Moving to exporters::otlp::tls::insecure

The agent was stopped at 02:02:27 this morning for a server reboot after system patching. The service did not restart until I started it manually. This is because the deb package update left the service in the disabled state.

I also had some non-blocking errors in the logs. The agent starts and sends all the other metrics.

Feb 07 02:01:32  otelcol[229900]: 2022-02-07T02:01:32.127Z        error        subprocessmanager/manager.go:101        subprocess output line        {"kind": "receiver", "name": "prometheus_exec/squid", "output": "2022/02/07 02:01:32 servicec times - could not parse line: Service Time Percentiles            5 min    60 min:"}
Feb 07 02:01:32   otelcol[229900]: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusexecreceiver/subprocessmanager.(*SubprocessConfig).pipeSubprocessOutput
Feb 07 02:01:32   otelcol[229900]:         /builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusexecreceiver@v0.41.0/subprocessmanager/manager.go:101
jeffreyc-splunk commented 2 years ago

Thanks @jmapro. I ran some basic upgrade tests, but have not yet been able to reproduce the issue. I'll continue to investigate, but please provide any additional info if possible.

  1. How was the collector package upgraded? With apt, or manually with dpkg, or some other method?
  2. Can you find the logs from the upgrade, maybe from /var/log/apt/*.log or /var/log/dpkg.log?
  3. Did the issue only occur after upgrading deb packages, or also with rpm upgrades?

jmapro commented 2 years ago

Thanks for your help @jcheng-splunk. I found an issue in my apt configuration; I need to force dpkg to keep the old conffiles (something like --force-confold).

Start-Date: 2022-02-07  02:02:26
Commandline: apt-get -y --only-upgrade true install splunk-otel-collector=0.43.0
Requested-By: nxautomation (995)
Upgrade: splunk-otel-collector:amd64 (0.41.0, 0.43.0)
Error: Sub-process /usr/bin/dpkg returned an error code (1)
End-Date: 2022-02-07  02:03:01

Log started: 2022-02-07  04:10:41
Setting up splunk-otel-collector (0.43.0) ...

Configuration file '/etc/otel/collector/agent_config.yaml'
 ==> Modified (by you or by a script) since installation.
 ==> Package distributor has shipped an updated version.
   What would you like to do about it ?  Your options are:
    Y or I  : install the package maintainer's version
    N or O  : keep your currently-installed version
      D     : show the differences between the versions
      Z     : start a shell to examine the situation
 The default action is to keep your current version.
*** agent_config.yaml (Y/I/N/O/D/Z) [default=N] ? dpkg: error processing package splunk-otel-collector (--configure):
 end of file on stdin at conffile prompt
Errors were encountered while processing:
 splunk-otel-collector
Log ended: 2022-02-07  04:10:42
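For what it's worth, the "end of file on stdin at conffile prompt" error above is what happens when an unattended run hits dpkg's conffile question with no terminal attached. A sketch of an upgrade invocation that keeps the locally modified agent_config.yaml instead of prompting. --force-confdef and --force-confold are standard dpkg options; the command is only echoed here, not executed:

```shell
#!/bin/sh
PKG=splunk-otel-collector
# --force-confdef: take the default action at conffile prompts;
# --force-confold: when there is no default, keep the installed conffile.
APT_OPTS='-o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold'
echo "DEBIAN_FRONTEND=noninteractive apt-get -y $APT_OPTS --only-upgrade install $PKG"
```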