treasure-data / chef-td-agent

Chef Cookbook for td-agent (Treasure Agent or Fluentd)
https://supermarket.chef.io/cookbooks/td-agent
Apache License 2.0

td-agent chef recipe fails during bootstrap due to systemd conflict #136

Open niclan opened 4 years ago

niclan commented 4 years ago

Td-agent cookbook 3.1.1, chef-client 14.13.11, CentOS 7.7.1908; fresh install, systemd 219-67.el7_7.2. Installs td-agent 3.5.1

When bootstrapping a node into chef with a command such as this:

knife bootstrap -u root -t rhel7-omnitruck u89-niclangf-01.int.vgnett.no -r "role[vgnett_base]"

the resource startup fails due to a systemd error:

[2020-01-02T08:56:14+01:00] FATAL: Stacktrace dumped to /var/cache/chef/chef-stacktrace.out
[2020-01-02T08:56:14+01:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2020-01-02T08:56:14+01:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: service[td-agent] (td-agent::configure line 77) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of /usr/bin/systemctl --system restart td-agent ----
STDOUT: 
STDERR: Job for td-agent.service failed because the control process exited with error code. See "systemctl status td-agent.service" and "journalctl -xe" for details.
---- End output of /usr/bin/systemctl --system restart td-agent ----

System log:

Jan 02 08:56:12 u89-niclangf-01 systemd[1]: Starting td-agent: Fluentd based data collector for Treasure Data...
Jan 02 08:56:12 u89-niclangf-01 systemd[75018]: Failed at step RUNTIME_DIRECTORY spawning /opt/td-agent/embedded/bin/fluentd: File exists
Jan 02 08:56:12 u89-niclangf-01 systemd[1]: td-agent.service: control process exited, code=exited status=233
Jan 02 08:56:12 u89-niclangf-01 systemd[1]: Failed to start td-agent: Fluentd based data collector for Treasure Data.
Jan 02 08:56:12 u89-niclangf-01 systemd[1]: Unit td-agent.service entered failed state.
Jan 02 08:56:12 u89-niclangf-01 systemd[1]: td-agent.service failed.
Jan 02 08:56:12 u89-niclangf-01 systemd[1]: td-agent.service holdoff time over, scheduling restart.
Jan 02 08:56:12 u89-niclangf-01 systemd[1]: Stopped td-agent: Fluentd based data collector for Treasure Data.
Jan 02 08:56:12 u89-niclangf-01 systemd[1]: Starting td-agent: Fluentd based data collector for Treasure Data...

As you can see, the process starts fine on the second try, but the first failure disrupts the bootstrap process.

The RUNTIME_DIRECTORY issue crops up in web searches, but most of the hits are old. This one seems recent and directly relevant: https://github.com/puppetlabs/puppet_metrics_dashboard/issues/37

@sharpie quoth:

This appears to be a conflict between Puppet_metrics_dashboard::Service/Exec[Create Systemd temp Files] and systemd over who gets to create the /run/grafana directory. ... So, we probably need to drop the logic around creating the /run/grafana directory since systemd is now handling it.

The td-agent RPM package postinstall script goes like this:

if [ ! -e "/var/run/td-agent/" ]; then
  mkdir -p /var/run/td-agent/
fi

In other tickets @poettering says that systemd should only complain if the directory exists and has the wrong owners, but patching the chef recipe to create the directory with the correct ownership did not help.
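
To rule the ownership theory in or out on an affected node, a check along these lines could be run by hand. This is only a sketch in plain Ruby: the helper name `runtime_dir_state` is made up for illustration, and the uid to compare against (presumably that of the user the td-agent unit runs as) is an assumption.

```ruby
# Hypothetical helper: classify the state of a runtime directory.
# expected_uid is the numeric uid the service is assumed to run as.
def runtime_dir_state(path, expected_uid)
  return :missing unless File.exist?(path)
  return :not_a_directory unless File.directory?(path)

  # Compare the directory's owner with the uid we expect.
  File.stat(path).uid == expected_uid ? :ok : :wrong_owner
end

# Example call (assumes a "td-agent" user exists on the node):
#   require "etc"
#   runtime_dir_state("/var/run/td-agent", Etc.getpwnam("td-agent").uid)
```

A result of `:wrong_owner` right after the package install would fit the ownership explanation; `:ok` combined with the same systemd failure would point elsewhere.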

Since the SPEC file for the RPM package has been hard to find, I've instead tried another fix in the chef recipe (td-agent::install):

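# Declared with :nothing so it only acts when notified by the package below.
directory "/var/run/td-agent" do
  action :nothing
end

package "td-agent" do
  retries 3
  retry_delay 10
  if node["td_agent"]["pinning_version"]
    action :install
    version node["td_agent"]["version"]
  else
    action :upgrade
  end
  # Delete the directory created by the RPM postinstall script as soon as the
  # package has been installed or upgraded, before the service is started.
  notifies :delete, 'directory[/var/run/td-agent]', :immediately
end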

which seems to have fixed it.
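
The same cleanup can also be tried outside Chef, for example by hand on a freshly installed node before the first service start. A minimal plain-Ruby sketch of that idea follows; the helper name `remove_precreated_runtime_dir` is hypothetical and not part of the cookbook.

```ruby
require "fileutils"

# Hypothetical helper mirroring the recipe's intent: delete a pre-created
# runtime directory so that whatever starts the service finds no directory.
# Returns true if the directory was removed, false if there was nothing to do.
def remove_precreated_runtime_dir(path)
  # Only act on a real directory, never on a symlink pointing elsewhere.
  return false unless File.directory?(path) && !File.symlink?(path)

  FileUtils.rm_r(path)
  true
end

# Example call (needs root on a real node):
#   remove_precreated_runtime_dir("/var/run/td-agent")
```

Like the `notifies :delete` approach, this is idempotent: a second call simply returns false.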

So it turns out that this is an issue with the interaction between the RPM package and systemd, not with the chef cookbook. I'm still posting here since we only experience the problem when using chef.

niclan commented 4 years ago

I'll email the person whose email address is found in the rpm package.