monitoringartist / zabbix-docker-monitoring

:whale: Docker/Kubernetes/Mesos/Marathon/Chronos/LXC/LXD/Swarm container monitoring - Docker image, Zabbix template and C module
https://hub.docker.com/r/monitoringartist/zabbix-agent-xxl-limited/
GNU General Public License v2.0

Zabbix agent gets stuck on server reboot #121

Closed skokhanovskiy closed 5 years ago

skokhanovskiy commented 5 years ago

The Zabbix agent with the zabbix_module_docker.so module gets stuck on server reboot. Only manually restarting the zabbix agent service helps in that case.

Version of zabbix agent:

# zabbix_agentd --version
zabbix_agentd (daemon) (Zabbix) 4.0.7
Revision 92831 18 April 2019, compilation time: Apr 18 2019 07:53:42

Copyright (C) 2019 Zabbix SIA
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it according to
the license. There is NO WARRANTY, to the extent permitted by law.

This product includes software developed by the OpenSSL Project
for use in the OpenSSL Toolkit (http://www.openssl.org/).

Compiled with OpenSSL 1.1.0f  25 May 2017
Running with OpenSSL 1.1.0j  20 Nov 2018

Version of zabbix-docker-monitoring: latest compiled from master.

Logs of the stuck zabbix agent:

   547:20190513:111955.787 Starting Zabbix Agent [dev-hashmart-01]. Zabbix 4.0.7 (revision 92831).
   547:20190513:111955.818 **** Enabled features ****
   547:20190513:111955.818 IPv6 support:          YES
   547:20190513:111955.818 TLS support:           YES
   547:20190513:111955.818 **************************
   547:20190513:111955.818 using configuration file: /etc/zabbix/zabbix_agentd.conf
   547:20190513:111955.818 In zbx_load_modules()
   547:20190513:111955.818 loading module "/usr/lib/zabbix/modules/zabbix_module_docker.so"
   547:20190513:111955.860 In zbx_module_api_version()
   547:20190513:111955.860 In zbx_module_init()
   547:20190513:111955.860 zabbix_module_docker v0.6.9, compilation time: Apr 25 2019 07:38:36
   547:20190513:111955.860 In zbx_docker_dir_detect()
   547:20190513:111955.882 Detected docker stat directory: /sys/fs/cgroup/
   547:20190513:111955.882 Cannot detect used docker driver
   547:20190513:111955.882 In zbx_docker_api_detect()
   547:20190513:111955.882 In zbx_docker_perm()
   547:20190513:111955.882 zabbix agent user has docker perm
   547:20190513:111955.882 In zbx_module_docker_socket_query()
   547:20190513:111955.883 Docker's socket query: GET /_ping HTTP/1.0

After that nothing happens for a long time.

I've tried adding the docker service as a dependency of the zabbix agent service in systemd.

# mkdir -p /etc/systemd/system/zabbix-agent.service.wants
# ln -s /etc/systemd/system/docker.service /etc/systemd/system/zabbix-agent.service.wants/docker.service
# systemctl daemon-reload
# systemctl list-dependencies zabbix-agent.service
zabbix-agent.service
● ├─docker.service
● ├─system.slice
● └─sysinit.target
●   ├─dev-hugepages.mount
...

~But this didn't help.~ Look at https://github.com/monitoringartist/zabbix-docker-monitoring/issues/121#issuecomment-491810760

I found that the socket timeouts are set here: https://github.com/monitoringartist/zabbix-docker-monitoring/blob/27709c75b74e6404295b5b56b846b4e3b6d8f982/src/modules/zabbix_module_docker/zabbix_module_docker.c#L172-L180

The timeout values are taken from fields of the stimeout struct, which is initialized in the zbx_module_item_timeout function: https://github.com/monitoringartist/zabbix-docker-monitoring/blob/27709c75b74e6404295b5b56b846b4e3b6d8f982/src/modules/zabbix_module_docker/zabbix_module_docker.c#L105-L119

But the zabbix agent calls this function only after this query. The absence of an "In zbx_module_item_timeout()" line in the logs confirms my hunch. I think the first ping query is made with a zero (i.e. infinite) timeout, and the request then blocks indefinitely against a docker daemon that has not fully started yet.
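A minimal sketch of the suspected failure mode and one possible guard, assuming Linux socket semantics, where a zero `SO_RCVTIMEO`/`SO_SNDTIMEO` value means the call never times out. The names `effective_timeout` and `set_socket_timeouts` and the 30-second default are hypothetical; they are not taken from the module's source and do not describe how the module was actually fixed:

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical fallback for when the agent has not supplied a timeout yet. */
#define DEFAULT_SOCKET_TIMEOUT_SEC 30

/* A zero timeval means "block forever" for SO_RCVTIMEO/SO_SNDTIMEO,
 * so substitute a finite default in that case. */
struct timeval effective_timeout(struct timeval configured)
{
    if (configured.tv_sec == 0 && configured.tv_usec == 0)
        configured.tv_sec = DEFAULT_SOCKET_TIMEOUT_SEC;
    return configured;
}

/* Apply receive and send timeouts to a socket.
 * Returns 0 on success, -1 if setsockopt() fails. */
int set_socket_timeouts(int fd, struct timeval configured)
{
    struct timeval tv = effective_timeout(configured);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0)
        return -1;
    return 0;
}
```

With a guard like this, a query issued before the timeout struct has been initialized would still give up after a bounded wait instead of hanging the agent's startup.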

jangaraj commented 5 years ago

Please try to simulate the agent's communication with the Docker API (ping) from the command line:

curl --unix-socket /var/run/docker.sock http:/_ping

skokhanovskiy commented 5 years ago

I guess you mean http://localhost/_ping in the curl command line. Here it is:

# docker --version
Docker version 18.09.0, build 4d60db4
# curl -v --unix-socket /var/run/docker.sock http://localhost/_ping
*   Trying /var/run/docker.sock...
* Connected to localhost (/var/run/docker.sock) port 80 (#0)
> GET /_ping HTTP/1.1
> Host: localhost
> User-Agent: curl/7.52.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Api-Version: 1.39
< Docker-Experimental: false
< Ostype: linux
< Server: Docker/18.09.5 (linux)
< Date: Mon, 13 May 2019 12:31:37 GMT
< Content-Length: 2
< Content-Type: text/plain; charset=utf-8
<
* Curl_http_done: called premature == 0
* Connection #0 to host localhost left intact

Once again, I'd like to draw attention to the fact that restarting the zabbix-agent service fixes the problem once docker is already up and running. The described behavior occurs only while the server is booting.
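For reference, the same ping can be sketched at the socket level. This is an illustration only, assuming Linux and a Unix stream socket; `docker_ping` is a hypothetical helper, not the module's `zbx_module_docker_socket_query()`. It sends the request line seen in the agent log and always applies a non-zero timeout:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Send "GET /_ping" over a Unix socket and copy the raw HTTP response into buf.
 * Returns the number of bytes read, or -1 on invalid arguments, connect/send
 * failure, or a receive error such as a timeout. The timeout applies to each
 * blocking call separately; timeout_sec must be > 0 because zero would mean
 * "wait forever". */
ssize_t docker_ping(const char *sock_path, int timeout_sec, char *buf, size_t buflen)
{
    static const char request[] = "GET /_ping HTTP/1.0\r\n\r\n";
    struct sockaddr_un addr;
    struct timeval tv;
    ssize_t n = 0, total = 0;
    int fd;

    if (timeout_sec <= 0 || buflen == 0 ||
        strlen(sock_path) >= sizeof(addr.sun_path))
        return -1;

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Bound both directions so a silent daemon cannot block us indefinitely. */
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0 ||
        setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0) {
        close(fd);
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        send(fd, request, sizeof(request) - 1, MSG_NOSIGNAL) < 0) {
        close(fd);
        return -1;
    }

    /* HTTP/1.0: the server closes the connection after the response,
     * so read until EOF, an error, or a full buffer. */
    while ((size_t)total < buflen - 1 &&
           (n = recv(fd, buf + total, buflen - 1 - (size_t)total, 0)) > 0)
        total += n;

    close(fd);
    if (n < 0)
        return -1;
    buf[total] = '\0';
    return total;
}
```

Calling `docker_ping("/var/run/docker.sock", 5, buf, sizeof(buf))` during boot would then either fill `buf` with the daemon's response or return -1 once a blocking call exceeds five seconds.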

skokhanovskiy commented 5 years ago

~To work around this bug I added a delay between starting the docker and zabbix-agent services: a systemd timer that starts the zabbix-agent service 15 seconds after docker starts.~ To work around this issue I changed the configuration of the zabbix-agent systemd unit. This changes the boot order, so that on boot systemd starts zabbix-agent only when the docker service is already running.

$ cat /etc/systemd/system/zabbix-agent.service.d/docker.conf

[Unit]
Wants=docker.service
After=docker.service

# systemctl daemon-reload

This helps, but the error in the module logic is still there.

jangaraj commented 5 years ago

Thank you for the additional details. Would you be able to create a pull request that fixes the broken module logic, please? It looks like a problem with the socket timeout.

skokhanovskiy commented 5 years ago

@jangaraj #127 should fix this issue.

   853:20190604:093448.726 Starting Zabbix Agent [orn-runners-01]. Zabbix 4.0.8 (revision 2b50c941de).
   853:20190604:093448.726 **** Enabled features ****
   853:20190604:093448.726 IPv6 support:          YES
   853:20190604:093448.726 TLS support:           YES
   853:20190604:093448.726 **************************
   853:20190604:093448.726 using configuration file: /etc/zabbix/zabbix_agentd.conf
   853:20190604:093448.726 In zbx_load_modules()
   853:20190604:093448.726 loading module "/usr/lib/zabbix/modules/zabbix_module_docker.so"
   853:20190604:093449.037 In zbx_module_api_version()
   853:20190604:093449.037 In zbx_module_init()
   853:20190604:093449.037 zabbix_module_docker v0.6.9, compilation time: Jun  4 2019 18:22:14
   853:20190604:093449.037 In zbx_docker_dir_detect()
   853:20190604:093449.037 Detected docker stat directory: /sys/fs/cgroup/
   853:20190604:093449.037 Cannot detect used docker driver
   853:20190604:093449.037 In zbx_docker_api_detect()
   853:20190604:093449.037 In zbx_docker_perm()
   853:20190604:093449.037 zabbix agent user has docker perm
   853:20190604:093449.037 In zbx_module_docker_socket_query()
   853:20190604:093449.037 Docker's socket query: GET /_ping HTTP/1.0
!  853:20190604:093519.298 Docker's socket response: [{}]
   853:20190604:093519.298 Docker's socket doesn't work - only basic docker metrics are available
   853:20190604:093519.298 In zbx_module_item_list()
   853:20190604:093519.298 In zbx_module_item_timeout()
   853:20190604:093519.298 cannot find "zbx_module_history_write_cbs()" function in module "zabbix_module_docker.so": /usr/lib/zabbix/modules/zabbix_module_docker.so: undefined symbol: zbx_module_history_write_cbs
   853:20190604:093519.298 loaded modules: zabbix_module_docker.so

Waiting for review.