saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0
14.15k stars 5.47k forks source link

salt-call waits infinitely on apt-get of mysql server community edition #33442

Closed zerthimon closed 7 years ago

zerthimon commented 8 years ago

Description of Issue/Question

salt-call hangs on the following command:

[INFO    ] Executing command ['apt-get', '-q', '-y', '-o', 'DPkg::Options::=--force-confold', '-o', 'DPkg::Options::=--force-confdef', 'install', 'mysql-server'] in directory '/root'

in processes:

4 S root     30114 16398  0  80   0 - 152946 pipe_w 20:53 pts/1   00:00:01 /usr/bin/python /usr/bin/salt-call state.highstate
4 Z root     30339 30114  0  80   0 -     0 exit   20:53 pts/1    00:00:01 [apt-get] <defunct>
0 S mysql    30808     1  0  80   0 -  1112 wait   20:53 ?        00:00:00 /bin/sh /usr/bin/mysqld_safe
0 S mysql    31598 30808  0  80   0 - 1004891 poll_s 20:53 ?      00:00:00 /usr/sbin/mysqld --basedir=/usr --datadir=/storage/mysql --plugin-dir=/usr/lib/mysql/plugin --log-error=/var/log/mysql/mysqld.log --open-files-limit=10000 --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --port=3306

in strace:

root@db02:/storage# strace -p 28116
Process 28116 attached
read(57,

in lsof:

salt-call 28116 root   57r  FIFO    0,9      0t0  56123 pipe

Setup

root@infra01:/srv/salt/mysql# cat server.sls
{% from 'macros/macros.jinja' import get_local with context -%}
{% set root_password = get_local('mysql:root_creds:password', '') -%}

include:
  - scripts
  - zabbix.agent
  - apparmor

mysql_repo:
  pkgrepo.managed:
    - names:
      - deb http://repo.mysql.com/apt/ubuntu/ trusty mysql-5.7
    - file: /etc/apt/sources.list.d/mysql.list
    - keyserver: pgp.mit.edu
    - keyid: 5072E1F5
    - enabled: True
    - refresh_db: True

mysql_group:
  group.present:
    - name: mysql
    - system: True

mysql_user:
  user.present:
    - name: mysql
    - groups:
      - mysql
    - system: True
    - require:
      - group: mysql

mysql_mount_dir:
  file.directory:
    - name: /storage
    - user: root
    - group: root
    - mode: 755

mysql_data_dir:
  file.directory:
    - name: /storage/mysql
    - user: mysql
    - group: mysql
    - mode: 755
    - require:
      - user: mysql_user

mysql_etc_dir:
  file.directory:
    - name: /etc/mysql
    - user: root
    - group: root
    - mode: 755

mysql_config_file:
  file.managed:
    - name: /etc/mysql/my.cnf
    - source: salt://mysql/server/my.cnf
    - user: root
    - group: root
    - mode: 644
    - require:
      - file: mysql_etc_dir

mysql_conf.d_dir:
  file.directory:
    - name: /etc/mysql/conf.d
    - user: root
    - group: root
    - mode: 755
    - require:
      - file: mysql_etc_dir

mysql_safe_config_file:
  file.managed:
    - name: /etc/mysql/conf.d/mysqld_safe_syslog.cnf
    - source: salt://mysql/server/mysqld_safe_syslog.cnf
    - user: root
    - group: root
    - mode: 644
    - require:
      - file: mysql_conf.d_dir

mysql_tuning_config_file:
  file.managed:
    - name: /etc/mysql/conf.d/mysqld_tuning.cnf
    - source: salt://mysql/server/mysqld_tuning.cnf
    - user: root
    - group: root
    - mode: 644
    - require:
      - file: mysql_conf.d_dir

mysql_repl_config_file:
  file.managed:
    - name: /etc/mysql/conf.d/mysqld_replication.cnf
    - source: salt://mysql/server/mysqld_replication.cnf
    - user: root
    - group: root
    - mode: 644
    - require:
      - file: mysql_conf.d_dir

mysql_apparmor_config:
  file.managed:
    - name: /etc/apparmor.d/usr.sbin.mysqld
    - source: salt://mysql/server/usr.sbin.mysqld.apparmor
    - user: root
    - group: root
    - mode: 644
    - watch_in:
      - service: apparmor_service

mysql_server_debconf:
  debconf.set:
    - name: mysql-community-server
    - data:
        'mysql-community-server/root-pass': {'type': 'password', 'value': '{{ root_password }}'}
        'mysql-community-server/re-root-pass': {'type': 'password', 'value': '{{ root_password }}'}

mysql_server_packages:
  pkg.installed:
    - names:
      - mysql-server
      - python-mysqldb
    - require:
      - debconf: mysql_server_debconf
      - user: mysql_user
      - file: mysql_data_dir
      - file: mysql_config_file
      - file: mysql_safe_config_file
      - file: mysql_tuning_config_file
      - file: mysql_repl_config_file
      - file: mysql_apparmor_config

Steps to Reproduce Issue

[INFO    ] Running state [mysql-server] at time 20:53:22.501370
[INFO    ] Executing state pkg.installed for mysql-server
[INFO    ] Executing command 'apt-get -q update' in directory '/root'
[INFO    ] Executing command ['apt-get', '-q', '-y', '-o', 'DPkg::Options::=--force-confold', '-o', 'DPkg::Options::=--force-confdef', 'install', 'mysql-server'] in directory '/root'

Hangs forever, the install of the package is finished in the background, the DB is initialized. apt-get is a zombie.

Versions Report

root@db02:/storage# salt-call --versions-report
Salt Version:
           Salt: 2015.8.10

Dependency Versions:
         Jinja2: 2.7.2
       M2Crypto: Not Installed
           Mako: 0.9.1
         PyYAML: 3.10
          PyZMQ: 14.0.1
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.4
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
        libgit2: Not Installed
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.3.0
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
   python-gnupg: Not Installed
          smmap: Not Installed
        timelib: Not Installed

System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.19.0-59-generic
         system: Ubuntu 14.04 trusty
Ch3LL commented 8 years ago

@zerthimon I'm not able to replicate this. Here is my test case:

root@vagrant-ubuntu-trusty-64:~# cat /srv/salt/test.sls
mysql_server_packages:
  pkg.installed:
    - names:
      - mysql-server
      - python-mysqldb

Run salt-call: salt-call --local state.sls test -ldebug

The packages are installed without delay:

Summary for local
------------
Succeeded: 2 (changed=2)
Failed:    0
------------
Total states run:     2
root@vagrant-ubuntu-trusty-64:~# salt --versions-report
Salt Version:
           Salt: 2015.8.10

Dependency Versions:
         Jinja2: 2.7.2
       M2Crypto: 0.21.1
           Mako: Not Installed
         PyYAML: 3.10
          PyZMQ: 14.0.1
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
           RAET: Not Installed
        Tornado: 4.3
            ZMQ: 4.0.4
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
        libgit2: Not Installed
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.3.0
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
   python-gnupg: Not Installed
          smmap: Not Installed
        timelib: Not Installed

System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.13.0-65-generic
         system: Ubuntu 14.04 trusty

Is there something in my test case above that is wrong that will help me to replicate this issue?

zerthimon commented 8 years ago

Now if the watched config file changes, the service restars and the salt-call doesn't finish. Waiting on:

INFO    ] Running state [mysql] at time 12:45:20.220713
[INFO    ] Executing state service.mod_watch for mysql
[INFO    ] Executing command ['service', 'mysql', 'status'] in directory '/root'
[INFO    ] Executing command ['service', 'mysql', 'restart'] in directory '/root'

The service is restarted, the init script is a zombie, salt-call never finishes:

4 S root      5450  2288  0  80   0 - 162966 pipe_w 12:45 pts/0   00:00:01 /usr/bin/python /usr/bin/salt-call state.highstate
0 Z root      5601  5450  0  80   0 -     0 exit   12:45 pts/0    00:00:00 [mysql] <defunct>
0 S mysql     5683     1  0  80   0 -  1112 wait   12:45 ?        00:00:00 /bin/sh /usr/bin/mysqld_safe
0 S mysql     6477  5683  0  80   0 - 1004890 poll_s 12:45 ?      00:00:00 /usr/sbin/mysqld --basedir=/usr --datadir=/storage/mysql --plugin-dir=/usr/lib/mysql/plugin --log-error=/var/log/mysql/mysqld.log --open-files-limit=10000 --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --port=3306

salt-call process straced:

root@db02:~# strace -p 5450
Process 5450 attached
read(57,

salt-call process lsof:

salt-call 5450 root   57r  FIFO    0,9      0t0  22461 pipe
zerthimon commented 8 years ago

Same situation when service is stopped and salt-call enforces the start:

[INFO    ] Executing command ['service', 'mysql', 'status'] in directory '/root'
[ERROR   ] Command ['service', 'mysql', 'status'] failed with return code: 3
[ERROR   ] output:  * MySQL Community Server 5.7.12 is not running
[INFO    ] Executing command ['service', 'mysql', 'start'] in directory '/root'
zerthimon commented 8 years ago

@Ch3LL To reproduce use the following SLS

root@db01:/srv/salt# cat test.sls 
mysql_repo:
  pkgrepo.managed:
    - names:
      - deb http://repo.mysql.com/apt/ubuntu/ trusty mysql-5.7
    - file: /etc/apt/sources.list.d/mysql.list
    - keyserver: pgp.mit.edu
    - keyid: 5072E1F5
    - enabled: True
    - refresh_db: True

mysql_server_debconf:
  debconf.set:
    - name: mysql-community-server
    - data:
        'mysql-community-server/root-pass': {'type': 'password', 'value': 'root'}
        'mysql-community-server/re-root-pass': {'type': 'password', 'value': 'root'}

mysql_server_packages:
  pkg.installed:
    - names:
      - mysql-server
    - require:
      - debconf: mysql_server_debconf
zerthimon commented 8 years ago

This may be related: https://github.com/saltstack/salt/issues/30255

zerthimon commented 8 years ago

Another related issue https://github.com/saltstack/salt/issues/16153

zerthimon commented 8 years ago

What's weird about this is that when I just do ...

service mysql restart

... in shell, it works fine, but when salt does that - it hangs the salt!

Edit: Having this on all servers with mysql server community edition installed from official MySQL (Oracle) repo. Edit Edit: @Ch3LL have you been able to replicate issue with the state I provided ?

zerthimon commented 8 years ago

another related issue: https://github.com/saltstack/salt/issues/9736

Ch3LL commented 8 years ago

@zerthimon i'm wondering if you are actually running into this issue #36116 as the issue states looks likethe service files for mysql have changed which has caused a hanging issue when trying to start the mysql community server service

zerthimon commented 8 years ago

@Ch3LL I have solved this problem for myself by changing the mysql init.d file:

I had to replace 2 lines (original lines are commented) to use debian's start-stop-daemon instead of su:

                        # su - mysql -s /bin/bash -c "/usr/sbin/mysqld --log_error_verbosity=2 2>&1 > /dev/null &"
                        start-stop-daemon --start --quiet --background --user mysql --exec /usr/sbin/mysqld

and:

                # su - mysql -s /bin/bash -c "mysqld_safe > /dev/null &"
                start-stop-daemon --start --quiet --background --exec /usr/bin/mysqld_safe

In my sls state I ensure that init.d file is created BEFORE the mysql community package is installed.

P.S.: So this seems to me, that su together with bash cause the problem ... I'm not sure what's exactly happening there (signals forwarding ? leaked descriptors ? tty problems ?).....

EDIT: There is a small utility https://github.com/tianon/gosu which is used in building docker containers and is made to overcome the problems caused by su. Just thought it may be solving the same problem we have.

zerthimon commented 8 years ago

Also, actually, another workaround: https://github.com/saltstack/salt/issues/9736#issuecomment-176351724 by @michalsuba which is excellent and shows that it's a problem of a leaked file descriptor.

Now there is a question, can salt detect such cases and not hang ?

Ch3LL commented 8 years ago

Awesome thanks for linking all these issues. Great work! Since it does indeed seem like #9736 is a duplciate do you mind if we close this here and track the issue there so we aren't managing multiple issues? If you have any further comments regarding the issue please comment on that issue if you agree this is a duplicate.

zerthimon commented 8 years ago

@Ch3LL yes this issue can be closed, however, my question still remains: can salt be taught to deal with such sutiation ?

Ch3LL commented 8 years ago

I'm not entirely sure on that answer or how we would handle that on our end but I would like to ping @dmurphy18 . Is this your area of expertise @dmurphy18 ? Do you know if this is something we can handle on salts side of things with how mysql is changing in their startup files?

dmurphy18 commented 8 years ago

@Ch3LL Do not think we can handle mysql changing their startup files, since it would be an endless task chasing every release. If there is an error in their init files, as suggested by dropping resources (leaked file descriptor), then the best suggestion is to file an upstream bug with them to fix it.

mbochenk commented 8 years ago

@dmurphy18 in my experience this occurs for couple of services (for example, java ones) that have init.d scripts not on par with ones by distro maintainers. While I agree that services doing it wrong should fix in upstream, would it be possible to have a workaround? For example, have a parameter like do_not_wait or detach in either service or cmd?

Using nohup did not work for me in cmd when using brute service mysqld start, and as far as I remember deb packages attempt to start services on instllation as opposed to rpm. It also makes (some) sense for cmd.

davidalger commented 7 years ago

So I'm not a user of salt, but found this issue while investigating an issue where I'm experiencing the same thing in an inline bash provisioner on a vagrant box. This helped me narrow down a reproducible test case, and I'm posting it here in case it helps others.

Running as root…

  1. Stop mysqld

service mysqld stop

  1. Attempt to start via the following

service mysqld start 2>&1 | cat

If I remove either the cat or the stderr redirect, the issue doesn't occur. If I alter /etc/init.d/mysqld to revert the change @mbochenk noted in #36116 had been made in mysql/mysql-server@684a165#diff-26228c60f7753d111a177fa638fb5838R283

The real life scenario in my case is more complicated than just running the output through cat, but it appears that any attempt to run the service call in a way which captures the stderr output into an stdin descriptor triggers the issue.

EDIT: Another change which works around the (apparent) leaked descriptor issues changing the start call in my test case to the following:

service mysqld start 2> >(cat)

This solved the issue for me since I'm essentially just needing to silence a bogus error message it spits out due the NFS mount we're running in Vagrant for the purpose of storing the mysql data on the host machine.

mbochenk commented 7 years ago

@davegiles thanks. I did not think of redirecting service output in cmd, need to try that out. This will not work for apt-get mysql though - unless not installed via pkg.

MarkMartinec commented 7 years ago

See [https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=214643] [Bug 214643] net-mgmt/telegraf: service (re)start hangs when scripted: missing option -f with daemon(8)

I expect we are seeing the same issue here.

To quote the essential summary from that PR:

I came across this issue when trying to manage a service 'telegraf' with SaltStack (i.e. sysutils/py-salt). I would expect the same problem can occur under other management tools like Puppet/Chef/Ansible.

What happens is that when a salt-minion tries to start or restart a telegraf process by spawning a command: 'service telegraf restart', the process just hangs, waiting for a restart to happen, even though the telegraf process does restart successfully and is again running underneath. [...] $ service telegraf restart | cat (hangs) [...] So what happens is:

  • The process executing the daemon(8) program (as started by the 'service' command) has its stdout directed to a pipe;
  • The daemon(8) program disassociates from a controlling terminal but DOES NOT close its stdout, and then forks;
  • The forked daemon(8) subprocess inherits the open fd to a pipe;
  • The parent daemon(8) then exits, but the child daemon(8) still has a pipe open, connected to cat(1) (or to salt-minion);
  • the child daemon(8) process then exec's the telegraf program, which again inherits the pipe as its stdout.

On the SaltStack side I suppose the workaround is to provide an option to the cmd module to avoid it connecting a pipe to stdout and stderr of the spawned process.

MarkMartinec commented 7 years ago

To make myself explicit: services which behave as described above (a process terminates but passes its descriptors to other processes before exiting) may indeed not be behaving as nice citizens, but the bug lies entirely in SaltStack.

When SaltStack spawns some process (i.e. runs a command) and creates a pipe to its stdout and stderr, SaltStack must establish a SIGCHLD handler. When the spawned process terminates, any I/O on such pipes must be terminated (even if the pipe is still open on the other side). Promptly reclaiming (waitpid) the exit status of a terminated child process also avoids having zombies lying around.

smparekh commented 7 years ago

Hey all, Been struggling with the issue here when trying to start a process via a custom service file on Suse 11. Salt was hanging on service.running state where my start code was /sbin/startproc $BIN. I finally solved it using the -q flag on startproc /sbin/startproc -q $BIN.

I think the root cause for me was that the binary I am starting keeps a STDOUT session open and even though doing a manual service BIN start was exiting correctly, calling the state via salt was not.

MarkMartinec commented 7 years ago

Please remove the 'Info Needed' blocking label. The issue has been explained and necessary information provided.

Ch3LL commented 7 years ago

Thanks for everyones investigation here. There is this issue here https://github.com/saltstack/salt/issues/9736 and there is discussion there about adding an additional parameter to help workaround this issue. I'm going to close this issue so we can track it there. If anyone has any ideas about the additional parameter please discuess there. Thanks