saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0
14.1k stars 5.47k forks source link

Master process dies due to clear_old_jobs #18716

Closed syphernl closed 8 years ago

syphernl commented 9 years ago

In our development environment our Salt-master has died. It happened straight after we moved some files but that may be a coincidence.

The master log contains:

2014-12-04 12:34:59,383 [salt.master      ][INFO    ] AES payload received with command _master_opts
2014-12-04 12:34:59,386 [salt.master      ][INFO    ] AES payload received with command _file_list
2014-12-04 12:35:01,799 [salt.utils.process][INFO    ] Process <bound method Master._clear_old_jobs of <salt.master.Master object at 0x7f893d87bbd0>> (27263) died with exit status None, restarting...
2014-12-04 12:35:04,049 [salt.utils.process][INFO    ] Process <bound method Master._clear_old_jobs of <salt.master.Master object at 0x7f893d87bbd0>> (27286) died with exit status None, restarting...
2014-12-04 12:35:06,217 [salt.utils.process][INFO    ] Process <bound method Master._clear_old_jobs of <salt.master.Master object at 0x7f893d87bbd0>> (27309) died with exit status None, restarting...
2014-12-04 12:35:08,432 [salt.utils.process][INFO    ] Process <bound method Master._clear_old_jobs of <salt.master.Master object at 0x7f893d87bbd0>> (27332) died with exit status None, restarting...
2014-12-04 12:35:10,670 [salt.utils.process][INFO    ] Process <bound method Master._clear_old_jobs of <salt.master.Master object at 0x7f893d87bbd0>> (27355) died with exit status None, restarting...
2014-12-04 12:35:13,028 [salt.utils.process][INFO    ] Process <bound method Master._clear_old_jobs of <salt.master.Master object at 0x7f893d87bbd0>> (27378) died with exit status None, restarting...

This keeps going on for a few minutes until it simply stopped and no master processes remained. Dmesg shows a lot of errors/problems as well which all started at the same time:

[do dec  4 12:40:23 2014] BUG: unable to handle kernel NULL pointer dereference at           (null)
[do dec  4 12:40:23 2014] IP: [<          (null)>]           (null)
[do dec  4 12:40:23 2014] PGD 464c3067 PUD 7bbfb067 PMD 0 
[do dec  4 12:40:23 2014] Oops: 0010 [#149] SMP 
[do dec  4 12:40:23 2014] Modules linked in: btrfs raid6_pq xor ufs msdos xfs libcrc32c vboxsf(OX) nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache dm_crypt ppdev serio_raw parport_pc parport vboxguest(OX) psmouse ahci libahci e1000
[do dec  4 12:40:23 2014] CPU: 0 PID: 30685 Comm: salt-master Tainted: G      D    OX 3.13.0-39-generic #66-Ubuntu
[do dec  4 12:40:23 2014] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[do dec  4 12:40:23 2014] task: ffff88007b600000 ti: ffff880051caa000 task.ti: ffff880051caa000
[do dec  4 12:40:23 2014] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[do dec  4 12:40:23 2014] RSP: 0018:ffff880051cabd00  EFLAGS: 00010246
[do dec  4 12:40:23 2014] RAX: ffffffffa01f8140 RBX: 0000000000000000 RCX: 0000000054804828
[do dec  4 12:40:23 2014] RDX: 000000000000017c RSI: ffff880051cabdd0 RDI: ffff88005517c9c0
[do dec  4 12:40:23 2014] RBP: ffff880051cabd88 R08: 0000000054804828 R09: 0000000000000001
[do dec  4 12:40:23 2014] R10: ffffea0000de08c0 R11: ffffffffa009eb4a R12: ffff88007b600000
[do dec  4 12:40:23 2014] R13: ffff880051cabdd0 R14: ffff880051cabd50 R15: ffff88005517c9c0
[do dec  4 12:40:23 2014] FS:  00007f89458fb740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[do dec  4 12:40:23 2014] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[do dec  4 12:40:23 2014] CR2: 0000000000000000 CR3: 000000007c302000 CR4: 00000000000006f0
[do dec  4 12:40:23 2014] Stack:
[do dec  4 12:40:23 2014]  ffffffff811cbe09 ffffffff81312ecb ffff88007b600000 ffff880051cabe60
[do dec  4 12:40:23 2014]  ffff88007b600000 0000000000000000 ffff88005b80fda8 0000000000000024
[do dec  4 12:40:23 2014]  ffff88007a9b3ca0 ffff88005517c9c0 ffff88007a9b3ca0 ffff88005517c9c0
[do dec  4 12:40:23 2014] Call Trace:
[do dec  4 12:40:23 2014]  [<ffffffff811cbe09>] ? path_lookupat+0x439/0x790
[do dec  4 12:40:23 2014]  [<ffffffff81312ecb>] ? apparmor_file_alloc_security+0x5b/0x180
[do dec  4 12:40:23 2014]  [<ffffffff811cc18b>] filename_lookup+0x2b/0xc0
[do dec  4 12:40:23 2014]  [<ffffffff811cdd24>] user_path_at_empty+0x54/0x90
[do dec  4 12:40:23 2014]  [<ffffffff810c7e88>] ? __call_rcu_nocb_enqueue+0xa8/0xc0
[do dec  4 12:40:23 2014]  [<ffffffff811cdd71>] user_path_at+0x11/0x20
[do dec  4 12:40:23 2014]  [<ffffffff811c1f90>] vfs_fstatat+0x50/0xa0
[do dec  4 12:40:23 2014]  [<ffffffff811c242f>] SYSC_newstat+0x1f/0x40
[do dec  4 12:40:23 2014]  [<ffffffff8108822c>] ? task_work_run+0xac/0xe0
[do dec  4 12:40:23 2014]  [<ffffffff81013df7>] ? do_notify_resume+0x97/0xb0
[do dec  4 12:40:23 2014]  [<ffffffff8172faaa>] ? int_signal+0x12/0x17
[do dec  4 12:40:23 2014]  [<ffffffff811c267e>] SyS_newstat+0xe/0x10
[do dec  4 12:40:23 2014]  [<ffffffff8172f7ed>] system_call_fastpath+0x1a/0x1f
[do dec  4 12:40:23 2014] Code:  Bad RIP value.
[do dec  4 12:40:23 2014] RIP  [<          (null)>]           (null)
[do dec  4 12:40:23 2014]  RSP <ffff880051cabd00>
[do dec  4 12:40:23 2014] CR2: 0000000000000000
[do dec  4 12:40:23 2014] ---[ end trace 93a5bb8f5a25fa9b ]---

This is on Ubuntu 14.04 running in Virtualbox (Vagrant) with version details:

           Salt: 2014.7.0
         Python: 2.7.6 (default, Mar 22 2014, 22:59:56)
         Jinja2: 2.7.2
       M2Crypto: 0.21.1
 msgpack-python: 0.3.0
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 14.0.1
           RAET: Not Installed
            ZMQ: 4.0.4
           Mako: 0.9.1
cachedout commented 9 years ago

Could you elaborate on what you mean by "moved some files", please?

On Thu, Dec 4, 2014 at 4:42 AM, Frank notifications@github.com wrote:

In our development environment our Salt-master has died. It happened straight after we moved some files but that may be a coincidence.

The master log contains:

2014-12-04 12:34:59,383 [salt.master ][INFO ] AES payload received with command _master_opts 2014-12-04 12:34:59,386 [salt.master ][INFO ] AES payload received with command _file_list 2014-12-04 12:35:01,799 [salt.utils.process][INFO ] Process <bound method Master._clear_old_jobs of <salt.master.Master object at 0x7f893d87bbd0>> (27263) died with exit status None, restarting... 2014-12-04 12:35:04,049 [salt.utils.process][INFO ] Process <bound method Master._clear_old_jobs of <salt.master.Master object at 0x7f893d87bbd0>> (27286) died with exit status None, restarting... 2014-12-04 12:35:06,217 [salt.utils.process][INFO ] Process <bound method Master._clear_old_jobs of <salt.master.Master object at 0x7f893d87bbd0>> (27309) died with exit status None, restarting... 2014-12-04 12:35:08,432 [salt.utils.process][INFO ] Process <bound method Master._clear_old_jobs of <salt.master.Master object at 0x7f893d87bbd0>> (27332) died with exit status None, restarting... 2014-12-04 12:35:10,670 [salt.utils.process][INFO ] Process <bound method Master._clear_old_jobs of <salt.master.Master object at 0x7f893d87bbd0>> (27355) died with exit status None, restarting... 2014-12-04 12:35:13,028 [salt.utils.process][INFO ] Process <bound method Master._clear_old_jobs of <salt.master.Master object at 0x7f893d87bbd0>> (27378) died with exit status None, restarting...

This keeps going on for a few minutes until it simply stopped and no master processes remained. Dmesg shows a lot of errors/problems as well which all started at the same time:

[do dec 4 12:40:23 2014] BUG: unable to handle kernel NULL pointer dereference at (null) [do dec 4 12:40:23 2014] IP: < (null)> [do dec 4 12:40:23 2014] PGD 464c3067 PUD 7bbfb067 PMD 0 [do dec 4 12:40:23 2014] Oops: 0010 [#149] SMP [do dec 4 12:40:23 2014] Modules linked in: btrfs raid6_pq xor ufs msdos xfs libcrc32c vboxsf(OX) nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache dm_crypt ppdev serio_raw parport_pc parport vboxguest(OX) psmouse ahci libahci e1000 [do dec 4 12:40:23 2014] CPU: 0 PID: 30685 Comm: salt-master Tainted: G D OX 3.13.0-39-generic #66-Ubuntu [do dec 4 12:40:23 2014] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [do dec 4 12:40:23 2014] task: ffff88007b600000 ti: ffff880051caa000 task.ti: ffff880051caa000 [do dec 4 12:40:23 2014] RIP: 0010:[<0000000000000000>] < (null)> [do dec 4 12:40:23 2014] RSP: 0018:ffff880051cabd00 EFLAGS: 00010246 [do dec 4 12:40:23 2014] RAX: ffffffffa01f8140 RBX: 0000000000000000 RCX: 0000000054804828 [do dec 4 12:40:23 2014] RDX: 000000000000017c RSI: ffff880051cabdd0 RDI: ffff88005517c9c0 [do dec 4 12:40:23 2014] RBP: ffff880051cabd88 R08: 0000000054804828 R09: 0000000000000001 [do dec 4 12:40:23 2014] R10: ffffea0000de08c0 R11: ffffffffa009eb4a R12: ffff88007b600000 [do dec 4 12:40:23 2014] R13: ffff880051cabdd0 R14: ffff880051cabd50 R15: ffff88005517c9c0 [do dec 4 12:40:23 2014] FS: 00007f89458fb740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 [do dec 4 12:40:23 2014] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [do dec 4 12:40:23 2014] CR2: 0000000000000000 CR3: 000000007c302000 CR4: 00000000000006f0 [do dec 4 12:40:23 2014] Stack: [do dec 4 12:40:23 2014] ffffffff811cbe09 ffffffff81312ecb ffff88007b600000 ffff880051cabe60 [do dec 4 12:40:23 2014] ffff88007b600000 0000000000000000 ffff88005b80fda8 0000000000000024 [do dec 4 12:40:23 2014] ffff88007a9b3ca0 ffff88005517c9c0 ffff88007a9b3ca0 ffff88005517c9c0 [do dec 4 12:40:23 2014] Call Trace: [do dec 4 12:40:23 2014] [] ? path_lookupat+0x439/0x790 [do dec 4 12:40:23 2014] [] ? apparmor_file_alloc_security+0x5b/0x180 [do dec 4 12:40:23 2014] [] filename_lookup+0x2b/0xc0 [do dec 4 12:40:23 2014] [] user_path_at_empty+0x54/0x90 [do dec 4 12:40:23 2014] [] ? __call_rcu_nocb_enqueue+0xa8/0xc0 [do dec 4 12:40:23 2014] [] user_path_at+0x11/0x20 [do dec 4 12:40:23 2014] [] vfs_fstatat+0x50/0xa0 [do dec 4 12:40:23 2014] [] SYSC_newstat+0x1f/0x40 [do dec 4 12:40:23 2014] [] ? task_work_run+0xac/0xe0 [do dec 4 12:40:23 2014] [] ? do_notify_resume+0x97/0xb0 [do dec 4 12:40:23 2014] [] ? int_signal+0x12/0x17 [do dec 4 12:40:23 2014] [] SyS_newstat+0xe/0x10 [do dec 4 12:40:23 2014] [] system_call_fastpath+0x1a/0x1f [do dec 4 12:40:23 2014] Code: Bad RIP value. [do dec 4 12:40:23 2014] RIP < (null)> [do dec 4 12:40:23 2014] RSP [do dec 4 12:40:23 2014] CR2: 0000000000000000 [do dec 4 12:40:23 2014] ---[ end trace 93a5bb8f5a25fa9b ]---

This is on Ubuntu 14.04 running in Virtualbox (Vagrant) with version details:

       Salt: 2014.7.0
     Python: 2.7.6 (default, Mar 22 2014, 22:59:56)
     Jinja2: 2.7.2
   M2Crypto: 0.21.1

msgpack-python: 0.3.0 msgpack-pure: Not Installed pycrypto: 2.6.1 libnacl: Not Installed PyYAML: 3.10 ioflo: Not Installed PyZMQ: 14.0.1 RAET: Not Installed ZMQ: 4.0.4 Mako: 0.9.1

— Reply to this email directly or view it on GitHub https://github.com/saltstack/salt/issues/18716.

syphernl commented 9 years ago

Some state files were moved from one folder to another. Nothing special, just a bunch of .SLS files.

pitatus commented 9 years ago

Seen before (reproducible)? ..with or without the 'moved some files' activity?

Any idea how much memory is available in the VM, any other apps that may be competing for memory, and if the VM was getting low on available memory or disk space at the time..?

Do you see any other log activity leading up to this event that could be helpful (kern.log, syslog)?

joshland commented 9 years ago

I don't know it if is the same situation. I was able to reproduce this via the following error. I think it's the same error, and I had a bug in my "master" file.

At the very least, this would be something worth reporting more clearly / detecting.

Log: 2015-03-12 21:56:05,018 [salt.utils.process][DEBUG ] Started '<bound method Master._clear_oldjobs of <salt.master.Master object at 0x1daa110>>'([], _{} with pid 29596 2015-03-12 21:56:05,986 [salt.utils.process][INFO ] Process <bound method Master._clear_old_jobs of <salt.master.Master object at 0x1daa110>> (29596) died with exit status None, restarting... 2015-03-12 21:56:06,991 [salt.utils.process][DEBUG ] Started '<bound method Master._clear_oldjobs of <salt.master.Master object at 0x1daa110>>'([], _{} with pid 29622 2015-03-12 22:04:56,794 [salt.utils.process][DEBUG ] Started '<bound method Master._clear_oldjobs of <salt.master.Master object at 0x3316110>>'([], _{} with pid 29778 2015-03-12 22:15:07,835 [salt.utils.process][DEBUG ] Started '<bound method Master._clear_oldjobs of <salt.master.Master object at 0x35410d0>>'([], _{} with pid 30338

I had inadvertently killed the "file_ignore_regex:" stanza. Broken:

hash_type: sha256
  - '/\.git($|/)'

Working:

hash_type: sha256

file_ignore_regex:
   - '/\.git($|/)'
rallytime commented 9 years ago

Thanks for the extra info here @joshland. @syphernl Could that possibly be the same thing you were seeing? Do either of you know if this is still a problem on a newer release? (2014.7.5 is the latest stable right now.)