When sending large event data from a scheduled highstate back to the master over the IPC socket, the salt-minion child process becomes defunct and we see over 1600 connections to the master (port 4506) in TIME_WAIT. Running salt-minion at trace log level, we see the following before the hang:
[DEBUG ] SaltEvent PUB socket URI: /local/opt/configmgt/salt/var/run/salt/minion/minion_event_e4a410a83f_pub.ipc
[DEBUG ] SaltEvent PULL socket URI: /local/opt/configmgt/salt/var/run/salt/minion/minion_event_e4a410a83f_pull.ipc
[TRACE ] IPCClient: Connecting to socket: /local/opt/configmgt/salt/var/run/salt/minion/minion_event_e4a410a83f_pull.ipc
[DEBUG ] Sending event: tag = __master_req_channel_payload; data = {'cmd': '_return', 'id': 'minion1234', 'fun': 'highstate.run', 'fun_args': ['minion_start'], 'schedule': 'highstate__minion_start', 'jid': '20240508161342306002', 'pid': 10847, 'return': {'highstate_type': 'minion_start', 'salt_boot_done': 'Tuesday February 01 2022 14:53', 'start_time': '2024-05-08T16:13:42.411586', 'file_|-bb.patch ...'retcode': 0, 'success': True, '_stamp': '2024-05-08T16:14:12.159598', 'out': 'highstate'}
<hang>
^C
[TRACE ] Waiting to kill process manager children
[DEBUG ] Closing IPCMessageClient instance
[DEBUG ] Closing IPCMessageSubscriber instance
[WARNING ] Minion received a SIGINT. Exiting.
[INFO ] Shutting down the Salt Minion
[TRACE ] Processing <function DaemonMixIn._mixin_before_exit at 0x7f3998635280>
[TRACE ] Processing <function LogLevelMixIn.__shutdown_logging at 0x7f3998630ca0>
The Salt Minion is shutdown. Minion received a SIGINT. Exited.
The minion failed to return the job information for job 20240520183520712612. This is often due to the master being shut down or overloaded. If the master is running, consider increasing the worker_threads value.
Future <salt.ext.tornado.concurrent.Future object at 0x7f3988a36760> exception was never retrieved: Traceback (most recent call last):
File "/local/opt/saltcrystal/lib/python3.9/site-packages/salt/ext/tornado/gen.py", line 309, in wrapper
yielded = next(result)
File "/local/opt/saltcrystal/lib/python3.9/site-packages/salt/minion.py", line 2921, in handle_event
self._return_pub(data, ret_cmd="_return", sync=False)
File "/local/opt/saltcrystal/lib/python3.9/site-packages/salt/minion.py", line 2263, in _return_pub
log.trace("ret_val = %s", ret_val) # pylint: disable=no-member
UnboundLocalError: local variable 'ret_val' referenced before assignment
^CMinion received a SIGINT. Exiting.
The Salt Minion is shutdown. Minion received a SIGINT. Exited.
^CMinion received a SIGINT. Exiting.
The Salt Minion is shutdown. Minion received a SIGINT. Exited.
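The traceback shows the `log.trace("ret_val = %s", ret_val)` line firing on a code path where `ret_val` was never assigned. A minimal sketch of that failure pattern (hypothetical code, not Salt's actual `_return_pub`):

```python
# Minimal sketch (hypothetical -- not Salt's actual code) of the pattern behind
# the UnboundLocalError above: "ret_val" is bound on only one branch, so the
# trailing trace-log line can reference the variable before assignment.
def return_pub_sketch(data, sync=True):
    if sync:
        ret_val = {"sent": data}       # only the sync path binds ret_val
    # the async (sync=False) path falls through without binding ret_val ...
    return "ret_val = %s" % ret_val    # UnboundLocalError when sync=False


try:
    return_pub_sketch({"cmd": "_return"}, sync=False)
except UnboundLocalError as exc:
    print("reproduced:", exc)
```

This matches the observed behavior: the error only surfaces on the async return path (`sync=False` in `handle_event`), which is why it appears after the event send stalls rather than on every highstate.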
And the process's Thread 1 is stuck on a lock in do_futex_wait.constprop.1():
[root@d400241-080 gs.d]# pstack 4034|grep -A10 ^"Thread 1"
Thread 1 (Thread 0x7fba8285d740 (LWP 4034)):
#0 0x00007fba82220b3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1 0x00007fba82220bcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2 0x00007fba82220c6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3 0x00000000005413a9 in PyThread_acquire_lock_timed (lock=lock@entry=0x7fba50000dc0, microseconds=microseconds@entry=-1000000, intr_flag=intr_flag@entry=1) at Python/thread_pthread.h:483
#4 0x000000000059c614 in acquire_timed (timeout=-1000000000, lock=0x7fba50000dc0) at ./Modules/_threadmodule.c:63
#5 lock_PyThread_acquire_lock (self=0x7fba60ec0ea0, args=<optimized out>, kwds=<optimized out>) at ./Modules/_threadmodule.c:146
#6 0x00000000005dd63b in method_vectorcall_VARARGS_KEYWORDS (func=0x7fba82806540, args=0x7fba61f331d0, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/descrobject.c:348
#7 0x0000000000425303 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=<optimized out>, tstate=<optimized out>) at ./Include/cpython/abstract.h:118
#8 PyObject_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=<optimized out>) at ./Include/cpython/abstract.h:127
#9 trace_call_function (kwnames=<optimized out>, nargs=<optimized out>, args=<optimized out>, func=<optimized out>, tstate=<optimized out>) at Python/ceval.c:5058
[root@d400241-080 gs.d]# pstack 4034|grep -A3 ^"Thread"
Thread 7 (Thread 0x7fba70cbe700 (LWP 4036)):
#0 0x00007fba81831b43 in select () from /lib64/libc.so.6
#1 0x0000000000599e02 in pysleep (secs=<optimized out>) at ./Modules/timemodule.c:2036
#2 time_sleep (self=<optimized out>, obj=<optimized out>) at ./Modules/timemodule.c:365
--
Thread 6 (Thread 0x7fba61723700 (LWP 9589)):
#0 0x00007fba8183b0e3 in epoll_wait () from /lib64/libc.so.6
#1 0x00007fba72f4125f in ?? () from /local/opt/saltcrystal/lib/python3.9/site-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-f3e05bef.so.5.2.3
#2 0x00007fba72f608a9 in ?? () from /local/opt/saltcrystal/lib/python3.9/site-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-f3e05bef.so.5.2.3
--
Thread 5 (Thread 0x7fba61f24700 (LWP 9590)):
#0 0x00007fba8183b0e3 in epoll_wait () from /lib64/libc.so.6
#1 0x00007fba72f4125f in ?? () from /local/opt/saltcrystal/lib/python3.9/site-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-f3e05bef.so.5.2.3
#2 0x00007fba72f608a9 in ?? () from /local/opt/saltcrystal/lib/python3.9/site-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-f3e05bef.so.5.2.3
--
Thread 4 (Thread 0x7fba60d22700 (LWP 9602)):
#0 0x00007fba8183b0e3 in epoll_wait () from /lib64/libc.so.6
#1 0x00007fba72f4125f in ?? () from /local/opt/saltcrystal/lib/python3.9/site-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-f3e05bef.so.5.2.3
#2 0x00007fba72f608a9 in ?? () from /local/opt/saltcrystal/lib/python3.9/site-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-f3e05bef.so.5.2.3
--
Thread 3 (Thread 0x7fba5bfff700 (LWP 9603)):
#0 0x00007fba8183b0e3 in epoll_wait () from /lib64/libc.so.6
#1 0x00007fba72f4125f in ?? () from /local/opt/saltcrystal/lib/python3.9/site-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-f3e05bef.so.5.2.3
#2 0x00007fba72f608a9 in ?? () from /local/opt/saltcrystal/lib/python3.9/site-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-f3e05bef.so.5.2.3
--
Thread 2 (Thread 0x7fba5affd700 (LWP 16840)):
#0 0x00007fba8183b0e3 in epoll_wait () from /lib64/libc.so.6
#1 0x00007fba7a14d1ea in select_epoll_poll_impl (self=0x7fba687eca10, maxevents=1023, timeout_obj=<optimized out>) at /builds/cme/saltcrystal/build/downloads/Python-3.9.16/Modules/selectmodule.c:1613
#2 select_epoll_poll (self=0x7fba687eca10, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at /builds/cme/saltcrystal/build/downloads/Python-3.9.16/Modules/clinic/selectmodule.c.h:871
--
Thread 1 (Thread 0x7fba8285d740 (LWP 4034)):
#0 0x00007fba82220b3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1 0x00007fba82220bcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2 0x00007fba82220c6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
[root@d400241-080 gs.d]#
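The ~1600 TIME_WAIT connections to port 4506 can be counted without extra tooling by reading /proc/net/tcp (Linux only; a diagnostic sketch, not part of Salt):

```python
# Count connections to the master's return port (4506, Salt's default) that
# are in TIME_WAIT, by parsing /proc/net/tcp. State code 0x06 is TIME_WAIT;
# the remote port is the hex field after the colon in rem_address.
def count_time_wait(port=4506):
    count = 0
    with open("/proc/net/tcp") as fh:
        next(fh)                                    # skip the header row
        for line in fh:
            fields = line.split()
            rem_port = int(fields[2].split(":")[1], 16)
            state = fields[3]
            if rem_port == port and state == "06":  # 06 == TIME_WAIT
                count += 1
    return count


print(count_time_wait())
```

In the failing case this climbs past 1600 while the minion child is defunct.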
Description
Strace output before the hang:
When the highstate event doesn't hang (i.e., doesn't contain a large amount of data), we see the following instead:
More debugging info from minions running 3005.1 and 3007.0 is attached. Both of these versions hang the same way, but 3002.2 works just fine.
We see no issues when switching ipc_mode to 'tcp' or setting ipc_write_buffer to something below 1000 in the minion configuration.
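Either workaround corresponds to a minion config change like the following (a sketch; use one or the other, values as described above):

```yaml
# Workaround 1: move minion-local events off IPC sockets onto TCP.
ipc_mode: tcp

# Workaround 2 (alternative): cap the IPC write buffer below 1000.
# ipc_write_buffer: 500
```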
Additional context
Might be related to https://github.com/saltstack/salt/issues/65940