saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0

[BUG] Master gets disconnected every few seconds when connecting to the minion. #66846

Open Urmila4718 opened 2 weeks ago

Urmila4718 commented 2 weeks ago

Description

I have two master VMs and one minion. While debugging the logs on the minion server, I noticed that the master gets disconnected repeatedly, and these disconnections cause the scheduled jobs to fail. When I checked the master status, it showed the traceback below.

  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/event.py", line 348, in connect_pub
    self.subscriber = salt.transport.ipc_publish_client(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/base.py", line 210, in ipc_publish_client
    return publish_client(opts, io_loop, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/base.py", line 152, in publish_client
    return salt.transport.tcp.PublishClient(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/tcp.py", line 220, in __init__
    super().__init__(opts, io_loop, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/base.py", line 398, in __init__
    super().__init__()

Output of salt-minion -l debug on the minion:
[DEBUG] LazyLoaded state.apply

[DEBUG] Minion of 'xxxx.xxx.xxxx2' is handling event tag '__master_connected'

[DEBUG] Minion of 'xxxx.xxx.xxxx2' is handling event tag '__master_req_channel_payload/xxxx.xxx.xxxx'

[DEBUG] Minion return retry timer set to 8 seconds (randomized)

[DEBUG] Minion of 'xxxx.xxx.xxxx2' is handling event tag '__master_req_channel_payload/xxxx.xxx.xxxx2'

[DEBUG] Minion return retry timer set to 6 seconds (randomized)

[DEBUG] Minion of 'xxxx.xxx.xxxx2' is handling event tag '/salt/minion/minion_schedule_delete_complete'

[DEBUG] Minion of 'xxxx.xxx.xxxx2' is handling event tag '/salt/minion/minion_schedule_delete_complete'

[DEBUG] The functions from module 'mine' are being loaded by dir() on the loaded module

[DEBUG] LazyLoaded mine.update

[DEBUG] schedule.handle_func: adding this job to the jobcache with data {'id': 'xxxx.xxx.xxxx', 'fun': 'mine.update', 'fun_args': [], 'schedule': '__mine_interval', 'jid': '20240827102032661799', 'pid': 2024}

[DEBUG] The functions from module 'config' are being loaded by dir() on the loaded module

[DEBUG] LazyLoaded config.merge

[DEBUG] schedule.handle_func: Removing C:\ProgramData\Salt Project\Salt\var\cache\salt\minion\proc\20240827102032661799

[DEBUG] Subprocess Schedule(name=__mine_interval, jid=20240827102032661799) cleaned up

[DEBUG] schedule: Job __master_alive_xxxx.xxx.xxxx1 was scheduled with jid_include, adding to cache (jid_include defaults to True)

[DEBUG] schedule: Job __master_alive_xxxx.xxx.xxxx1 was scheduled with a max number of 1

[INFO] Running scheduled job: __master_alive_xxxx.xxx.xxxx1 with jid 20240827102132167770

[DEBUG] Subprocess Schedule(name=__master_alive_xxxx.xxx.xxxx1, jid=20240827102132167770) added

[DEBUG] schedule: Job __master_alive_xxxx.xxx.xxxx1 was scheduled with jid_include, adding to cache (jid_include defaults to True)

[DEBUG] schedule: Job __master_alive_xxxx.xxx.xxxx2 was scheduled with a max number of 1

[INFO] Running scheduled job: __master_alive_xxxx.xxx.xxxx2 with jid 20240827102132449051

[DEBUG] Subprocess Schedule(name=__master_alive_xxxx.xxx.xxxx2, jid=20240827102132449051) added

[DEBUG] schedule: Job __master_failback was scheduled with jid_include, adding to cache (jid_include defaults to True)

[DEBUG] schedule: Job __master_failback was scheduled with a max number of 1

[INFO] Running scheduled job: __master_failback with jid 20240827102132792819

[DEBUG] Subprocess Schedule(name=__master_failback, jid=20240827102132792819) added

[DEBUG] The functions from module 'statuspage' are being loaded by dir() on the loaded module

[DEBUG] The functions from module 'statuspage' are being loaded by dir() on the loaded module

[DEBUG] The functions from module 'status' are being loaded by dir() on the loaded module

[DEBUG] LazyLoaded status.master

[DEBUG] schedule.handle_func: adding this job to the jobcache with data {'id': 'xxxx.xxx.xxxx', 'fun': 'status.master', 'fun_args': [{'connected': True, 'master': 'xxxx.xxx.xxxx1'}], 'schedule': '__master_alive_xxxx.xxx.xxxx1', 'jid': '20240827102132167770', 'pid': 3984}

[DEBUG] The functions from module 'config' are being loaded by dir() on the loaded module

[DEBUG] LazyLoaded config.get

[DEBUG] Using selector: SelectSelector

[DEBUG] Popen(['git', 'version'], cwd=C:\Users\Administrator, stdin=None, shell=False, universal_newlines=False)

[DEBUG] Using selector: SelectSelector

[DEBUG] Publisher connecting to 127.0.0.1:4511

[DEBUG] The functions from module 'status' are being loaded by dir() on the loaded module

[DEBUG] LazyLoaded status.master

[DEBUG] schedule.handle_func: adding this job to the jobcache with data {'id': 'xxxx.xxx.xxxx', 'fun': 'status.master', 'fun_args': [{'master': 'xxxx.xxx.xxxx2', 'connected': True}], 'schedule': '__master_alive_salt-xxxx.xxx.xxxx2', 'jid': '20240827102132449051', 'pid': 6436}

[DEBUG] The functions from module 'config' are being loaded by dir() on the loaded module

[DEBUG] LazyLoaded config.get

[DEBUG] Closing _TCPPubServerPublisher instance

[DEBUG] Minion of 'xxxx.xxx.xxxx2' is handling event tag '__master_disconnected'

[DEBUG] Using selector: SelectSelector

[DEBUG] The functions from module 'statuspage' are being loaded by dir() on the loaded module

[DEBUG] schedule.handle_func: Removing C:\ProgramData\Salt Project\Salt\var\cache\salt\minion\proc\20240827102132167770

[DEBUG] Popen(['git', 'version'], cwd=C:\Users\Administrator, stdin=None, shell=False, universal_newlines=False)

[DEBUG] Using selector: SelectSelector

[DEBUG] Publisher connecting to 127.0.0.1:4511

[DEBUG] The functions from module 'status' are being loaded by dir() on the loaded module

[DEBUG] Closing _TCPPubServerPublisher instance

[DEBUG] Minion of 'xxxx.xxx.xxxx2' is handling event tag '__master_disconnected'

[INFO] Connection to master xxxx.xxx.xxxx2 lost

[DEBUG] Using selector: SelectSelector

[DEBUG] Using selector: SelectSelector
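
To quantify how often the disconnects happen, the debug output above could be tallied with a small script. The following is only a sketch: the regular expressions assume the exact line format quoted in this issue, and `summarize_master_events` is a hypothetical helper written for illustration, not part of Salt.

```python
import re
from collections import Counter

# Patterns based on the debug lines quoted in this issue (an assumption:
# other log formats, e.g. with timestamps, may need adjusted patterns).
EVENT_RE = re.compile(
    r"Minion of '?([^']+)' is handling event tag "
    r"'(__master_(?:connected|disconnected))'"
)
LOST_RE = re.compile(r"Connection to master (\S+) lost")


def summarize_master_events(lines):
    """Count connect/disconnect events and lost connections per master."""
    events = Counter()
    lost = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            events[(match.group(1), match.group(2))] += 1
            continue
        match = LOST_RE.search(line)
        if match:
            lost[match.group(1)] += 1
    return events, lost


# A few lines copied from the log above, used as sample input.
sample = [
    "[DEBUG] Minion of 'xxxx.xxx.xxxx2' is handling event tag '__master_connected'",
    "[DEBUG] Closing _TCPPubServerPublisher instance",
    "[DEBUG] Minion of 'xxxx.xxx.xxxx2' is handling event tag '__master_disconnected'",
    "[DEBUG] Minion of 'xxxx.xxx.xxxx2' is handling event tag '__master_disconnected'",
    "[INFO] Connection to master xxxx.xxx.xxxx2 lost",
]

events, lost = summarize_master_events(sample)
for (master, tag), count in sorted(events.items()):
    print(f"{master} {tag}: {count}")
for master, count in sorted(lost.items()):
    print(f"{master} connection lost: {count}")
```

To run it against a full log, the `sample` list can be replaced with the lines of a file holding the captured debug output.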

Setup

Master configuration (/etc/salt/master)

interface: 10.166.145.32
file_roots:
  base:
    - /srv/salt/base
  dev:
    - /srv/salt/dev

pillar_roots:  
  base:
    - /srv/pillar

Minion configuration

master:
    - xxxx.xxx.xxxx1
    - xxxx.xxx.xxxx2
file_client: remote
master_finger: 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
verify_master_pubkey_sign: True 
always_verify_signature: True
master_type: failover
random_master: True
master_alive_interval: 60
retry_dns_count: 3
retry_dns: 0
master_tries: -1
master_failback: True
autosign_grains:
  - uuid

Steps to Reproduce the behavior

Use the basic test setup above, then restart the service or restart the minion machine.

Expected behavior

When I run salt-minion -l debug, the minion should show the master as connected, and the scheduled jobs from the master should run.

Versions Report

Master:

Salt Version:
Salt: 3007.1

Python Version:
Python: 3.10.14 (main, Apr 3 2024, 21:30:09) [GCC 11.2.0]

Dependency Versions:
cffi: 1.16.0
cherrypy: unknown
dateutil: 2.8.2
docker-py: Not Installed
gitdb: 4.0.11
gitpython: 3.1.43
Jinja2: 3.1.4
libgit2: 1.7.1
looseversion: 1.3.0
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.7
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 23.1
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.19.1
pygit2: 1.13.1
python-gnupg: 0.5.2
PyYAML: 6.0.1
PyZMQ: 25.1.2
relenv: 0.16.0
smmap: 5.0.1
timelib: 0.3.0
Tornado: 6.3.3
ZMQ: 4.3.4

Salt Package Information:
Package Type: onedir

System Versions:
dist: ubuntu 22.04.4 jammy
locale: utf-8
machine: x86_64
release: 5.15.0-117-generic
system: Linux
version: Ubuntu 22.04.4 jammy

Minion:

Salt Version:
Salt: 3007.1

Python Version:
Python: 3.10.14 (heads/main, Apr 3 2024, 21:36:37) [MSC v.1938 64 bit (AMD64)]

Dependency Versions:

cffi: 1.16.0
cherrypy: 18.8.0
dateutil: 2.8.2
docker-py: Not Installed
gitdb: 4.0.10
gitpython: Not Installed
Jinja2: 3.1.4
libgit2: Not Installed
looseversion: 1.3.0
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.7
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 23.1
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.19.1
pygit2: Not Installed
python-gnupg: 0.5.2
PyYAML: 6.0.1
PyZMQ: 25.1.2
relenv: 0.16.0
smmap: 5.0.1
timelib: 0.3.0
Tornado: 6.3.3
ZMQ: 4.3.4

Salt Package Information:
Package Type: onedir

System Versions:

dist:
locale: utf-8
machine: AMD64
release: 2022Server
system: Windows
version: 2022Server 10.0.20348 SP0 Multiprocessor Free

Additional context

dwoz commented 1 week ago

@Urmila4718 Can you provide us with the complete traceback?

Urmila4718 commented 5 days ago

Hi @dwoz, we tested three minions connected to a single master, and it worked fine. The problem arises in the multi-master setup when testing scheduled tasks. We tested with both 3006.9 and 3007.1, keeping the master and minion on the same version. We also tried 3006.9 on the master with 3007.1 on the minion, after someone in https://github.com/saltstack/salt/issues/65265 suggested that this combination works for them, but we still get the same issue.

salt-master: 3006.9/3007.1
salt-minion: 3006.9/3007.1