saltstack / salt


[DOCS] Master Cluster Tutorial - Suggested HAProxy Timeout Settings Cause Unstable Communication #66888

Open: clayoster opened this issue 2 months ago

clayoster commented 2 months ago

Description:

I have been testing the suggested HAProxy configuration from the Master Cluster tutorial and have found that the suggested client/server timeout values of 1m cause unstable minion communication, specifically with the publisher port (4505).

I am using the default transport with 3 masters and 50 minions all running 3007.1. My HAProxy version is 2.6.12-1 (Debian 12), though I have tested older and newer versions with the same results.

Adjusting TCP keepalive values on the masters and minions does not seem to stop HAProxy from closing TCP sessions after 1 minute of inactivity. Reducing tcp_keepalive_idle does, however, speed up how quickly minions reconnect after HAProxy closes the connection.

It seems that no matter how frequently the master and minions send keepalives, HAProxy will close the sessions after 1 minute if no data is sent through the session. If I run something like salt '*' test.ping every 30 seconds, this keeps the sessions with the publisher port alive for longer than 1 minute.
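One reading consistent with this observation is that TCP keepalive probes are generated and answered by the kernel and carry no application payload, so an inactivity timer that tracks data passing through the session would not see them. The sketch below shows what enabling keepalive on a plain socket looks like in Python. The function name and the default values are hypothetical, and the assumption that Salt's tcp_keepalive_* settings end up as socket options like these is mine, not something stated in this issue.

```python
import socket


def enable_keepalive(sock, idle=30, interval=10, count=3):
    """Turn on TCP keepalive for a socket and report whether it is enabled.

    idle, interval and count are illustrative values, not Salt defaults.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The tuning options are platform-specific (present on Linux), so each
    # one is applied only if this Python build exposes it.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before the peer is declared dead
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)

    # The probes are sent by the kernel and contain no application data.
    return bool(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        print("keepalive enabled:", enable_keepalive(s))
    finally:
        s.close()
```

If that reading is right, shortening the keepalive idle time helps a peer notice a dead connection sooner, which fits the faster reconnects described above, but it would not by itself count as traffic for the proxy's timeout.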

To Reproduce:

I am currently using timeout values of 12h on the publisher and request server ports to reduce how often TCP sessions are killed off. While this probably isn't the best solution, it does keep minion communication stable, as it greatly reduces how often minions have to re-establish their connection with the master.

Suggested Fix:

Is there other configuration expected to be set on the master and minion to allow stable minion communication with the suggested 1 minute timeouts in HAProxy?

Type of documentation: Tutorial

Location or format of documentation: https://docs.saltproject.io/en/latest/topics/tutorials/master-cluster.html

dwoz commented 2 months ago

Yes, the timeouts for the publish port should match the 'publish_session' value. The docs need to be updated; this has been in my backlog for some time.
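As a rough illustration of matching the two values, the sketch below turns a publish_session value in seconds into HAProxy timeout directives. The helper names are hypothetical, the 86400-second default is taken from the "24 hours" mentioned in this thread, and the directives are meant for the publisher (4505) frontend and backend only; treat it as a sketch rather than the documented procedure.

```python
def haproxy_duration(seconds: int) -> str:
    """Format a positive number of seconds using HAProxy's d/h/m/s suffixes."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Prefer the largest unit that divides the value evenly.
    for suffix, size in (("d", 86400), ("h", 3600), ("m", 60)):
        if seconds % size == 0:
            return f"{seconds // size}{suffix}"
    return f"{seconds}s"


def publisher_timeouts(publish_session: int = 86400) -> list[str]:
    """Return the client/server timeout lines for the publisher port.

    publish_session is the Salt master setting, in seconds (assumed here to
    default to 24 hours, as stated in the thread).
    """
    duration = haproxy_duration(publish_session)
    return [f"timeout client {duration}", f"timeout server {duration}"]


if __name__ == "__main__":
    for line in publisher_timeouts():
        print(line)
    # prints:
    # timeout client 1d
    # timeout server 1d
```

With the 12h workaround from the original report, `publisher_timeouts(43200)` would give `timeout client 12h` and `timeout server 12h`.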

clayoster commented 2 months ago

@dwoz Would you recommend leaving publish_session at the default of 24 hours, or setting it to something lower? Additionally, would it be a good idea to set ping_on_rotate: True in this configuration?

anthonyra commented 2 months ago

I'm also trying to follow this tutorial and can't get the minion to "sign_in" with one of the masters, even if I run the salt-minion directly against the salt-master in question. @clayoster, do your minion configs simply have master: salt.example.com, or are other settings needed?

clayoster commented 2 months ago

@anthonyra Correct, that is the only master-related setting in my minion config file. Using your example, salt.example.com points to my HAProxy server which load balances connections to 3 masters. My HAProxy config follows the example in the master cluster tutorial aside from using 12h instead of 1m for the timeout values as mentioned in this issue.

Do the minion logs give an indication of what the issue may be?

anthonyra commented 2 months ago

I think it was a combination of two things: the default master config sets the user to salt while the minion sets the user to root, and the bootstrap script auto-starts the daemons (so when I changed the config I only started the service instead of restarting it). I was able to get it working. Thank you for the help and feedback!