damntoken closed this issue 4 years ago.
ping @saltstack/team-windows any ideas here?
Some ideas:
This is an issue with the way Python forks processes on Windows. @dwoz This is the issue I was talking to you about a while back.
My understanding is that there is a timeout, and that the minion is meant to respond with "I am working on it, please wait". I suspect this is not happening on Windows, as we should not need to use the -t option. See also https://github.com/saltstack/salt/issues/48470. A better solution may be not to fork at all, but to call something like "salt-call", which reads grains etc. from a file created when the minion started (if this is quicker), and pass the results back.
We are having the same issues on Windows.
We had some critical issues as the Windows minions were taking more than 10s to answer the find_job request from the master, and the salt CLI ended up returning a blank job result (like for an empty state.apply) without waiting for it to end, like this:
```
Executing run on [u'bcn6-web-staging-1']
bcn6-web-staging-1:
----------
```

Nothing else is printed.
We had to increase gather_job_timeout to 30s to fix the issue in our case. For 2 VMs on the same host answering a test.ping, there is a huge difference:

Windows: 16s
Linux: 4s
One difference in my case is that the minions are running Python 3 on Windows
Been dealing with this behavior on Windows for a while now... I have yet to see anything improve response times.
I've been testing Salt in a primarily Windows environment for a few days. x64 Py3 Windows minion 2018.3.2.
Excluding C:\salt doesn't seem to help. Commands still time out.
Turning off everything related to Windows Defender DOES seem to get the response time down to within the normal timeout period, but that seems like a poor solution in any case.
It sounds a lot like adding path exceptions to Windows Defender might stop on-access scanning within those paths, but doesn't exclude processes launched from those paths from other types of scanning. You might get similar symptoms from poor disk throughput, though. Are you using shared storage for the system drive?
Current testing is limited to a pair of VMs running on virtualbox (master and a server 2016 minion).
Those are just backed by a consumer-grade SSD.
How is your CPU allocated to the VMs? I imagine that if you are using a type 2 hypervisor you could end up with some weird latency/timeout issues if CPU cores are overcommitted (i.e. you have 4 CPU cores available, and both of these VMs are assigned 4 CPU cores).
4 Cores available - 1 assigned to salt master, 2 assigned to windows server VM
Very slow response on Win2016 from Azure. We must increase the timeout to 120s, otherwise the minion does not return and a "Minion did not return. [No response]" error appears.
I have an idea on how to fix it, i.e. stop forking for find_job. I looked into this on Linux, to see if I could reduce the forking.
In salt/minion.py, try 1 was:
```python
if isinstance(data['fun'], six.string_types):
    if data['fun'] == 'sys.reload_modules':
        self.functions, self.returners, self.function_errors, self.executors = self._load_modules()
        self.schedule.functions = self.functions
        self.schedule.returners = self.returners
    elif data['fun'] == 'saltutil.find_job':
        # Do not muck around, just find and return the data.
        # This is what find_job does; salt.utils.minion is already imported.
        data['ret'] = {}
        for running_data in salt.utils.minion.running(self.opts):
            if running_data['jid'] == data['jid']:
                data['ret'] = running_data['jid']
                break
        self._target(self, self.opts, data, self.connected)  # should be calling something else
```
Could not get the above to work, i.e. it needs someone who knows this code better.
Try 2 was in _handle_decoded_payload:
```python
instance = self
multiprocessing_enabled = self.opts.get('multiprocessing', True)
# For often-called, fast-returning functions, save forking, esp. on Windows.
if isinstance(data['fun'], six.string_types) and data['fun'] in ['saltutil.find_job']:
    multiprocessing_enabled = False
```
The above resulted in a single fork instead of two on Linux.
Looking at the code, Windows in general appears to have checks to try to stop it forking as much as Linux does, since Linux is a lot faster at forking.
Option 1 would be best for Windows.
Can someone who knows minion.py better have a look at this? Maybe yourself, @Ch3LL.
Any update? Do you plan to fix it? Working with more than 2 minions on Win2016 is impossible because of timeouts. @Ch3LL
salt-call is faster on windows than going via the salt-master.
It should be the number one priority for Windows. Unfortunately I do not know the code in salt/minion.py well enough to fix it. I believe the fix is to pull the find_job code into minion.py so that no fork or extra thread is required for a response. Maybe @cachedout can have a go. I could not get https://github.com/saltstack/salt/issues/48882#issuecomment-438294219 to avoid forking or creating a thread; I could not work out how to do a direct return of data.
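The "pull find_job into minion.py" idea can be sketched in isolation. This is a minimal illustration, not Salt's actual API: `running_jobs` stands in for what `salt.utils.minion.running(self.opts)` returns, and a real fix would still need to route the result back through the minion's return channel.

```python
# Minimal sketch of answering saltutil.find_job inline, with no fork or
# thread. `running_jobs` stands in for salt.utils.minion.running(opts);
# all names here are illustrative.
def find_job_inline(running_jobs, target_jid):
    """Return the running job whose jid matches, else an empty dict."""
    for job in running_jobs:
        if job.get('jid') == target_jid:
            return job
    return {}

# Example: one job currently running on the minion.
jobs = [{'jid': '20160503150049487736', 'fun': 'state.apply', 'pid': 1270}]
print(find_job_inline(jobs, '20160503150049487736'))  # the matching job dict
print(find_job_inline(jobs, '999'))                   # {}
```

The loop itself is trivial; the hard part (which the snippets above ran into) is returning the data to the master without going through the normal fork/thread job machinery.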
Why not use multithreading? That should be much faster, as most of the code would already be loaded.
Threading seems to be expensive as well:
```
time salt -t 120 ServerName saltutil.find_job 20160503150049487736  # job ID made up
ServerName:
----------

real    0m18.273s
user    0m5.131s
sys     0m0.542s
```
The three important lines; note the 10-second gap:

```
2018-11-22 23:23:18,929 12704 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion.d\providers.conf
2018-11-22 23:23:27,871 12704 1114 [salt.minion .__init__][INFO ] Creating minion process manager
2018-11-22 23:23:28,031 12704 1563 [salt.minion ._thread_return][INFO ] Starting a new job with PID 1270
```
The first number is the PID; the next number is the source line number. Full log of the request:
```
2018-11-22 23:23:16,053 2072 1428 [salt.minion ._handle_decoded_payload][INFO ] User root Executing command saltutil.find_job with jid 20181122232317340079
2018-11-22 23:23:16,053 2072 1435 [salt.minion ._handle_decoded_payload][DEBUG ] Command details {'fun': 'saltutil.find_job', 'tgt_type': 'glob', 'tgt': 'ServerNameHere', 'arg': ['20160503150049487736'], 'user': 'root', 'jid': '20181122232317340079', 'ret': ''}
2018-11-22 23:23:18,791 12704 890 [salt.log.setup .setup_multiprocessing_logging][DEBUG ] Multiprocessing queue logging configured for the process running under PID: 12704
2018-11-22 23:23:18,795 12704 697 [salt.loader .grains][DEBUG ] Grains refresh requested. Refreshing grains.
2018-11-22 23:23:18,796 12704 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion
2018-11-22 23:23:18,921 12704 2260 [salt.config .include_config][DEBUG ] Including configuration from 'c:\salt\conf\minion.d\_schedule.conf'
2018-11-22 23:23:18,921 12704 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion.d\_schedule.conf
2018-11-22 23:23:18,925 12704 2260 [salt.config .include_config][DEBUG ] Including configuration from 'c:\salt\conf\minion.d\minion_push.conf'
2018-11-22 23:23:18,926 12704 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion.d\minion_push.conf
2018-11-22 23:23:18,929 12704 2260 [salt.config .include_config][DEBUG ] Including configuration from 'c:\salt\conf\minion.d\providers.conf'
2018-11-22 23:23:18,929 12704 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion.d\providers.conf
2018-11-22 23:23:27,871 12704 1114 [salt.minion .__init__][INFO ] Creating minion process manager
2018-11-22 23:23:28,031 12704 1563 [salt.minion ._thread_return][INFO ] Starting a new job with PID 12704
2018-11-22 23:23:28,100 12704 100 [salt.utils.lazy .__getitem__][DEBUG ] LazyLoaded saltutil.find_job
2018-11-22 23:23:28,101 12704 100 [salt.utils.lazy .__getitem__][DEBUG ] LazyLoaded direct_call.execute
2018-11-22 23:23:28,102 12704 755 [salt.minion ._return_retry_timer][DEBUG ] Minion return retry timer set to 6 seconds (randomized)
2018-11-22 23:23:28,103 12704 1860 [salt.minion ._return_pub][INFO ] Returning information for job: 20181122232317340079
2018-11-22 23:23:28,281 12704 128 [salt.transport.zeromq.__new__][DEBUG ] Initializing new AsyncZeroMQReqChannel for ('c:\\salt\\conf\\pki\\minion', 'ServerNameHere', 'tcp://ServerNameHere:4506', 'aes')
2018-11-22 23:23:28,282 12704 463 [salt.crypt .__new__][DEBUG ] Initializing new AsyncAuth for ('c:\\salt\\conf\\pki\\minion', 'digit.lakeside.lan', 'tcp://ServerNameHere:4506')
2018-11-22 23:23:28,306 12704 194 [salt.transport.zeromq.__singleton_init__][DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://ServerNameHere:4506
2018-11-22 23:23:28,343 12704 1029 [salt.transport.zeromq._init_socket][DEBUG ] Trying to connect to: tcp://192.168.53.3:4506
2018-11-22 23:23:28,345 12704 128 [salt.transport.zeromq.__new__][DEBUG ] Initializing new AsyncZeroMQReqChannel for ('c:\\salt\\conf\\pki\\minion', 'ServerNameHere', 'tcp://ServerNameHere:4506', 'clear')
2018-11-22 23:23:28,345 12704 194 [salt.transport.zeromq.__singleton_init__][DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://ServerNameHere:4506
2018-11-22 23:23:28,381 12704 1029 [salt.transport.zeromq._init_socket][DEBUG ] Trying to connect to: tcp://192.168.53.3:4506
2018-11-22 23:23:28,381 12704 207 [salt.crypt .get_rsa_pub_key][DEBUG ] salt.crypt.get_rsa_pub_key: Loading public key
2018-11-22 23:23:28,494 12704 847 [salt.crypt .decrypt_aes][DEBUG ] Decrypting the current master AES key
2018-11-22 23:23:28,495 12704 199 [salt.crypt .get_rsa_key][DEBUG ] salt.crypt.get_rsa_key: Loading private key
2018-11-22 23:23:28,495 12704 179 [salt.crypt ._get_key_with_evict][DEBUG ] salt.crypt._get_key_with_evict: Loading private key
2018-11-22 23:23:28,497 12704 776 [salt.crypt .get_keys][DEBUG ] Loaded minion key: c:\salt\conf\pki\minion\minion.pem
2018-11-22 23:23:28,576 12704 207 [salt.crypt .get_rsa_pub_key][DEBUG ] salt.crypt.get_rsa_pub_key: Loading public key
```
Also keep in mind that find_job is called about every 5 to 15 seconds while a request is running, and it takes about 10 seconds for it to be processed.
More digging on Windows. It seems salt.loader.grains(opts) is called on every request, so I suspect the more grains there are, the slower this will be. On Linux this only runs once, at startup.
I suspect minions need to save the collected grains to a file (using, say, msgpack) and reload them, and maybe refresh the grains once an hour? As more grains are added, Windows will get slower and slower.
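A rough sketch of that caching idea. This is illustrative only (JSON is used here instead of msgpack so the snippet has no dependencies; the cache path, refresh interval, and `collect_grains` callable are all assumptions, not Salt code):

```python
import json
import os
import tempfile
import time

# Illustrative location and staleness window; real code would live in the
# minion's cache directory (e.g. c:\salt\var\cache on Windows).
CACHE_FILE = os.path.join(tempfile.gettempdir(), "grains_cache.json")
MAX_AGE_SECONDS = 3600  # refresh grains at most once an hour

def load_grains(collect_grains):
    """Return grains from the cache file, re-collecting only when stale."""
    try:
        if time.time() - os.path.getmtime(CACHE_FILE) < MAX_AGE_SECONDS:
            with open(CACHE_FILE) as fh:
                return json.load(fh)
    except OSError:
        pass  # no cache yet, or unreadable: fall through and re-collect
    grains = collect_grains()  # the slow part (~10s on Windows per the logs)
    with open(CACHE_FILE, "w") as fh:
        json.dump(grains, fh)
    return grains
```

The first call pays the full collection cost; every call within the next hour reads the file instead, which is roughly what the `grains_cache` minion option (mentioned further down) does.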
```
2018-11-23 00:03:51,386 9056 1102 [salt.minion .__init__][DEBUG ] Minion(MinionBase) call salt.loader.grains(opts)
2018-11-23 00:03:51,386 9056 697 [salt.loader .grains][DEBUG ] Grains refresh requested. Refreshing grains.
2018-11-23 00:03:51,387 9056 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion
2018-11-23 00:03:51,518 9056 2260 [salt.config .include_config][DEBUG ] Including configuration from 'c:\salt\conf\minion.d\_schedule.conf'
2018-11-23 00:03:51,519 9056 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion.d\_schedule.conf
2018-11-23 00:03:51,523 9056 2260 [salt.config .include_config][DEBUG ] Including configuration from 'c:\salt\conf\minion.d\minion_push.conf'
2018-11-23 00:03:51,523 9056 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion.d\minion_push.conf
2018-11-23 00:03:51,529 9056 2260 [salt.config .include_config][DEBUG ] Including configuration from 'c:\salt\conf\minion.d\providers.conf'
2018-11-23 00:03:51,530 9056 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion.d\providers.conf
2018-11-23 00:04:00,791 9056 1104 [salt.minion .__init__][DEBUG ] Minion(MinionBase) call salt.loader.grains(opts) completed
```
But salt-minion is not multi-threaded AFAIK; any command to run requires a new process to be created (from scratch, as fork() is not supported on Windows).
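That per-command process cost can be seen with a stock `multiprocessing` micro-benchmark (illustrative; on Windows only the "spawn" start method exists, which starts a fresh interpreter and re-imports modules for every child, while Linux defaults to the much cheaper "fork"):

```python
import multiprocessing as mp
import time

def noop():
    pass

def avg_start_seconds(method, runs=3):
    """Average wall time to start and join one do-nothing child process."""
    ctx = mp.get_context(method)
    t0 = time.perf_counter()
    for _ in range(runs):
        p = ctx.Process(target=noop)
        p.start()
        p.join()
    return (time.perf_counter() - t0) / runs

if __name__ == "__main__":
    # 'spawn' is the only option on Windows; on Linux compare with 'fork'.
    print(f"spawn: {avg_start_seconds('spawn'):.3f}s per process")
```

Note this only measures interpreter start-up; a Salt minion child additionally reloads config and (on Windows, per the logs above) re-collects grains, which is where the ~10s goes.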
```
2018-11-23 00:37:45,672 21392 727 [salt.loader .grains][ERROR ] grains conf_file in opts end
2018-11-23 00:37:45,698 21392 735 [salt.loader .grains][ERROR ] grains Run core grains START
2018-11-23 00:37:54,385 21392 748 [salt.loader .grains][ERROR ] grains Run core grains END
2018-11-23 00:37:54,385 21392 751 [salt.loader .grains][ERROR ] grains Run the rest of the grains
2018-11-23 00:37:54,687 21392 782 [salt.loader .grains][ERROR ] grains Run the rest of the grains END
2018-11-23 00:37:54,687 21392 839 [salt.loader .grains][ERROR ] grains function ends
```
The workaround. I assume that if this is used, you need to tell the minion to refresh the grains. From the minion config docs for `grains_cache`:

> Default: `False`
>
> The minion can locally cache grain data instead of refreshing the data each time the grain is referenced. By default this feature is disabled; to enable, set `grains_cache` to `True`.

```yaml
grains_cache: False
```
fork() in Python works; it's just very slow. Fork is the default on Unix/Linux; it looks like threads are the default for Windows.
However, the issue is that the grains are refreshed every time on Windows vs. at startup on Linux/Unix. The more Windows grains, the slower it will get (esp. if they use PowerShell, which is slow to start; I developed cmd.shell_info to reduce the number of times PowerShell was called just to get its version info).
On a side note, it would still be better if find_job were built into minion.py, with no forking or thread creation.
I have a minion with no grains and responses are very slow, so it's not a grains issue. I trigger state files from the salt master.
All minions have grains, i.e. the core grains like IP, OS version etc. In the log above they are taking just under 10 seconds. E.g. `salt -G os:Windows grains.items`.
Ok, sorry, no custom grains ;)
Basically, what is happening: you run a long-running task or highstate. On Windows this kicks off collecting the grains again (which should only happen at startup, but on Windows happens every time). Because it's a long-running task, find_job is called every 5 to 15 seconds to check how the task is going. However, find_job also causes the grains to be processed (10s). So the general load on the Windows server goes up as well, because all the grains are collected roughly every 10 seconds.
Hence my 2c also says find_job needs to be built into minion.py to reduce its overhead on all platforms, which is the second issue. The first issue is that grains are collected every time any request is made.
@damianw2 @kedare
Try `grains_cache: True` in the minion config of a non-production server.
Then use `salt -G os:Windows saltutil.refresh_grains` to refresh the grains now and then.
See if that fixes the slowness. It's a big workaround.
@saltstack/team-core Long term, I would like to see grains data and config data cached in msgpack (or a plain JSON text file).
No help. Minion restarted after config changes.

```
//-------------
Total states run:     10
Total run time:   76.674 s

root@SaltMaster:~# salt 'kpzk-UKweb' saltutil.refresh_grains --timeout=180
kpzk-UKweb0.y1tt4qlrvx3e1pzlry44o2vawa.cwx.internal.cloudapp.net:
    True
root@SaltMaster:~# salt 'kpzk-UKweb' state.apply appserver --timeout=180
//-------------
Succeeded: 10
Failed:     0
//-------------
Total states run:     10
Total run time:  139.829 s
```
Try things like:

```
time salt -G os:Windows disk.usage
time salt -G os:Windows cmd.run "dir \Windows /s"
```
It should not have made it slower. Try it without the timeout option.
Might have a bit more of a look at it tomorrow.
PS: highlight the text and click <> to format output as code.
```
time salt -G os:Windows disk.usage

real    0m16.118s
user    0m0.541s
sys     0m0.092s

root@SaltMaster:~# time salt -G os:Windows cmd.run "dir \Windows /s"
kpzk-UKweb0.y1tt4qlrvx3e1pzlry44o2vawa.cwx.internal.cloudapp.net:
    VALUE_TRIMMED
kpzk-USweb0.umdkseo133gudkwz3zyevj14yc.xx.internal.cloudapp.net:
    Minion did not return. [No response]

real    1m16.355s
user    0m1.207s
sys     0m0.148s
```
That's good news, assuming the UK box has the grains_cache setting and the other does not.
No, it's not. State files still respond very slowly, even if all the tasks were done before and the state file should only check the current status and do nothing. Example:
```
root@SaltMaster:~# time salt 'kpzk-UKweb*' state.apply appserver
Summary for kpzk-UKweb0.y1tt4qlrvx3e1pzlry44o2vawa.cwx.internal.cloudapp.net
-------------
Succeeded: 10
Failed:     0
-------------
Total states run:     10
Total run time:   88.623 s

real    1m54.140s
user    0m1.630s
sys     0m0.147s
```
2nd time:
```
Summary for kpzk-UKweb0.y1tt4qlrvx3e1pzlry44o2vawa.cwx.internal.cloudapp.net
-------------
Succeeded: 10
Failed:     0
-------------
Total states run:     10
Total run time:   90.950 s

real    1m50.369s
```
State file:
```sls
{% set pool_name = salt['cmd.run']('hostname') %}
---
install_prerequisites:
  pkg.installed:
    - pkgs:
      - npp
      - ms-vcpp-2012-redist_x64
      - dotnet-sdk

IIS-Web-Common-Http:
  win_servermanager.installed:
    - recurse: True
    - features:
      - Web-Common-Http
    - exclude:
      - Web-Http-Redirect
      - Web-DAV-Publishing

IIS-Web-Http-Logging:
  win_servermanager.installed:
    - recurse: false
    - features:
      - Web-Http-Logging

IIS-Web-Performance:
  win_servermanager.installed:
    - recurse: false
    - features:
      - Web-Stat-Compression

IIS-Web-Security:
  win_servermanager.installed:
    - recurse: false
    - features:
      - Web-Filtering

IIS-Web-Mgmt-Tools:
  win_servermanager.installed:
    - recurse: false
    - features:
      - Web-Mgmt-Console
      - Web-Mgmt-Service

NET-Framework:
  win_servermanager.installed:
    - recurse: false
    - features:
      - NET-Framework-45-ASPNET
      - NET-WCF-TCP-PortSharing45

site0-apppool:
  win_iis.create_apppool:
    - name: {{ pool_name }}

site0-site-setting:
  win_iis.deployed:
    - name: WebApp_{{ pool_name }}
    - sourcepath: c:\inetpub
    - apppool: {{ pool_name }}

site0-v1-app:
  win_iis.create_app:
    - name: WebApp_Phlexglobal
    - site: WebApp_{{ pool_name }}
    - sourcepath: C:\inetpub\temp
    - apppool: {{ pool_name }}
```
win_iis & win_servermanager call PowerShell, so they will be slow. PowerShell takes about 10 seconds to start, i.e. 10 seconds just to find out the current state of the settings before deciding whether action is required. So I am not surprised it took this time: 9 states x 10s = 90+ seconds.
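That per-call shell start-up cost is easy to measure (illustrative sketch; the PowerShell command line would be run on the Windows minion itself, and `sh` is shown purely for contrast on Linux):

```python
import subprocess
import time

def shell_startup_seconds(cmd):
    """Wall-clock time for one cold shell invocation that does nothing."""
    t0 = time.perf_counter()
    subprocess.run(cmd, capture_output=True, check=False)
    return time.perf_counter() - t0

# On a Windows minion:
#   shell_startup_seconds(["powershell.exe", "-NoProfile", "-Command", "exit"])
# For contrast on the Linux master:
#   shell_startup_seconds(["sh", "-c", "exit"])
```

If each `win_servermanager.installed` state pays that start-up cost once, the ~90s total above follows directly from the per-invocation number.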
`grains_cache: True` is a workaround for "Minion did not return. [No response]".
With this you should not need to set the timeout/-t option; it should now get results, even if the results take time to process.
There will be a small performance improvement; I could not tell you if it will be noticeable or not.
PS: you can use a grain instead of `{% set pool_name = salt['cmd.run']('hostname') %}`.
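For example, assuming the built-in `host` grain matches what `hostname` returns on these boxes (worth verifying with `grains.get host` first), the first line of the state file could become:

```sls
{% set pool_name = grains['host'] %}
```

This avoids spawning a shell just to read the hostname during rendering.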
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.
IMHO this was not resolved; Windows responses are still much slower than on Linux boxes.
The issue was that grains were being collected every time, when grains should only be collected at startup (or when you use salt-call). This was resolved about a year ago now. I suggest you upgrade.
Description of Issue/Question
The issue is described and supposedly fixed, but I seem to be having the same problem for a long time now. https://github.com/saltstack/salt/issues/27866
8 sec might seem like nothing, but try to debug or run some simple command 60 times in a minute with a +8 sec delay added each time. This is a major pain compared to the older Puppet system I managed.
Setup
Salt 2018.3.2 as master and 2018.3.1 as client on Win2016, Python 2.7 on both.
Steps to Reproduce Issue
```
time salt "client" test.ping
client:
    True

real    0m8.364s
user    0m0.952s
sys     0m0.112s
```
After changing the client config to `multiprocessing: False`. But I have read this is not recommended, and even the config has warnings. So what am I supposed to do, besides looking for alternatives to Salt itself?

```
client:
    True

real    0m1.160s
user    0m0.808s
sys     0m0.136s
```
The master's own minion (Ubuntu 16.04) has always been fast. No need to change anything.

```
master:
    True

real    0m1.291s
user    0m0.812s
sys     0m0.136s
```
Versions Report
```
Salt Version:
           Salt: 2018.3.2

Dependency Versions:
       dateutil: 2.4.2
         Jinja2: 2.8
 msgpack-python: 0.4.6
       pycrypto: 2.6.1
         Python: 2.7.12 (default, Dec  4 2017, 14:50:18)
   python-gnupg: 0.3.8
         PyYAML: 3.11
          PyZMQ: 15.2.0
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
        version: Ubuntu 16.04 xenial
```