damntoken closed this issue 4 years ago.
ping @saltstack/team-windows any ideas here?
Some ideas:
This is an issue with the way Python forks processes on Windows. @dwoz This is the issue I was talking to you about a while back.
My understanding is that there is a timeout, and that the minion is meant to respond with "I am working on it, please wait". I suspect this is not happening on Windows, as we should not need to use the -t option. See also https://github.com/saltstack/salt/issues/48470. A better solution may be not to fork at all, but to call something like "salt-call", which reads grains etc. from a file created when the minion started (if this is quicker), and pass the results back.
We are having the same issues on Windows.
We had some critical issues as the Windows minions were taking more than 10s to answer the find_job request from the master, and the salt CLI ended up returning a blank job result (like for an empty state.apply) without waiting for it to end, like this:
```
Executing run on [u'bcn6-web-staging-1']
bcn6-web-staging-1:
----------
```

Nothing else is printed.
We had to increase gather_job_timeout to 30s to fix the issue in our case. For 2 VMs on the same host answering a test.ping, there is a huge difference:

Windows: 16s
Linux: 4s
One difference in my case is that the minions are running Python 3 on Windows
Been dealing with this behavior on Windows for a while now... I have yet to see anything improve response times.
I've been testing Salt in a primarily Windows environment for a few days. x64 Py3 Windows minion 2018.3.2.
Excluding C:\salt doesn't seem to help. Commands still time out.
Turning off everything related to Windows Defender DOES seem to get the response time down to within the normal timeout period, but that seems like a poor solution in any case.
It sounds a lot like adding path exceptions to Windows Defender might stop on-access scanning within those paths, but doesn't exclude processes launched from those paths from other types of scanning. You might get similar symptoms from poor disk throughput, though. Are you using shared storage for the system drive?
Current testing is limited to a pair of VMs running on virtualbox (master and a server 2016 minion).
Those are just backed by a consumer-grade SSD.
How is your CPU allocated to the VMs? I imagine that if you are using a type 2 hypervisor you could end up with some weird latency/timeout issues if CPU cores are overcommitted (i.e. you have 4 CPU cores available, and both of these VMs are assigned 4 CPU cores).
4 Cores available - 1 assigned to salt master, 2 assigned to windows server VM
Very slow response on Win2016 from Azure. We must increase the timeout to 120s, otherwise the minion does not return and a "Minion did not return. [No response]" error appears.
I have an idea on how to fix it, i.e. stop forking for find_job. I looked into this on Linux, to see if I could reduce the forking.
In salt/minion.py, try 1 was:
```python
if isinstance(data['fun'], six.string_types):
    if data['fun'] == 'sys.reload_modules':
        self.functions, self.returners, self.function_errors, self.executors = self._load_modules()
        self.schedule.functions = self.functions
        self.schedule.returners = self.returners
    elif data['fun'] == 'saltutil.find_job':
        # Do not muck around, just find and return the data.
        # This is what find_job does; salt.utils.minion is already imported.
        data['ret'] = {}
        for running_data in salt.utils.minion.running(self.opts):
            if running_data['jid'] == data['jid']:
                data['ret'] = running_data['jid']
                break
        self._target(self, self.opts, data, self.connected)  # should be calling something else
```
Could not get the above to work, i.e. it needs someone who knows this code better.
Try 2 was in _handle_decoded_payload:
```python
instance = self
multiprocessing_enabled = self.opts.get('multiprocessing', True)
# For often-called, fast-returning functions, save forking, esp. on Windows.
if isinstance(data['fun'], six.string_types) and data['fun'] in ['saltutil.find_job']:
    multiprocessing_enabled = False
```
The above resulted in a single fork instead of two on Linux.
Looking at the code, Windows in general appears to have checks to try to stop it forking as much as Linux does, since Linux is a lot faster at forking.
Option 1 would be best for Windows.
Can someone who knows minion.py better have a look at this? Maybe yourself, @Ch3LL.
Any update? Do you plan to fix it? Working with more than 2 minions on Win2016 is impossible because of timeouts. @Ch3LL
salt-call is faster on windows than going via the salt-master.
It should be the number one priority for Windows. Unfortunately I do not know the code in salt/minion.py well enough to fix it. I believe the fix is to pull the find_job code into minion.py so that no fork or extra thread is required for a response. Maybe @cachedout can have a go. I could not get https://github.com/saltstack/salt/issues/48882#issuecomment-438294219 to avoid forking or creating a thread; I could not work out how to do a direct return of data.
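The "pull find_job into minion.py" idea can be sketched in isolation. This is a minimal illustration, not Salt's actual API: `running_jobs` stands in for what `salt.utils.minion.running(self.opts)` returns, and a real fix would still need to route the result back through the minion's return channel.

```python
# Minimal sketch of answering saltutil.find_job inline, with no fork or
# thread. `running_jobs` stands in for salt.utils.minion.running(opts);
# all names here are illustrative.
def find_job_inline(running_jobs, target_jid):
    """Return the running job whose jid matches, else an empty dict."""
    for job in running_jobs:
        if job.get('jid') == target_jid:
            return job
    return {}

# Example: one job currently running on the minion.
jobs = [{'jid': '20160503150049487736', 'fun': 'state.apply', 'pid': 1270}]
print(find_job_inline(jobs, '20160503150049487736'))  # the matching job dict
print(find_job_inline(jobs, '999'))                   # {}
```

The loop itself is trivial; the hard part (which the snippets above ran into) is returning the data to the master without going through the normal fork/thread job machinery.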
Why not use multithreading? That should be much faster, as most of the code would already be loaded.
Threading seems to be expensive as well:
```
time salt -t 120 ServerName saltutil.find_job 20160503150049487736  # job ID made up
ServerName:
----------

real    0m18.273s
user    0m5.131s
sys     0m0.542s
```
The three important lines; note the 10-second gap:

```
2018-11-22 23:23:18,929 12704 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion.d\providers.conf
2018-11-22 23:23:27,871 12704 1114 [salt.minion .__init__][INFO ] Creating minion process manager
2018-11-22 23:23:28,031 12704 1563 [salt.minion ._thread_return][INFO ] Starting a new job with PID 1270
```
The first number is the PID; the next number is the source line number. Full log of the request:
```
2018-11-22 23:23:16,053 2072 1428 [salt.minion ._handle_decoded_payload][INFO ] User root Executing command saltutil.find_job with jid 20181122232317340079
2018-11-22 23:23:16,053 2072 1435 [salt.minion ._handle_decoded_payload][DEBUG ] Command details {'fun': 'saltutil.find_job', 'tgt_type': 'glob', 'tgt': 'ServerNameHere', 'arg': ['20160503150049487736'], 'user': 'root', 'jid': '20181122232317340079', 'ret': ''}
2018-11-22 23:23:18,791 12704 890 [salt.log.setup .setup_multiprocessing_logging][DEBUG ] Multiprocessing queue logging configured for the process running under PID: 12704
2018-11-22 23:23:18,795 12704 697 [salt.loader .grains][DEBUG ] Grains refresh requested. Refreshing grains.
2018-11-22 23:23:18,796 12704 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion
2018-11-22 23:23:18,921 12704 2260 [salt.config .include_config][DEBUG ] Including configuration from 'c:\salt\conf\minion.d\_schedule.conf'
2018-11-22 23:23:18,921 12704 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion.d\_schedule.conf
2018-11-22 23:23:18,925 12704 2260 [salt.config .include_config][DEBUG ] Including configuration from 'c:\salt\conf\minion.d\minion_push.conf'
2018-11-22 23:23:18,926 12704 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion.d\minion_push.conf
2018-11-22 23:23:18,929 12704 2260 [salt.config .include_config][DEBUG ] Including configuration from 'c:\salt\conf\minion.d\providers.conf'
2018-11-22 23:23:18,929 12704 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion.d\providers.conf
2018-11-22 23:23:27,871 12704 1114 [salt.minion .__init__][INFO ] Creating minion process manager
2018-11-22 23:23:28,031 12704 1563 [salt.minion ._thread_return][INFO ] Starting a new job with PID 12704
2018-11-22 23:23:28,100 12704 100 [salt.utils.lazy .__getitem__][DEBUG ] LazyLoaded saltutil.find_job
2018-11-22 23:23:28,101 12704 100 [salt.utils.lazy .__getitem__][DEBUG ] LazyLoaded direct_call.execute
2018-11-22 23:23:28,102 12704 755 [salt.minion ._return_retry_timer][DEBUG ] Minion return retry timer set to 6 seconds (randomized)
2018-11-22 23:23:28,103 12704 1860 [salt.minion ._return_pub][INFO ] Returning information for job: 20181122232317340079
2018-11-22 23:23:28,281 12704 128 [salt.transport.zeromq.__new__][DEBUG ] Initializing new AsyncZeroMQReqChannel for ('c:\\salt\\conf\\pki\\minion', 'ServerNameHere', 'tcp://ServerNameHere:4506', 'aes')
2018-11-22 23:23:28,282 12704 463 [salt.crypt .__new__][DEBUG ] Initializing new AsyncAuth for ('c:\\salt\\conf\\pki\\minion', 'digit.lakeside.lan', 'tcp://ServerNameHere:4506')
2018-11-22 23:23:28,306 12704 194 [salt.transport.zeromq.__singleton_init__][DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://ServerNameHere:4506
2018-11-22 23:23:28,343 12704 1029 [salt.transport.zeromq._init_socket][DEBUG ] Trying to connect to: tcp://192.168.53.3:4506
2018-11-22 23:23:28,345 12704 128 [salt.transport.zeromq.__new__][DEBUG ] Initializing new AsyncZeroMQReqChannel for ('c:\\salt\\conf\\pki\\minion', 'ServerNameHere', 'tcp://ServerNameHere:4506', 'clear')
2018-11-22 23:23:28,345 12704 194 [salt.transport.zeromq.__singleton_init__][DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://ServerNameHere:4506
2018-11-22 23:23:28,381 12704 1029 [salt.transport.zeromq._init_socket][DEBUG ] Trying to connect to: tcp://192.168.53.3:4506
2018-11-22 23:23:28,381 12704 207 [salt.crypt .get_rsa_pub_key][DEBUG ] salt.crypt.get_rsa_pub_key: Loading public key
2018-11-22 23:23:28,494 12704 847 [salt.crypt .decrypt_aes][DEBUG ] Decrypting the current master AES key
2018-11-22 23:23:28,495 12704 199 [salt.crypt .get_rsa_key][DEBUG ] salt.crypt.get_rsa_key: Loading private key
2018-11-22 23:23:28,495 12704 179 [salt.crypt ._get_key_with_evict][DEBUG ] salt.crypt._get_key_with_evict: Loading private key
2018-11-22 23:23:28,497 12704 776 [salt.crypt .get_keys][DEBUG ] Loaded minion key: c:\salt\conf\pki\minion\minion.pem
2018-11-22 23:23:28,576 12704 207 [salt.crypt .get_rsa_pub_key][DEBUG ] salt.crypt.get_rsa_pub_key: Loading public key
```
Also keep in mind that find_job is called about every 5 to 15 seconds while a request is running, and it takes about 10 seconds for it to be processed.
More digging on Windows. It seems salt.loader.grains(opts) is called on every request, so I suspect the more grains there are, the slower this will be. On Linux this only runs once, at startup.
I suspect minions need to save the collected grains to a file (using, say, msgpack) and reload them, and maybe refresh the grains once an hour? As more grains are added, Windows will get slower and slower.
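A rough sketch of that caching idea. This is illustrative only (JSON is used here instead of msgpack so the snippet has no dependencies; the cache path, refresh interval, and `collect_grains` callable are all assumptions, not Salt code):

```python
import json
import os
import tempfile
import time

# Illustrative location and staleness window; real code would live in the
# minion's cache directory (e.g. c:\salt\var\cache on Windows).
CACHE_FILE = os.path.join(tempfile.gettempdir(), "grains_cache.json")
MAX_AGE_SECONDS = 3600  # refresh grains at most once an hour

def load_grains(collect_grains):
    """Return grains from the cache file, re-collecting only when stale."""
    try:
        if time.time() - os.path.getmtime(CACHE_FILE) < MAX_AGE_SECONDS:
            with open(CACHE_FILE) as fh:
                return json.load(fh)
    except OSError:
        pass  # no cache yet, or unreadable: fall through and re-collect
    grains = collect_grains()  # the slow part (~10s on Windows per the logs)
    with open(CACHE_FILE, "w") as fh:
        json.dump(grains, fh)
    return grains
```

The first call pays the full collection cost; every call within the next hour reads the file instead, which is roughly what the `grains_cache` minion option (mentioned further down) does.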
```
2018-11-23 00:03:51,386 9056 1102 [salt.minion .__init__][DEBUG ] Minion(MinionBase) call salt.loader.grains(opts)
2018-11-23 00:03:51,386 9056 697 [salt.loader .grains][DEBUG ] Grains refresh requested. Refreshing grains.
2018-11-23 00:03:51,387 9056 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion
2018-11-23 00:03:51,518 9056 2260 [salt.config .include_config][DEBUG ] Including configuration from 'c:\salt\conf\minion.d\_schedule.conf'
2018-11-23 00:03:51,519 9056 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion.d\_schedule.conf
2018-11-23 00:03:51,523 9056 2260 [salt.config .include_config][DEBUG ] Including configuration from 'c:\salt\conf\minion.d\minion_push.conf'
2018-11-23 00:03:51,523 9056 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion.d\minion_push.conf
2018-11-23 00:03:51,529 9056 2260 [salt.config .include_config][DEBUG ] Including configuration from 'c:\salt\conf\minion.d\providers.conf'
2018-11-23 00:03:51,530 9056 2117 [salt.config ._read_conf_file][DEBUG ] Reading configuration from c:\salt\conf\minion.d\providers.conf
2018-11-23 00:04:00,791 9056 1104 [salt.minion .__init__][DEBUG ] Minion(MinionBase) call salt.loader.grains(opts) completed
```
But salt-minion is not multi-threaded AFAIK; any command to run requires a new process to be created (from scratch, as fork() is not supported on Windows).
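That per-command process cost can be seen with a stock `multiprocessing` micro-benchmark (illustrative; on Windows only the "spawn" start method exists, which starts a fresh interpreter and re-imports modules for every child, while Linux defaults to the much cheaper "fork"):

```python
import multiprocessing as mp
import time

def noop():
    pass

def avg_start_seconds(method, runs=3):
    """Average wall time to start and join one do-nothing child process."""
    ctx = mp.get_context(method)
    t0 = time.perf_counter()
    for _ in range(runs):
        p = ctx.Process(target=noop)
        p.start()
        p.join()
    return (time.perf_counter() - t0) / runs

if __name__ == "__main__":
    # 'spawn' is the only option on Windows; on Linux compare with 'fork'.
    print(f"spawn: {avg_start_seconds('spawn'):.3f}s per process")
```

Note this only measures interpreter start-up; a Salt minion child additionally reloads config and (on Windows, per the logs above) re-collects grains, which is where the ~10s goes.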
```
2018-11-23 00:37:45,672 21392 727 [salt.loader .grains][ERROR ] grains conf_file in opts end
2018-11-23 00:37:45,698 21392 735 [salt.loader .grains][ERROR ] grains Run core grains START
2018-11-23 00:37:54,385 21392 748 [salt.loader .grains][ERROR ] grains Run core grains END
2018-11-23 00:37:54,385 21392 751 [salt.loader .grains][ERROR ] grains Run the rest of the grains
2018-11-23 00:37:54,687 21392 782 [salt.loader .grains][ERROR ] grains Run the rest of the grains END
2018-11-23 00:37:54,687 21392 839 [salt.loader .grains][ERROR ] grains function ends
```
The workaround. I assume that if this is used, you need to tell the minion to refresh the grains. From the minion config docs for `grains_cache`:

> Default: `False`
>
> The minion can locally cache grain data instead of refreshing the data each time the grain is referenced. By default this feature is disabled; to enable, set `grains_cache` to `True`.

```yaml
grains_cache: False
```
fork() in Python works; it's just very slow. Fork is the default on Unix/Linux; it looks like threads are the default for Windows.
However, the issue is that the grains are refreshed every time on Windows vs. at startup on Linux/Unix. The more Windows grains, the slower it will get (esp. if they use PowerShell, which is slow to start; I developed cmd.shell_info to reduce the number of times PowerShell was called just to get its version info).
On a side note, it would still be better if find_job were built into minion.py, with no forking or thread creation.
I have a minion with no grains and responses are very slow, so it's not a grains issue. I trigger state files from the salt master.
All minions have grains, i.e. the core grains like IP, OS version etc. In the log above they are taking just under 10 seconds. E.g. `salt -G os:Windows grains.items`.
Ok, sorry, no custom grains ;)
Basically, what is happening: you run a long-running task or highstate. On Windows this kicks off collecting the grains again (which should only happen at startup, but on Windows happens every time). Because it's a long-running task, find_job is called every 5 to 15 seconds to check how the task is going. However, find_job also causes the grains to be processed (10s). So the general load on the Windows server goes up as well, because all the grains are collected roughly every 10 seconds.
Hence my 2c also says find_job needs to be built into minion.py to reduce its overhead on all platforms, which is the second issue. The first issue is that grains are collected every time any request is made.
@damianw2 @kedare
Try `grains_cache: True` in the minion config of a non-production server.
Then use `salt -G os:Windows saltutil.refresh_grains` to refresh the grains now and then.
See if that fixes the slowness. It's a big workaround.
@saltstack/team-core Long term, I would like to see grains data and config data cached in msgpack (or a plain JSON text file).
No help. Minion restarted after config changes.

```
//-------------
Total states run:     10
Total run time:   76.674 s

root@SaltMaster:~# salt 'kpzk-UKweb' saltutil.refresh_grains --timeout=180
kpzk-UKweb0.y1tt4qlrvx3e1pzlry44o2vawa.cwx.internal.cloudapp.net:
    True
root@SaltMaster:~# salt 'kpzk-UKweb' state.apply appserver --timeout=180
//-------------
Succeeded: 10
Failed:     0
//-------------
Total states run:     10
Total run time:  139.829 s
```
Try things like:

```
time salt -G os:Windows disk.usage
time salt -G os:Windows cmd.run "dir \Windows /s"
```
It should not have made it slower. Try it without the timeout option.
Might have a bit more of a look at it tomorrow.
PS: highlight the text and click <> to format output as code.
```
time salt -G os:Windows disk.usage

real    0m16.118s
user    0m0.541s
sys     0m0.092s

root@SaltMaster:~# time salt -G os:Windows cmd.run "dir \Windows /s"
kpzk-UKweb0.y1tt4qlrvx3e1pzlry44o2vawa.cwx.internal.cloudapp.net:
    VALUE_TRIMMED
kpzk-USweb0.umdkseo133gudkwz3zyevj14yc.xx.internal.cloudapp.net:
    Minion did not return. [No response]

real    1m16.355s
user    0m1.207s
sys     0m0.148s
```
That's good news, assuming the UK box has the grains_cache setting and the other does not.
No, it's not. State files still respond very slowly, even if all the tasks were done before and the state file should only check the current status and do nothing. Example:
```
root@SaltMaster:~# time salt 'kpzk-UKweb*' state.apply appserver
Summary for kpzk-UKweb0.y1tt4qlrvx3e1pzlry44o2vawa.cwx.internal.cloudapp.net
-------------
Succeeded: 10
Failed:     0
-------------
Total states run:     10
Total run time:   88.623 s

real    1m54.140s
user    0m1.630s
sys     0m0.147s
```
2nd time:
```
Summary for kpzk-UKweb0.y1tt4qlrvx3e1pzlry44o2vawa.cwx.internal.cloudapp.net
-------------
Succeeded: 10
Failed:     0
-------------
Total states run:     10
Total run time:   90.950 s

real    1m50.369s
```
State file:
```sls
{% set pool_name = salt['cmd.run']('hostname') %}
---
install_prerequisites:
  pkg.installed:
    - pkgs:
      - npp
      - ms-vcpp-2012-redist_x64
      - dotnet-sdk

IIS-Web-Common-Http:
  win_servermanager.installed:
    - recurse: True
    - features:
      - Web-Common-Http
    - exclude:
      - Web-Http-Redirect
      - Web-DAV-Publishing

IIS-Web-Http-Logging:
  win_servermanager.installed:
    - recurse: false
    - features:
      - Web-Http-Logging

IIS-Web-Performance:
  win_servermanager.installed:
    - recurse: false
    - features:
      - Web-Stat-Compression

IIS-Web-Security:
  win_servermanager.installed:
    - recurse: false
    - features:
      - Web-Filtering

IIS-Web-Mgmt-Tools:
  win_servermanager.installed:
    - recurse: false
    - features:
      - Web-Mgmt-Console
      - Web-Mgmt-Service

NET-Framework:
  win_servermanager.installed:
    - recurse: false
    - features:
      - NET-Framework-45-ASPNET
      - NET-WCF-TCP-PortSharing45

site0-apppool:
  win_iis.create_apppool:
    - name: {{ pool_name }}

site0-site-setting:
  win_iis.deployed:
    - name: WebApp_{{ pool_name }}
    - sourcepath: c:\inetpub
    - apppool: {{ pool_name }}

site0-v1-app:
  win_iis.create_app:
    - name: WebApp_Phlexglobal
    - site: WebApp_{{ pool_name }}
    - sourcepath: C:\inetpub\temp
    - apppool: {{ pool_name }}
```
win_iis & win_servermanager call PowerShell, so they will be slow. PowerShell takes about 10 seconds to start, i.e. 10 seconds just to find out the current state of the settings before deciding whether action is required. So I am not surprised it took this time: 9 states x 10s = 90+ seconds.
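That per-call shell start-up cost is easy to measure (illustrative sketch; the PowerShell command line would be run on the Windows minion itself, and `sh` is shown purely for contrast on Linux):

```python
import subprocess
import time

def shell_startup_seconds(cmd):
    """Wall-clock time for one cold shell invocation that does nothing."""
    t0 = time.perf_counter()
    subprocess.run(cmd, capture_output=True, check=False)
    return time.perf_counter() - t0

# On a Windows minion:
#   shell_startup_seconds(["powershell.exe", "-NoProfile", "-Command", "exit"])
# For contrast on the Linux master:
#   shell_startup_seconds(["sh", "-c", "exit"])
```

If each `win_servermanager.installed` state pays that start-up cost once, the ~90s total above follows directly from the per-invocation number.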
`grains_cache: True` is a workaround for "Minion did not return. [No response]".
With this you should not need to set the timeout/-t option; it should now get results, even if the results take time to process.
There will be a small performance improvement; I could not tell you if it will be noticeable or not.
PS: you can use a grain instead of `{% set pool_name = salt['cmd.run']('hostname') %}`.
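For example, assuming the built-in `host` grain matches what `hostname` returns on these boxes (worth verifying with `grains.get host` first), the first line of the state file could become:

```sls
{% set pool_name = grains['host'] %}
```

This avoids spawning a shell just to read the hostname during rendering.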
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.
IMHO this was not resolved; Windows responses are still much slower than on Linux boxes.
The issue was that grains were being collected every time, when grains should only be collected at startup (or when you use salt-call). This was resolved about a year ago now. I suggest you upgrade.
Description of Issue/Question
The issue is described and supposedly fixed, but I seem to be having the same problem for a long time now. https://github.com/saltstack/salt/issues/27866
8 sec might seem like nothing, but try to debug or run some simple command 60 times in a minute with a +8 sec delay added each time. This is a major pain compared to the older Puppet system I managed.
Setup
Salt 2018.3.2 as master and 2018.3.1 as client on Win2016, Python 2.7 on both.
Steps to Reproduce Issue
```
time salt "client" test.ping
client:
    True

real    0m8.364s
user    0m0.952s
sys     0m0.112s
```
After changing the client config to `multiprocessing: False`. But I have read this is not recommended, and even the config has warnings. So what am I supposed to do, besides looking for alternatives to Salt itself?

```
client:
    True

real    0m1.160s
user    0m0.808s
sys     0m0.136s
```
The master's own minion (Ubuntu 16.04) has always been fast. No need to change anything.

```
master:
    True

real    0m1.291s
user    0m0.812s
sys     0m0.136s
```
Versions Report
```
Salt Version:
           Salt: 2018.3.2

Dependency Versions:
       dateutil: 2.4.2
         Jinja2: 2.8
 msgpack-python: 0.4.6
       pycrypto: 2.6.1
         Python: 2.7.12 (default, Dec  4 2017, 14:50:18)
   python-gnupg: 0.3.8
         PyYAML: 3.11
          PyZMQ: 15.2.0
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
        version: Ubuntu 16.04 xenial
```