saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Install Salt from the Salt package repositories here:
https://docs.saltproject.io/salt/install-guide/en/latest/
Apache License 2.0
14.19k stars 5.48k forks source link

[3006.x][BUG] Salt-API unstable after upgrade from 3004.2 to 3006.1 #64550

Closed dmurphy18 closed 1 year ago

dmurphy18 commented 1 year ago

Description Salt-API unstable after upgrade from 3004.2 to 3006.1

Setup Customer is not using Aria Config components, only Salt open

Enabled eauth="auto" on one of our salt-master's in our failing -test environment.

external_auth: auto: saltuser: *admin_acls

then attempted to curl the salt-api. note the time it took to complete: 0m0.293s time curl -v -sSk https://s12345.svr.us.dog.net:8000/login -H 'Accept: application/json' -m 45 -d eauth=auto -d username=saltuser -d password=saltuser

Please be as specific as possible and give set-up details.

Steps to Reproduce the behavior enabled eauth="auto" on one of our salt-master's in our failing -test environment.

external_auth: auto: saltuser: *admin_acls

then attempted to curl the salt-api. note the time it took to complete: 0m0.293s

time curl -v -sSk https://s12345.svr.us.dog.net:8000/login -H 'Accept: application/json' -m 45 -d eauth=auto -d username=saltuser -d password=saltuser

About to connect() to s12345.svr.us.dog.net port 8000 (#0) Trying 192.168.173.29... Connected to s12345.svr.us.dog.net (192.168.173.29) port 8000 (#0) Initializing NSS with certpath: sql:/etc/pki/nssdb skipping SSL peer certificate verification SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 Server certificate: subject: CN=cassandra-test.dpg.net,OU=gti,O=Dog,L=Kat,ST=Ohio,C=US start date: Jul 20 10:16:27 2022 GMT expire date: Jul 20 10:16:27 2023 GMT common name: cassandra-test.dog.net issuer: CN=PSIN0P551,DC=exchad,DC=dog,DC=net

POST /login HTTP/1.1 User-Agent: curl/7.29.0 Host: 121345.svr.us.dog.net:8000 Accept: application/json Content-Length: 46 Content-Type: application/x-www-form-urlencoded

upload completely sent off: 46 out of 46 bytes < HTTP/1.1 200 OK < Content-Type: application/json < Server: CherryPy/18.6.1 < Date: Thu, 22 Jun 2023 18:36:43 GMT < Allow: GET, HEAD, POST < Access-Control-Allow-Origin: * < Access-Control-Expose-Headers: GET, POST < Access-Control-Allow-Credentials: true < X-Auth-Token: 441098ff2f9d91ec9c47bfa18c99f34e61a6eefb < Vary: Accept-Encoding < Content-Length: 210 < Set-Cookie: session_id=441098ff2f9d91ec9c47bfa18c99f34e61a6eefb; expires=Fri, 23 Jun 2023 04:36:43 GMT; Max-Age=36000; Path=/ < Connection #0 to host s12345.svr.us.dog.net left intact Unknown macro: {"return"} real 0m0.293s user 0m0.083s sys 0m0.074s

I then proceeded to run that curl command a few more times with similar results. However, after a minute or two, the request was unable to complete in under 45 seconds:

time curl -v -sSk https://mouse.svr.us.dog.net:8000/login -H 'Accept: application/json' -m 45 -d eauth=auto -d username=saltuser -d password=saltuser

About to connect() to mouse.svr.us.dog.net port 8000 (#0) Trying 192.169.200.136... Connected to mouse.svr.us.dog.net (192.169.200.136) port 8000 (#0) Initializing NSS with certpath: sql:/etc/pki/nssdb skipping SSL peer certificate verification SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 Server certificate: subject: CN=cassandra-test.dog.net,OU=gti,O=Dog,L=Kat,ST=Ohio,C=US start date: Jul 20 10:16:27 2022 GMT expire date: Jul 20 10:16:27 2023 GMT common name: cassandra-test.dpg.net issuer: CN=PSIN0P551,DC=exchad,DC=dog,DC=net

POST /login HTTP/1.1 User-Agent: curl/7.29.0 Host: mouse.svr.us.dog.net:8000 Accept: application/json Content-Length: 46 Content-Type: application/x-www-form-urlencoded

upload completely sent off: 46 out of 46 bytes Operation timed out after 45001 milliseconds with 0 out of -1 bytes received Closing connection 0 curl: (28) Operation timed out after 45001 milliseconds with 0 out of -1 bytes received real 0m45.014s user 0m0.088s sys 0m0.076s

  1. Expected behavior:

    • Salt-API should work as expected
  2. Versions Report:

    • mouse: Salt Version: Salt: 3006.1

Python Version: Python: 3.10.11 (main, May 5 2023, 02:31:54) [GCC 11.2.0]

Dependency Versions: cffi: 1.14.6 cherrypy: unknown dateutil: 2.8.1 docker-py: Not Installed gitdb: 4.0.10 gitpython: 3.1.31 Jinja2: 3.1.2 libgit2: Not Installed looseversion: 1.0.2 M2Crypto: Not Installed Mako: Not Installed msgpack: 1.0.2 msgpack-pure: Not Installed mysql-python: Not Installed packaging: 22.0 pycparser: 2.21 pycrypto: Not Installed pycryptodome: 3.9.8 pygit2: Not Installed python-gnupg: 0.4.8 PyYAML: 5.4.1 PyZMQ: 23.2.0 relenv: 0.12.3 smmap: 5.0.0 timelib: 0.2.4 Tornado: 4.5.3 ZMQ: 4.3.4

System Versions: dist: rhel 7.9 Maipo locale: utf-8 machine: x86_64 release: 3.10.0-1160.83.1.el7.x86_64 system: Linux version: Red Hat Enterprise Linux Server 7.9 Maipo

s12345: Salt Version: Salt: 3006.1

Python Version: Python: 3.10.11 (main, May 5 2023, 02:31:54) [GCC 11.2.0]

Dependency Versions: cffi: 1.14.6 cherrypy: unknown dateutil: 2.8.1 docker-py: Not Installed gitdb: 4.0.10 gitpython: 3.1.31 Jinja2: 3.1.2 libgit2: Not Installed looseversion: 1.0.2 M2Crypto: Not Installed Mako: Not Installed msgpack: 1.0.2 msgpack-pure: Not Installed mysql-python: Not Installed packaging: 22.0 pycparser: 2.21 pycrypto: Not Installed pycryptodome: 3.9.8 pygit2: Not Installed python-gnupg: 0.4.8 PyYAML: 5.4.1 PyZMQ: 23.2.0 relenv: 0.12.3 smmap: 5.0.0 timelib: 0.2.4 Tornado: 4.5.3 ZMQ: 4.3.4

System Versions: dist: rhel 7.9 Maipo locale: utf-8 machine: x86_64 release: 3.10.0-1160.83.1.el7.x86_64 system: Linux version: Red Hat Enterprise Linux Server 7.9 Maipo

Milkcart: Salt Version: Salt: 3006.1

Python Version: Python: 3.10.11 (main, May 5 2023, 02:31:54) [GCC 11.2.0]

Dependency Versions: cffi: 1.14.6 cherrypy: unknown dateutil: 2.8.1 docker-py: Not Installed gitdb: 4.0.10 gitpython: 3.1.31 Jinja2: 3.1.2 libgit2: Not Installed looseversion: 1.0.2 M2Crypto: Not Installed Mako: Not Installed msgpack: 1.0.2 msgpack-pure: Not Installed mysql-python: Not Installed packaging: 22.0 pycparser: 2.21 pycrypto: Not Installed pycryptodome: 3.9.8 pygit2: Not Installed python-gnupg: 0.4.8 PyYAML: 5.4.1 PyZMQ: 23.2.0 relenv: 0.12.3 smmap: 5.0.0 timelib: 0.2.4 Tornado: 4.5.3 ZMQ: 4.3.4

System Versions: dist: rhel 7.9 Maipo locale: utf-8 machine: x86_64 release: 3.10.0-1160.83.1.el7.x86_64 system: Linux version: Red Hat Enterprise Linux Server 7.9 Maipo

  1. Additional context:
    • noticed number of exceptions in the event bus for minion data refresh events. I am not sure if this relevant but just mentioning it in the hope that it could be of some use.

minion/refresh/horse

{"Minion data cache refresh": "horse", "_stamp": "2023-06-22T14:05:47.065752"} _salt_error {"pretag": null, "cmd": "_minion_event", "tag": "_salt_error", "data":

{"message": "The minion function caused an exception", "args": ["The minion function caused an exception"], "_stamp": "2023-06-22T14:05:47.391652"} , "id": "carthorse", "_stamp": "2023-06-22T14:05:47.426772"}

--THROUGH TEMPLATE Salt Stack Default, PR is created by sfry, reviewed by jonessean,lskrzypczak,christopherw and submitted by jonessean--

Expected behavior A clear and concise description of what you expected to happen.

Screenshots If applicable, add screenshots to help explain your problem.

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.) ```yaml PASTE HERE ```

Additional context Add any other context about the problem here.

dmurphy18 commented 1 year ago

Concerned that the versions reported for cherrypy are unknown cherrypy: unknown when they should be reporting cherrypy: 18.6.1, which makes me wonder how the upgrade from 3004.2 to 2006.1 was performed.

dmurphy18 commented 1 year ago

@Phillip-A-Fry test ping

Philip-A-Fry commented 1 year ago

Adding output from salt-run --versions-report (to include CherryPy version):

Salt Version: Salt: 3006.1

Python Version: Python: 3.10.11 (main, May 5 2023, 02:31:54) [GCC 11.2.0]

Dependency Versions: cffi: 1.14.6 cherrypy: 18.6.1 dateutil: 2.8.1 docker-py: Not Installed gitdb: 4.0.10 gitpython: 3.1.31 Jinja2: 3.1.2 libgit2: Not Installed looseversion: 1.0.2 M2Crypto: Not Installed Mako: Not Installed msgpack: 1.0.2 msgpack-pure: Not Installed mysql-python: Not Installed packaging: 22.0 pycparser: 2.21 pycrypto: Not Installed pycryptodome: 3.9.8 pygit2: Not Installed python-gnupg: 0.4.8 PyYAML: 5.4.1 PyZMQ: 23.2.0 relenv: 0.12.3 smmap: 5.0.0 timelib: 0.2.4 Tornado: 4.5.3 ZMQ: 4.3.4

System Versions: dist: rhel 7.9 Maipo locale: utf-8 machine: x86_64 release: 3.10.0-1160.83.1.el7.x86_64 system: Linux version: Red Hat Enterprise Linux Server 7.9 Maipo

dmurphy18 commented 1 year ago

So issue with how they upgraded from 3004.1 to 3006.1. The log file minion-20230621 has Python 3.6 for example, 3006.1 would be Python 3.10, and not the default Python 3.6 for RHEL 7

 217   File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 624, in send
 218     recv = yield future 
 219   File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
 220     value = future.result()
 221   File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
 222     raise_exc_info(self._exc_info)
 223   File "<string>", line 4, in raise_exc_info
 224 salt.exceptions.SaltReqTimeoutError: Message timed out

Lots of exception in tornado too, and ERRORs

 362 2023-06-20 14:39:24,626 [salt.utils.event                                       :375 ][ERROR   ][9875] Encountered StreamClosedException
 363 2023-06-20 14:39:29,637 [salt.utils.event                                       :375 ][ERROR   ][9875] Encountered StreamClosedException
 364 2023-06-20 14:39:34,647 [salt.utils.event                                       :375 ][ERROR   ][9875] Encountered StreamClosedException

probably due to the exceptions thrown

The minion log files are full of Python 3.6, if they had upgrade correctly, it should be Python 3.10, looks like the minions are still using 3004.1 since using the default OS Python 3.6 that comes with RHEL 7. Salt API works only on the Master, but don't see the relevant name (or its IP Address) for the master in any of the logs, master or minion, or other logs. Wondering about the relevance of the logs since appear unrelated to the machine which the curl was applied to.

Note: doc link on how to upgrade to 3006.1 (onedir) https://docs.saltproject.io/salt/install-guide/en/3006/topics/upgrade-to-onedir.html

I went through the update process from Salt 3004.2 to Salt 3006.1 as directed by the documentation and it all looks good, and while it leaves a directory /usr/lib/python3.6/site-packages/salt, however it is empty.

They should upgrade as per the doc link and include the logs for the actual machine showing the issue, that would be /var/log/salt/master and /var/log/salt/api

dmurphy18 commented 1 year ago

Seeing issue with exceptions thrown

14224 2023-06-27 07:39:27,947 [cherrypy.error   :213 ][WARNING ][69492] [27/Jun/2023:07:39:27] ENGINE socket.error 8
14225 Traceback (most recent call last):
14226   File "/opt/saltstack/salt/lib/python3.10/site-packages/cheroot/server.py", line 1287, in communicate
14227     req.respond()
14228   File "/opt/saltstack/salt/lib/python3.10/site-packages/cheroot/server.py", line 1077, in respond
14229     self.server.gateway(self).respond()
14230   File "/opt/saltstack/salt/lib/python3.10/site-packages/cheroot/wsgi.py", line 145, in respond
14231     self.write(chunk)
14232   File "/opt/saltstack/salt/lib/python3.10/site-packages/cheroot/wsgi.py", line 231, in write
14233     self.req.write(chunk)
14234   File "/opt/saltstack/salt/lib/python3.10/site-packages/cheroot/server.py", line 1133, in write
14235     self.conn.wfile.write(chunk)
14236   File "/opt/saltstack/salt/lib/python3.10/site-packages/cheroot/makefile.py", line 438, in write
14237     res = super().write(val, *args, **kwargs)
14238   File "/opt/saltstack/salt/lib/python3.10/site-packages/cheroot/makefile.py", line 36, in write
14239     self._flush_unlocked()
14240   File "/opt/saltstack/salt/lib/python3.10/site-packages/cheroot/makefile.py", line 45, in _flush_unlocked
14241     n = self.raw.write(bytes(self._write_buf))
14242   File "/opt/saltstack/salt/lib/python3.10/socket.py", line 723, in write
14243     return self._sock.send(b)
14244   File "/opt/saltstack/salt/lib/python3.10/ssl.py", line 1206, in send
14245     return self._sslobj.write(data)
14246 ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2426)

and may be related to a Python 3.10 issue https://bugs.python.org/issue45487 while is closed, need to check if fix exists in the version of Python 3.10 included in Salt 3006.1

dmurphy18 commented 1 year ago

This link may have a possible explanation for the issue not being in ssl but being surfaced by ssl, however further down in the comments a person also experienced similar problems and was not using encryption, so mileage may vary, https://github.com/boto/boto3/issues/3359#issuecomment-1200006400

However, further in the comments section, ssl errors can be caused by certificate issues https://github.com/boto/boto3/issues/3359#issuecomment-1335705475

A further interesting discussion on the problem (Python3.9) but detailing possible reasons are a threading issue, certificate handling and handshake issue https://github.com/eclipse/paho.mqtt.python/issues/637#issuecomment-1180349788 https://github.com/eclipse/paho.mqtt.python/issues/637#issuecomment-1183673626 https://github.com/eclipse/paho.mqtt.python/issues/637#issuecomment-1184482908

dmurphy18 commented 1 year ago

Another site with good information to consider, using AWS etc, but something to look into https://codetinkering.com/ssl-validation-failed-python/ and section EOF occurred in violation of protocol.

dmurphy18 commented 1 year ago

Run the command in the initial Jira issue, with salt-master settings for external_auth and saltuser on an up to date RHEL 7 from the machines command line, in a script

#!/bin/bash

count=1000000
idx=0

while [ ${idx} -le ${count} ]
do
    exec 3>&1 4>&2
    result=$(TIMEFORMAT="%R" ; { time curl -v -sSk https://localhost:8000/login -H "Accept: application/json" -m 45 -d eauth=auto -d username=saltuser -d password=saltuser 1>&3 2>&4; } 2>&1)
    exec 3>&- 4>&-
    ## echo "\nresult '${result}'"
    secs=$(echo ${result} | cut -d '.' -f 1)
    ## echo "secs '${secs}'"
    if [[ ${secs} -ne 0 ]]; then
        echo "too much time, '${result}', idx is '${idx}'"
        exit
    fi
    ((idx++))
    echo ""
    echo "idx is '${idx}'"
done

echo ""
echo "done ${count} loops without issue"

Basically hammering it, and currently at iteration 32173 without issue. Suspect the problem is occurring due to the user's environment given pounding on salt-api, with a similar simple setup is not showing issues.

cd into /var/log/salt and grep'ing showed no issues

[root@Unknown salt]# grep -i exception *
[root@Unknown salt]# grep -i SSLEOFError *
[root@Unknown salt]# 

Finally got to a time difference where the seconds was more than zero, idx 361109 and time was 1.852 seconds, not a very big time difference. There was nothing anomalous in the logs either, no exceptions or SSLEOFError.

Master's configuration file for external_auth

admin_acls: &admin_acls
  - '@wheel'   # to allow access to all wheel modules
  - '@runner'  # to allow access to all runner modules
  - '@jobs'    # to allow access to the jobs runner and/or wheel module
  - .*

external_auth:
  auto:
    saltuser: *admin_acls
dmurphy18 commented 1 year ago

The link from above https://github.com/eclipse/paho.mqtt.python/issues/637#issuecomment-1183673626 talks about

If you are using Certificates and you don't wait until the connection handshake finishes before you publish then you will get the:
ssl.ssleoferror: eof occurred in violation of protocol (_ssl.c:2396)

This is in the same function that the _ssl.c EOF problem is occurring, literally a problem with PyErr_CheckSignals, whereas our issue is with the retval not being 0, and it is set just before the PyErr_CheckSignals call.

Also from the above comments link https://github.com/eclipse/paho.mqtt.python/issues/637#issuecomment-1184482908, mentions

I was running the same python script on my computer at home and on a windows EC2 in AWS.
The one at home worked the one on EC2 didn't. I just (after a lot of googling and not finding much and wiresharking it) put a delay "time.sleep" after the connection to give it a little time for a handshake and it works on both now. Not the most elegant solution but my script is just a test script.
I'm sure there is a way to check the connection status but I didn't look for that. Good luck.

Hence beginning to think this may be a issue with handshaking on the connection, the time.sleep, being a crude form of delay giving the handshake to finish.

dmurphy18 commented 1 year ago

Further issues which also appear to have the error with no salt involved https://github.com/sabnzbd/sabnzbd/issues/2065

This Python issue may explain why the SSLEOFError is occurring, gets raised during TLS handshake peer closing connection https://github.com/python/cpython/commit/495bd035662fda29639f9d52bb6baebea31d72fa

Also this issue https://github.com/cherrypy/cheroot/issues/346 describes complete failure to connect after TLS connection terminated which appears similar to where the client fails to reconnect.

dmurphy18 commented 1 year ago

Here is an example of the cipher being an issue https://github.com/python/cpython/issues/89650, particularly https://github.com/python/cpython/issues/89650#issuecomment-1093933689

Also noted that 'onedir' salt is using Python 3.10 internally and hence is using TLS 1.3, however on RHEL 7, the default Python is 3.6 and it will use TLS 1.2. However TLS implementation should be capable of handling between the different versions, but thought to point it out.

dmurphy18 commented 1 year ago

Looking through Oracle changes on the Java 11 changelogs to see any issues that could affect TLS peer negotiation , list:

However searching for JAVA and SSLEOFError with HttpClient turned up this issue https://github.com/mlflow/mlflow/issues/8963 which has the exact same line 2426 producing the SSLEOFError, but here it is due to a limitation with urllib3.connectionpool retrying with Python 3.10 not handling large files, begging the question as to what the Java HttpClient is doing at the time.

dmurphy18 commented 1 year ago

Interesting comments in OpenSSL about openssl 1.1.1 and EOF issue, preserving here for easy of finding again:

dmurphy18 commented 1 year ago

Additional OpenSSL and LibreSSL (curl issues too) discussions

And while these relate around 1.1.1e, a number of the issues are still open, and not satisfactorily completed to a number of the people involved in the discussions.

One avenue of investigation is to force TLS v1.2 between server and client and see if the issue reoccurs (need to research how to do this), given the number of issues reported concerning TLS v1.3 and the number of open issues w.r.t OpenSSL v1.1.1 which reaches EOL on 11th September 2023 - two months away giving doubt to OpenSSL 1.1.1 issues being resolved at this point

dmurphy18 commented 1 year ago

The method to enable TLS v1.2 is done in the /etc/salt/master configuration file, for example:

 479 # Use TLS/SSL encrypted connection between master and minion.
 480 # Can be set to a dictionary containing keyword arguments corresponding to Python's
 481 # 'ssl.wrap_socket' method.
 482 # Default is None.
 483 #ssl:
 484 #    keyfile: <path_to_keyfile>
 485 #    certfile: <path_to_certfile>
 486 #    ssl_version: PROTOCOL_TLSv1_2

just need to uncomment and fill out as appropriate, can you have Dennis try this

dmurphy18 commented 1 year ago

Closing this since using work-around given the issues are in OpenSSL v1.1.1. Still waiting for report on forcing use of TLS v1.2, but persons gone on vacation, will add that detail if it comes, probably end of August.