Closed dmurphy18 closed 1 year ago
Concerned that the versions reported for cherrypy are unknown cherrypy: unknown
when they should be reporting cherrypy: 18.6.1
, which makes me wonder how the upgrade from 3004.2 to 2006.1 was performed.
@Phillip-A-Fry test ping
Adding output from salt-run --versions-report
(to include CherryPy version):
Salt Version: Salt: 3006.1
Python Version: Python: 3.10.11 (main, May 5 2023, 02:31:54) [GCC 11.2.0]
Dependency Versions: cffi: 1.14.6 cherrypy: 18.6.1 dateutil: 2.8.1 docker-py: Not Installed gitdb: 4.0.10 gitpython: 3.1.31 Jinja2: 3.1.2 libgit2: Not Installed looseversion: 1.0.2 M2Crypto: Not Installed Mako: Not Installed msgpack: 1.0.2 msgpack-pure: Not Installed mysql-python: Not Installed packaging: 22.0 pycparser: 2.21 pycrypto: Not Installed pycryptodome: 3.9.8 pygit2: Not Installed python-gnupg: 0.4.8 PyYAML: 5.4.1 PyZMQ: 23.2.0 relenv: 0.12.3 smmap: 5.0.0 timelib: 0.2.4 Tornado: 4.5.3 ZMQ: 4.3.4
System Versions: dist: rhel 7.9 Maipo locale: utf-8 machine: x86_64 release: 3.10.0-1160.83.1.el7.x86_64 system: Linux version: Red Hat Enterprise Linux Server 7.9 Maipo
So issue with how they upgraded from 3004.1 to 3006.1. The log file minion-20230621 has Python 3.6 for example, 3006.1 would be Python 3.10, and not the default Python 3.6 for RHEL 7
217 File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 624, in send
218 recv = yield future
219 File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
220 value = future.result()
221 File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
222 raise_exc_info(self._exc_info)
223 File "<string>", line 4, in raise_exc_info
224 salt.exceptions.SaltReqTimeoutError: Message timed out
Lots of exception in tornado too, and ERRORs
362 2023-06-20 14:39:24,626 [salt.utils.event :375 ][ERROR ][9875] Encountered StreamClosedException
363 2023-06-20 14:39:29,637 [salt.utils.event :375 ][ERROR ][9875] Encountered StreamClosedException
364 2023-06-20 14:39:34,647 [salt.utils.event :375 ][ERROR ][9875] Encountered StreamClosedException
probably due to the exceptions thrown
The minion log files are full of Python 3.6, if they had upgrade correctly, it should be Python 3.10, looks like the minions are still using 3004.1 since using the default OS Python 3.6 that comes with RHEL 7. Salt API works only on the Master, but don't see the relevant name (or its IP Address) for the master in any of the logs, master or minion, or other logs. Wondering about the relevance of the logs since appear unrelated to the machine which the curl was applied to.
Note: doc link on how to upgrade to 3006.1 (onedir) https://docs.saltproject.io/salt/install-guide/en/3006/topics/upgrade-to-onedir.html
I went through the update process from Salt 3004.2 to Salt 3006.1 as directed by the documentation and it all looks good, and while it leaves a directory /usr/lib/python3.6/site-packages/salt
, however it is empty.
They should upgrade as per the doc link and include the logs for the actual machine showing the issue, that would be /var/log/salt/master and /var/log/salt/api
Seeing issue with exceptions thrown
14224 2023-06-27 07:39:27,947 [cherrypy.error :213 ][WARNING ][69492] [27/Jun/2023:07:39:27] ENGINE socket.error 8
14225 Traceback (most recent call last):
14226 File "/opt/saltstack/salt/lib/python3.10/site-packages/cheroot/server.py", line 1287, in communicate
14227 req.respond()
14228 File "/opt/saltstack/salt/lib/python3.10/site-packages/cheroot/server.py", line 1077, in respond
14229 self.server.gateway(self).respond()
14230 File "/opt/saltstack/salt/lib/python3.10/site-packages/cheroot/wsgi.py", line 145, in respond
14231 self.write(chunk)
14232 File "/opt/saltstack/salt/lib/python3.10/site-packages/cheroot/wsgi.py", line 231, in write
14233 self.req.write(chunk)
14234 File "/opt/saltstack/salt/lib/python3.10/site-packages/cheroot/server.py", line 1133, in write
14235 self.conn.wfile.write(chunk)
14236 File "/opt/saltstack/salt/lib/python3.10/site-packages/cheroot/makefile.py", line 438, in write
14237 res = super().write(val, *args, **kwargs)
14238 File "/opt/saltstack/salt/lib/python3.10/site-packages/cheroot/makefile.py", line 36, in write
14239 self._flush_unlocked()
14240 File "/opt/saltstack/salt/lib/python3.10/site-packages/cheroot/makefile.py", line 45, in _flush_unlocked
14241 n = self.raw.write(bytes(self._write_buf))
14242 File "/opt/saltstack/salt/lib/python3.10/socket.py", line 723, in write
14243 return self._sock.send(b)
14244 File "/opt/saltstack/salt/lib/python3.10/ssl.py", line 1206, in send
14245 return self._sslobj.write(data)
14246 ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2426)
and may be related to a Python 3.10 issue https://bugs.python.org/issue45487 while is closed, need to check if fix exists in the version of Python 3.10 included in Salt 3006.1
This link may have a possible explanation for the issue not being in ssl but being surfaced by ssl, however further down in the comments a person also experienced similar problems and was not using encryption, so mileage may vary, https://github.com/boto/boto3/issues/3359#issuecomment-1200006400
However, further in the comments section, ssl errors can be caused by certificate issues https://github.com/boto/boto3/issues/3359#issuecomment-1335705475
A further interesting discussion on the problem (Python3.9) but detailing possible reasons are a threading issue, certificate handling and handshake issue https://github.com/eclipse/paho.mqtt.python/issues/637#issuecomment-1180349788 https://github.com/eclipse/paho.mqtt.python/issues/637#issuecomment-1183673626 https://github.com/eclipse/paho.mqtt.python/issues/637#issuecomment-1184482908
Another site with good information to consider, using AWS etc, but something to look into
https://codetinkering.com/ssl-validation-failed-python/ and section EOF occurred in violation of protocol
.
Run the command in the initial Jira issue, with salt-master settings for external_auth and saltuser on an up to date RHEL 7 from the machines command line, in a script
#!/bin/bash
count=1000000
idx=0
while [ ${idx} -le ${count} ]
do
exec 3>&1 4>&2
result=$(TIMEFORMAT="%R" ; { time curl -v -sSk https://localhost:8000/login -H "Accept: application/json" -m 45 -d eauth=auto -d username=saltuser -d password=saltuser 1>&3 2>&4; } 2>&1)
exec 3>&- 4>&-
## echo "\nresult '${result}'"
secs=$(echo ${result} | cut -d '.' -f 1)
## echo "secs '${secs}'"
if [[ ${secs} -ne 0 ]]; then
echo "too much time, '${result}', idx is '${idx}'"
exit
fi
((idx++))
echo ""
echo "idx is '${idx}'"
done
echo ""
echo "done ${count} loops without issue"
Basically hammering it, and currently at iteration 32173 without issue. Suspect the problem is occurring due to the user's environment given pounding on salt-api, with a similar simple setup is not showing issues.
cd into /var/log/salt and grep'ing showed no issues
[root@Unknown salt]# grep -i exception *
[root@Unknown salt]# grep -i SSLEOFError *
[root@Unknown salt]#
Finally got to a time difference where the seconds was more than zero, idx 361109 and time was 1.852 seconds, not a very big time difference. There was nothing anomalous in the logs either, no exceptions or SSLEOFError.
Master's configuration file for external_auth
admin_acls: &admin_acls
- '@wheel' # to allow access to all wheel modules
- '@runner' # to allow access to all runner modules
- '@jobs' # to allow access to the jobs runner and/or wheel module
- .*
external_auth:
auto:
saltuser: *admin_acls
The link from above https://github.com/eclipse/paho.mqtt.python/issues/637#issuecomment-1183673626 talks about
If you are using Certificates and you don't wait until the connection handshake finishes before you publish then you will get the:
ssl.ssleoferror: eof occurred in violation of protocol (_ssl.c:2396)
This is in the same function that the _ssl.c EOF problem is occurring, literally a problem with PyErr_CheckSignals, whereas our issue is with the retval not being 0, and it is set just before the PyErr_CheckSignals call.
Also from the above comments link https://github.com/eclipse/paho.mqtt.python/issues/637#issuecomment-1184482908, mentions
I was running the same python script on my computer at home and on a windows EC2 in AWS.
The one at home worked the one on EC2 didn't. I just (after a lot of googling and not finding much and wiresharking it) put a delay "time.sleep" after the connection to give it a little time for a handshake and it works on both now. Not the most elegant solution but my script is just a test script.
I'm sure there is a way to check the connection status but I didn't look for that. Good luck.
Hence beginning to think this may be a issue with handshaking on the connection, the time.sleep, being a crude form of delay giving the handshake to finish.
Further issues which also appear to have the error with no salt involved https://github.com/sabnzbd/sabnzbd/issues/2065
This Python issue may explain why the SSLEOFError is occurring, gets raised during TLS handshake peer closing connection https://github.com/python/cpython/commit/495bd035662fda29639f9d52bb6baebea31d72fa
Also this issue https://github.com/cherrypy/cheroot/issues/346 describes complete failure to connect after TLS connection terminated which appears similar to where the client fails to reconnect.
Here is an example of the cipher being an issue https://github.com/python/cpython/issues/89650, particularly https://github.com/python/cpython/issues/89650#issuecomment-1093933689
Also noted that 'onedir' salt is using Python 3.10 internally and hence is using TLS 1.3, however on RHEL 7, the default Python is 3.6 and it will use TLS 1.2. However TLS implementation should be capable of handling between the different versions, but thought to point it out.
Looking through Oracle changes on the Java 11 changelogs to see any issues that could affect TLS peer negotiation , list:
However searching for JAVA and SSLEOFError with HttpClient turned up this issue https://github.com/mlflow/mlflow/issues/8963 which has the exact same line 2426 producing the SSLEOFError, but here it is due to a limitation with urllib3.connectionpool retrying with Python 3.10 not handling large files, begging the question as to what the Java HttpClient is doing at the time.
Interesting comments in OpenSSL about openssl 1.1.1 and EOF issue, preserving here for easy of finding again:
628 case SSL_ERROR_SYSCALL:
629 {
630 if (e == 0) {
631 PySocketSockObject *s = GET_SOCKET(sslsock);
632 if (ret == 0 || (((PyObject *)s) == Py_None)) {
633 p = PY_SSL_ERROR_EOF;
634 type = state->PySSLEOFErrorObject;
635 errstr = "EOF occurred in violation of protocol";
636 } else if (s && ret == -1) {
637 /* underlying BIO reported an I/O error */
638 ERR_clear_error();
639 #ifdef MS_WINDOWS
640 if (err.ws) {
641 return PyErr_SetFromWindowsErr(err.ws);
642 }
643 #endif
644 if (err.c) {
645 errno = err.c;
646 return PyErr_SetFromErrno(PyExc_OSError);
647 }
648 else {
649 p = PY_SSL_ERROR_EOF;
650 type = state->PySSLEOFErrorObject;
651 errstr = "EOF occurred in violation of protocol";
652 }
653 } else { /* possible? */
654 p = PY_SSL_ERROR_SYSCALL;
655 type = state->PySSLSyscallErrorObject;
656 errstr = "Some I/O error occurred";
657 }
Also the Python fix https://github.com/python/cpython/commit/495bd035662fda29639f9d52bb6baebea31d72fa removed
Py_INCREF(s);
s->errorhandler();
Py_DECREF(s);
return NULL;
Around the errorhandler, wonder why that was needed, but not now, and PySocketSockObject *s = GET_SOCKET(sslsock);
, but probably old code since Doc/whatsnew/3.8.rst states
2075 #if PY_VERSION_HEX < 0x03080000
2076 // Workaround for Python issue 35810; no longer necessary in Python 3.8
2077 PY_INCREF(type)
2078 #endif
but wonder if there is still an issue here, given inc/dec references around an errorhandler might still be applicable, but doubt it given the only reference to PY_INCREF, is in the 3.8 doc, nothing in the code.
Additional OpenSSL and LibreSSL (curl issues too) discussions
And while these relate around 1.1.1e, a number of the issues are still open, and not satisfactorily completed to a number of the people involved in the discussions.
One avenue of investigation is to force TLS v1.2 between server and client and see if the issue reoccurs (need to research how to do this), given the number of issues reported concerning TLS v1.3 and the number of open issues w.r.t OpenSSL v1.1.1 which reaches EOL on 11th September 2023 - two months away giving doubt to OpenSSL 1.1.1 issues being resolved at this point
The method to enable TLS v1.2 is done in the /etc/salt/master configuration file, for example:
479 # Use TLS/SSL encrypted connection between master and minion.
480 # Can be set to a dictionary containing keyword arguments corresponding to Python's
481 # 'ssl.wrap_socket' method.
482 # Default is None.
483 #ssl:
484 # keyfile: <path_to_keyfile>
485 # certfile: <path_to_certfile>
486 # ssl_version: PROTOCOL_TLSv1_2
just need to uncomment and fill out as appropriate, can you have Dennis try this
Closing this since using work-around given the issues are in OpenSSL v1.1.1. Still waiting for report on forcing use of TLS v1.2, but persons gone on vacation, will add that detail if it comes, probably end of August.
Description Salt-API unstable after upgrade from 3004.2 to 3006.1
Setup Customer is not using Aria Config components, only Salt open
Enabled eauth="auto" on one of our salt-master's in our failing -test environment.
external_auth: auto: saltuser: *admin_acls
then attempted to curl the salt-api. note the time it took to complete: 0m0.293s time curl -v -sSk https://s12345.svr.us.dog.net:8000/login -H 'Accept: application/json' -m 45 -d eauth=auto -d username=saltuser -d password=saltuser
Please be as specific as possible and give set-up details.
Steps to Reproduce the behavior enabled eauth="auto" on one of our salt-master's in our failing -test environment.
external_auth: auto: saltuser: *admin_acls
then attempted to curl the salt-api. note the time it took to complete: 0m0.293s
time curl -v -sSk https://s12345.svr.us.dog.net:8000/login -H 'Accept: application/json' -m 45 -d eauth=auto -d username=saltuser -d password=saltuser
About to connect() to s12345.svr.us.dog.net port 8000 (#0) Trying 192.168.173.29... Connected to s12345.svr.us.dog.net (192.168.173.29) port 8000 (#0) Initializing NSS with certpath: sql:/etc/pki/nssdb skipping SSL peer certificate verification SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 Server certificate: subject: CN=cassandra-test.dpg.net,OU=gti,O=Dog,L=Kat,ST=Ohio,C=US start date: Jul 20 10:16:27 2022 GMT expire date: Jul 20 10:16:27 2023 GMT common name: cassandra-test.dog.net issuer: CN=PSIN0P551,DC=exchad,DC=dog,DC=net
I then proceeded to run that curl command a few more times with similar results. However, after a minute or two, the request was unable to complete in under 45 seconds:
time curl -v -sSk https://mouse.svr.us.dog.net:8000/login -H 'Accept: application/json' -m 45 -d eauth=auto -d username=saltuser -d password=saltuser
About to connect() to mouse.svr.us.dog.net port 8000 (#0) Trying 192.169.200.136... Connected to mouse.svr.us.dog.net (192.169.200.136) port 8000 (#0) Initializing NSS with certpath: sql:/etc/pki/nssdb skipping SSL peer certificate verification SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 Server certificate: subject: CN=cassandra-test.dog.net,OU=gti,O=Dog,L=Kat,ST=Ohio,C=US start date: Jul 20 10:16:27 2022 GMT expire date: Jul 20 10:16:27 2023 GMT common name: cassandra-test.dpg.net issuer: CN=PSIN0P551,DC=exchad,DC=dog,DC=net
Expected behavior:
Versions Report:
Python Version: Python: 3.10.11 (main, May 5 2023, 02:31:54) [GCC 11.2.0]
Dependency Versions: cffi: 1.14.6 cherrypy: unknown dateutil: 2.8.1 docker-py: Not Installed gitdb: 4.0.10 gitpython: 3.1.31 Jinja2: 3.1.2 libgit2: Not Installed looseversion: 1.0.2 M2Crypto: Not Installed Mako: Not Installed msgpack: 1.0.2 msgpack-pure: Not Installed mysql-python: Not Installed packaging: 22.0 pycparser: 2.21 pycrypto: Not Installed pycryptodome: 3.9.8 pygit2: Not Installed python-gnupg: 0.4.8 PyYAML: 5.4.1 PyZMQ: 23.2.0 relenv: 0.12.3 smmap: 5.0.0 timelib: 0.2.4 Tornado: 4.5.3 ZMQ: 4.3.4
System Versions: dist: rhel 7.9 Maipo locale: utf-8 machine: x86_64 release: 3.10.0-1160.83.1.el7.x86_64 system: Linux version: Red Hat Enterprise Linux Server 7.9 Maipo
s12345: Salt Version: Salt: 3006.1
Python Version: Python: 3.10.11 (main, May 5 2023, 02:31:54) [GCC 11.2.0]
Dependency Versions: cffi: 1.14.6 cherrypy: unknown dateutil: 2.8.1 docker-py: Not Installed gitdb: 4.0.10 gitpython: 3.1.31 Jinja2: 3.1.2 libgit2: Not Installed looseversion: 1.0.2 M2Crypto: Not Installed Mako: Not Installed msgpack: 1.0.2 msgpack-pure: Not Installed mysql-python: Not Installed packaging: 22.0 pycparser: 2.21 pycrypto: Not Installed pycryptodome: 3.9.8 pygit2: Not Installed python-gnupg: 0.4.8 PyYAML: 5.4.1 PyZMQ: 23.2.0 relenv: 0.12.3 smmap: 5.0.0 timelib: 0.2.4 Tornado: 4.5.3 ZMQ: 4.3.4
System Versions: dist: rhel 7.9 Maipo locale: utf-8 machine: x86_64 release: 3.10.0-1160.83.1.el7.x86_64 system: Linux version: Red Hat Enterprise Linux Server 7.9 Maipo
Milkcart: Salt Version: Salt: 3006.1
Python Version: Python: 3.10.11 (main, May 5 2023, 02:31:54) [GCC 11.2.0]
Dependency Versions: cffi: 1.14.6 cherrypy: unknown dateutil: 2.8.1 docker-py: Not Installed gitdb: 4.0.10 gitpython: 3.1.31 Jinja2: 3.1.2 libgit2: Not Installed looseversion: 1.0.2 M2Crypto: Not Installed Mako: Not Installed msgpack: 1.0.2 msgpack-pure: Not Installed mysql-python: Not Installed packaging: 22.0 pycparser: 2.21 pycrypto: Not Installed pycryptodome: 3.9.8 pygit2: Not Installed python-gnupg: 0.4.8 PyYAML: 5.4.1 PyZMQ: 23.2.0 relenv: 0.12.3 smmap: 5.0.0 timelib: 0.2.4 Tornado: 4.5.3 ZMQ: 4.3.4
System Versions: dist: rhel 7.9 Maipo locale: utf-8 machine: x86_64 release: 3.10.0-1160.83.1.el7.x86_64 system: Linux version: Red Hat Enterprise Linux Server 7.9 Maipo
minion/refresh/horse
{"Minion data cache refresh": "horse", "_stamp": "2023-06-22T14:05:47.065752"} _salt_error {"pretag": null, "cmd": "_minion_event", "tag": "_salt_error", "data":
{"message": "The minion function caused an exception", "args": ["The minion function caused an exception"], "_stamp": "2023-06-22T14:05:47.391652"} , "id": "carthorse", "_stamp": "2023-06-22T14:05:47.426772"}
--THROUGH TEMPLATE Salt Stack Default, PR is created by sfry, reviewed by jonessean,lskrzypczak,christopherw and submitted by jonessean--
Expected behavior A clear and concise description of what you expected to happen.
Screenshots If applicable, add screenshots to help explain your problem.
Versions Report
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.) ```yaml PASTE HERE ```Additional context Add any other context about the problem here.