snowflakedb / snowflake-connector-python

Snowflake Connector for Python
https://pypi.python.org/pypi/snowflake-connector-python/
Apache License 2.0

SNOW-937263: Connector "keep alive" feature reports SSLError with version > 2.8.0 #1764

Closed: Alquant closed this issue 10 months ago

Alquant commented 11 months ago

Python version

3.10.12

Operating system and processor architecture

Linux-4.4.0-x86_64-with-glibc2.27

Installed packages

Django-4.2.6 MarkupSafe-2.1.3 Unidecode-1.3.7 aiohttp-3.8.6 aiosignal-1.3.1 asgiref-3.7.2 asn1crypto-1.5.1 astroid-3.0.0 async-timeout-4.0.3 attrs-23.1.0 beautifulsoup4-4.12.2 black-23.9.1 bs4-0.0.1 cachecontrol-0.13.1 cachetools-5.3.1 certifi-2023.7.22 cffi-1.16.0 cfgv-3.4.0 charset-normalizer-3.3.0 click-8.1.7 common-py-0.6.0 contourpy-1.1.1 cryptography-41.0.4 cycler-0.12.1 decorator-5.1.1 dill-0.3.7 distlib-0.3.7 django-cors-headers-4.2.0 djangorestframework-3.14.0 djangorestframework-jsonapi-6.1.0 dnspython-2.4.2 et-xmlfile-1.1.0 exchange-calendars-4.5 filelock-3.12.4 firebase-admin-6.2.0 flake8-6.1.0 fonttools-4.43.1 freezegun-1.2.2 frozenlist-1.4.0 fsspec-2023.9.2 gcsfs-2023.9.2 google-api-core-2.12.0 google-api-python-client-2.103.0 google-auth-2.23.3 google-auth-httplib2-0.1.1 google-auth-oauthlib-1.1.0 google-cloud-appengine-logging-1.3.2 google-cloud-audit-log-0.2.5 google-cloud-core-2.3.3 google-cloud-firestore-2.12.0 google-cloud-logging-3.8.0 google-cloud-pubsub-2.18.4 google-cloud-run-0.9.1 google-cloud-secret-manager-2.16.4 google-cloud-storage-2.11.0 google-cloud-tasks-2.14.2 google-crc32c-1.5.0 google-resumable-media-2.6.0 googleapis-common-protos-1.60.0 greenlet-3.0.0 grpc-google-iam-v1-0.12.6 grpcio-1.59.0 grpcio-status-1.59.0 gunicorn-21.2.0 hi-dateinfer-0.4.6 httplib2-0.22.0 identify-2.5.30 idna-3.4 inflection-0.5.1 intrinio_sdk-6.26.1 investpy-1.0.8 isort-5.12.0 joblib-1.3.2 kiwisolver-1.4.5 korean-lunar-calendar-0.3.1 lxml-4.9.3 mailjet_rest-1.3.4 matplotlib-3.8.0 mccabe-0.7.0 more-itertools-10.1.0 msgpack-1.0.7 multidict-6.0.4 mypy-extensions-1.0.0 networkx-3.1 nodeenv-1.8.0 numpy-1.26.0 oauthlib-3.2.2 openpyxl-3.1.2 oscrypto-1.3.0 overrides-7.4.0 packaging-23.2 pandas-2.0.3 pathspec-0.11.2 pillow-10.0.1 platformdirs-3.11.0 pre-commit-3.4.0 proto-plus-1.22.3 protobuf-4.24.4 pyOpenSSL-23.2.0 pyarrow-10.0.1 pyasn1-0.5.0 pyasn1-modules-0.3.0 pycodestyle-2.11.0 pycoingecko-3.1.0 pycparser-2.21 pycryptodomex-3.19.0 pyflakes-3.1.0 pyjwt-2.8.0 pylint-3.0.1 pyluach-2.2.0 pymongo-4.5.0 pyparsing-3.1.1 python-dateutil-2.8.2 python-dotenv-1.0.0 pytz-2023.3.post1 pyyaml-6.0.1 quandl-3.7.0 requests-2.31.0 requests-oauthlib-1.3.1 rest-pandas-1.1.0 retrying-1.3.4 rsa-4.9 scikit-learn-1.3.1 scipy-1.11.3 sentry-sdk-1.31.0 setuptools-68.2.2 six-1.16.0 snowflake-connector-python-3.2.1 snowflake-sqlalchemy-1.5.0 sortedcontainers-2.4.0 soupsieve-2.5 sqlalchemy-1.4.49 sqlparse-0.4.4 threadpoolctl-3.2.0 tomli-2.0.1 tomlkit-0.12.1 toolz-0.12.0 typing-extensions-4.8.0 tzdata-2023.3 uritemplate-4.1.1 urllib3-1.26.17 virtualenv-20.24.5 werkzeug-3.0.0 wheel-0.41.2 xlrd-2.0.1 yarl-1.9.2

What did you do?

Created a Snowflake connection inside a server run with Django on Google Cloud Run (built with a buildpack), using the arguments client_session_keep_alive=True and client_session_keep_alive_heartbeat_frequency=900.
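
For reference, a minimal sketch of that setup (the account and credentials below are placeholders, not our real configuration):

import snowflake.connector

# client_session_keep_alive asks the connector to send periodic heartbeats;
# client_session_keep_alive_heartbeat_frequency sets the interval in seconds.
conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    client_session_keep_alive=True,
    client_session_keep_alive_heartbeat_frequency=900,
)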

What did you expect to see?

Note that this is linked to #1361. I've attached a screenshot of one of the countless similar logs we get from machines deployed on Google Cloud Run. You can see a warning SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")) followed by an error Failed to get the response. Hanging?. A few observations:

  1. Compared to the previous issue #1361, even though we get a lot of these errors, the backend is now still able to answer all the other calls we make, so seeing this error no longer means we need to fully reboot all our servers.
  2. We have 4 machines running for our tests, and which ones are affected by this error appears to be random. For example, after booting the machines, only one got the error when requesting the first heartbeat (900 seconds). Later on, that one failed again while a second one failed too; however, for the next couple of hours there was nothing, no error (and no forced reboot by Cloud Run).
  3. Linked to the previous two points: for issue #1361, the error happened on every heartbeat of every machine, so something has changed since then.
  4. For a reason unknown to me at least, sometimes when this error happens on one machine, we actually get 4 warnings, each with a different requestId and request_guid, within a span of ~15 seconds. This leads to 4 Hanging? errors for the same machine, also within a span of ~15 seconds.
  5. As said in #1361, pinning the connector to <2.8.1 makes this error almost entirely disappear. It still happens randomly, but very rarely (~2-3 occurrences every month, across at least 50 machines).

After looking online for similar issues, this SysCallError tends to happen because of a security mismatch between the Python connector and the machine it tries to connect to. I tried to analyse the security level on the Snowflake machines and it looks good; however, when I ping our Snowflake URL I see quite a few different IPs, so my questions are:

  1. Is it possible that the "security" level on some of the Snowflake machines is not updated to the latest version, thereby provoking this error?
  2. Could this whole problem come from a proxy between Google Cloud Run and Snowflake? That leads back to the previous question: a security level that is not updated on this possible proxy?
  3. As you can see in the screenshot, we see a Retry(total=0.... Is it somehow possible to request a couple of retries before getting the Hanging? error? (A sketch of one possible application-level workaround follows the screenshot.)

[Screenshot 2023-10-11 at 11:06:48: log output showing the SSLError warning, the Retry(total=0... line, and the "Failed to get the response. Hanging?" error]
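
For illustration, one way to approximate point 3 at the application level (a sketch only; the retry count and backoff are arbitrary, this wraps our own queries rather than the connector's internal urllib3 retries, and catching OperationalError is one plausible choice):

import time

from snowflake.connector.errors import OperationalError

def execute_with_retry(conn, sql, retries=3, backoff=2.0):
    # Retry a query a few times before giving up, instead of surfacing the
    # first "Failed to get the response. Hanging?" failure to the caller.
    for attempt in range(retries):
        try:
            with conn.cursor() as cur:
                cur.execute(sql)
                return cur.fetchall()
        except OperationalError:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))  # simple linear backoff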

Can you set logging to DEBUG and collect the logs?

import logging

# Attach a DEBUG-level handler (stderr by default) to the connector's logger.
for logger_name in ('snowflake.connector',):
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    ch = logging.StreamHandler()
    ch.setLevel(logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - %(funcName)s() - %(levelname)s - %(message)s'))
    logger.addHandler(ch)
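
(Note that the StreamHandler above writes to stderr, which Cloud Run forwards to Cloud Logging, so the DEBUG output should appear alongside the warnings shown in the screenshot.)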
sfc-gh-jdu commented 11 months ago

Hi @Alquant, thanks for your feedback. Could you elaborate on the security mismatch between the Python connector and the machine it tries to connect to? By "machine", do you mean a Snowflake warehouse? And may I know whether you are blocked by this error/warning?

Alquant commented 11 months ago

Hi @sfc-gh-jdu. I suspect this issue is similar to the scenario described in this StackOverflow discussion.

To provide more context:

  1. We've observed that constraining the cryptography library to versions earlier than 37.0.0 mitigates these errors, implying a potential issue with recent updates to that package, particularly concerning security protocols.
  2. The issue doesn't reproduce consistently, but its sporadic nature suggests there might be an inconsistency in how security protocols are implemented or updated, perhaps due to recent partial updates.

Given these factors, I hypothesize that the communication between our server on Cloud Run and the Snowflake data warehouse might be encountering complications. Specifically, it seems plausible that a component within our infrastructure stack is not fully synchronized with the latest security protocol standards, leading to these occasional handshake failures.

Addressing your final query: our operations are not currently hampered to the point of a standstill. We've managed to maintain functionality despite the frequent errors, in contrast with previous months, when server reboots were necessary to restore connectivity with Snowflake. However, my concern leans towards understanding the root cause of this issue, as it could well be a symptom of a more substantial underlying problem.

Alquant commented 10 months ago

I'm escalating this issue as it's proving more problematic than initially anticipated. As mentioned in my previous post, the 'Failed to get the response. Hanging?' error does not sever the connection, but our heartbeat checks, set at a 1-hour frequency, are failing repeatedly.

The critical concern here is that these consecutive failures are happening frequently enough to invalidate the session after 4 hours. This aligns with Snowflake's authentication token policy, as detailed in their documentation here. Consequently, this leads to an 'Authentication token has expired. The user must authenticate again.' error.

In response to this, we are currently developing our own heartbeat mechanism. However, we view this as a temporary solution rather than a permanent fix, as it introduces significant additional costs that we would prefer to avoid.
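
For reference, a minimal sketch of what such a hand-rolled heartbeat could look like (the interval and query are illustrative; real code would add logging and reconnection):

import threading
import time

def start_heartbeat(conn, interval_seconds=900):
    # Periodically run a trivial query so the session token keeps being
    # refreshed, independently of the connector's built-in keep-alive.
    def beat():
        while True:
            time.sleep(interval_seconds)
            try:
                with conn.cursor() as cur:
                    cur.execute("SELECT 1")
            except Exception:
                pass  # in real code: log the failure and trigger a reconnect

    threading.Thread(target=beat, daemon=True).start()

The significant additional cost we mention comes essentially from needing CPU to stay allocated outside of request handling on Cloud Run so that a background thread like this can actually run.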

Alquant commented 10 months ago

We've identified a probable cause of the erratic errors we've been experiencing. As mentioned previously, our Django server is hosted on Cloud Run, where we utilise the cpu-throttling option to manage costs. With this setup, CPU resources are allocated primarily during active request processing, which can leave minimal computing power available during heartbeat intervals.

Our investigation suggests that the transition to version 37 of the cryptography library may be a key factor. Prior to this update, the computational demands of the security and cryptographic processing during heartbeats appear to have been relatively low, which would explain why the Hanging? error was rarely observed before. The updated version likely introduced heavier processing requirements, making it harder for the heartbeats to secure sufficient CPU resources under the cpu-throttling mode. Successful heartbeats in this context may simply be the instances that happened to receive the limited available CPU power.