Alquant closed this issue 10 months ago
Hi @Alquant, thanks for your feedback. Could you elaborate on the "security mismatch between the python connector and the machine it tries to connect to"? By "machine", do you mean a Snowflake warehouse? And may I know whether you are blocked by this error/warning?
Hi @sfc-gh-jdu. I suspect this issue to be similar to the scenario described in this StackOverflow discussion.
To provide more context:
Given these factors, I hypothesize that the communication between our server on Cloud Run and the Snowflake data warehouse might be encountering complications. Specifically, it seems plausible that a component within our infrastructure stack is not fully synchronized with the latest security protocol standards, leading to these occasional handshake failures.
Addressing your final query, our operations are not currently hampered to the point of standstill—we've managed to maintain functionality despite the frequent errors, contrasting with situations in previous months that necessitated server reboots to restore connectivity with Snowflake. However, my concern leans towards understanding the root cause of this issue, as it stands to reason that this could be a symptom of a more substantial underlying problem.
I'm escalating this issue as it's proving more problematic than initially anticipated. As I mentioned in my previous post, we're encountering an issue where, despite the 'Failed to get the response. Hanging?' error not severing the connection, repeated failures are occurring with our heartbeat checks set at a 1-hour frequency.
The critical concern here is that these consecutive failures are happening frequently enough to invalidate the session after 4 hours. This aligns with Snowflake's authentication token policy, as detailed in their documentation here. Consequently, this leads to an 'Authentication token has expired. The user must authenticate again.' error.
In response to this, we are currently developing our own heartbeat mechanism. However, we view this as a temporary solution rather than a permanent fix, as it introduces significant additional costs that we would prefer to avoid.
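For anyone attempting the same workaround, here is a minimal sketch of such a heartbeat, not the connector's own implementation. With hourly heartbeats and a 4-hour token lifetime, roughly four consecutive failures are enough to lose the session, so the sketch retries each beat a few times and defaults to a shorter interval. The `Heartbeat` class, its parameters, and the `ping` callable are all hypothetical names; it also assumes that any successful query counts as session activity.

```python
import threading
from typing import Callable


class Heartbeat:
    """Call `ping` every `interval_s` seconds on a daemon thread, retrying failures."""

    def __init__(self, ping: Callable[[], None], interval_s: float = 900.0,
                 retries: int = 3, backoff_s: float = 5.0) -> None:
        self._ping = ping              # hypothetical: any callable that touches the session
        self._interval_s = interval_s
        self._retries = retries
        self._backoff_s = backoff_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.failures = 0              # beats that failed even after all retries

    def beat_once(self) -> bool:
        """Try one heartbeat with linear backoff between attempts; True on success."""
        for attempt in range(1, self._retries + 1):
            try:
                self._ping()
                return True
            except Exception:
                # Wait before the next attempt; give up early if stop() was requested.
                if attempt < self._retries and self._stop.wait(self._backoff_s * attempt):
                    break
        self.failures += 1
        return False

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() sets the event.
        while not self._stop.wait(self._interval_s):
            self.beat_once()

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()
```

Usage would look something like `Heartbeat(lambda: conn.cursor().execute("SELECT 1")).start()`, where `conn` is an already-open connection. One caveat: a background thread is subject to the same CPU allocation as everything else in the container, so calling `beat_once()` from inside request handling may turn out to be more reliable than the timer thread.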
We've identified a probable cause for the erratic errors we've been experiencing. As mentioned previously, our Django server is hosted on Cloud Run, where we utilise the cpu-throttling option to manage costs. This setup means CPU resources are allocated primarily during active request processing. Consequently, this could lead to minimal computing power being available during heartbeat intervals.
Our investigation suggests that the transition to version 37 of the cryptography library may be a key factor. It appears that prior to this update, the computational demands for executing the security and cryptographic processes during heartbeats were relatively low. This would explain why the 'Hanging?' error was rarely observed before. The updated version likely introduced more complex processing requirements, making it more challenging for the heartbeats to secure sufficient CPU resources under the cpu-throttling mode. Successful heartbeats in this context may just be instances where they luckily received the limited available CPU power.
Python version
3.10.12
Operating system and processor architecture
Linux-4.4.0-x86_64-with-glibc2.27
Installed packages
What did you do?
What did you expect to see?
Note that it's linked to #1361. I attached a screenshot of one of the countless similar logs I get with machines deployed on Google Cloud Run. You can see a warning SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")) followed by an error 'Failed to get the response. Hanging?'. A few things:
requestId and request_guid each time, in a span of ~15 seconds. This leads to 4 'Hanging?' errors for the same machine, also in a span of ~15 seconds.
<2.8.1, this error almost entirely disappears. It still happens randomly, and very rarely (~2-3 occurrences every month, deployed on at least 50 machines).
After looking online for similar issues, this SysCallError tends to happen because of a security mismatch between the python connector and the machine it tries to connect to. I tried to analyse the level of security on the Snowflake machines, and it looks good. However, when I ping our Snowflake URL, I see quite a few different IPs, so my questions are:
Retry(total=0... . Is it somehow possible to ask for a couple of retries before getting the 'Hanging?' error?
Can you set logging to DEBUG and collect the logs?
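For reference, a minimal sketch of turning on DEBUG output with the standard `logging` module, assuming the connector logs under the `snowflake.connector` logger name. The helper name `enable_connector_debug_logging` and the format string are illustrative choices, not part of the connector.

```python
import logging
import sys


def enable_connector_debug_logging(stream=sys.stderr) -> logging.Logger:
    """Attach a DEBUG-level stream handler to the connector's logger hierarchy."""
    # Assumption: connector modules log under the "snowflake.connector" namespace.
    logger = logging.getLogger("snowflake.connector")
    logger.setLevel(logging.DEBUG)

    handler = logging.StreamHandler(stream)
    handler.setLevel(logging.DEBUG)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - "
        "%(funcName)s() - %(levelname)s - %(message)s"
    ))
    logger.addHandler(handler)
    return logger
```

Calling `enable_connector_debug_logging()` once at startup, before opening the connection, should be enough. On Cloud Run, anything written to stderr is normally picked up by Cloud Logging, so the handshake and retry details around each 'Hanging?' error should show up next to the existing warnings.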