Closed alromos closed 1 month ago
Hi @alromos,
Can we get some more information on what you're doing when the hang happens, preferably a repro app if you're able to consistently reproduce it. Additionally, if you can provide additional logging, that will help us as well. Instructions for logging with the JDBC driver can be found here: https://learn.microsoft.com/en-us/sql/connect/jdbc/tracing-driver-operation?view=sql-server-ver16
Hi @Jeffery-Wasty
Here are some details about our application:
mssql-jdbc
driver is used inside multiple logstash pipelines by logstash-integration-jdbc plugin to read some data from database.
Under the hood the driver is accessed by Sequel: The Database Toolkit for Ruby.
In the case mentioned in the original description com.microsoft.sqlserver.jdbc.SQLServerConnection
is accessed after the following method is called: com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery
by Sequel.
The issue is reproduced sporadically. It make take a day or a week to have it reproduced. Also, it looks like it is reproduced only when ActiveDirectoryMSI authentication method is used by the driver, since I haven't managed to reproduce the issue with a local database (hosted in Docker) but it is constantly reproduced in all non-local environments where the mentioned authentication method is used.
Here is a link to a repository with a simplified setup I tried to use in order to reproduce the issue locally: logstash-input-jdbc-scheduler-issue-test. Hope it may be helpful.
Could you let me know which exact logs (log category) should I enable in order to troubleshoot the issue?
I'm not able to enable ALL logs since it will create a mess in non-local environment, so I need to specify some specific log categories.
At the moment I have only com.microsoft.sqlserver.jdbc.internals.SQLServerConnection
logs at FINEST
level enabled.
As far as I understand from the logs posted in original description, the issue happens inside sendFedAuthToken
method of com.microsoft.sqlserver.jdbc.SQLServerConnection
class between those lines:
if (connectionlogger.isLoggable(Level.FINER)) {
connectionlogger.fine(toString() + " Sending federated authentication token.");
}
and
TDSParser.parse(tdsReader, tdsTokenHandler);
Hi @alromos, having logging set to FINEST
would be good. There should more logs to follow after Sending federated authentication token.
and might hint exactly where it's happening.
@Jeffery-Wasty @tkyc JFYI I've enabled the additional logging you requested and now am waiting for the issue to be reproduced. Will keep you posted.
@Jeffery-Wasty @tkyc
It looks like there are no additional logs after Sending federated authentication token.
event when I set FINEST
log level for the whole com.microsoft.sqlserver.jdbc
package.
The issue has been reproduced 2 times and result is the same. Please see the logs in CSV format attached.
terminated-pipeline-case-1.csv
terminated-pipeline-case-2.csv
UPDATE: Please ignore the information above. I had incorrect Application Insights configuration all this time and only FINER
logs were actually captured. I will repeat the log collection and keep you posted.
Here are the requested logs. Unfortunately the logs are ordered by time in milliseconds, so the actual log order may differ from the one in the file. terminated-pipeline-case-3.xlsx
By looking at the code I have a feeling the line is the last one:
TDSCommand@7b1f60ae (FedAuth): Reading response...
Do you have any ideas of what can be wrong?
Interesting.. right after that Reading response...
log message there's this block of code:
if (isAdaptive) {
tdsReader.readPacket();
} else {
while (tdsReader.readPacket());
}
isAdaptive
should be false, which means it hits that while loop. I'm starting to think that it might not be breaking out of that while loop. I'll need to investigate some more, if that's the case.
I also noticed the token expiration seems a bit off from the logs. Are you using azure functions with the driver by chance? Recently there was a hotfix made to correct the off token expiration when using the driver with azure functions. v11.2.3
should have this fix, but adding msiTokenCacheTtl=0
to your connection is a workaround for the driver version you're using.
I'm wondering if this off token expiration is the thing causing the issue, because readPacket
should definitely terminate as a packet has an EOM
flag, which we read. Have you ever seen a token expiration error? Something like Login failed for user '<token-identified principal>'. Token is expired. ClientConnectionId:2787d61f-5f69-476c-8df7-ce83dbb3ec57
? Really strange if you haven't.
@tkyc
The driver is used by a Spring Boot application and a Logstash instance (Java + JRuby). The logs I posted above belong to the Logstash instance.
I faced the issue with token expiration in Spring Boot application and it was fixed by adding msiTokenCacheTtl=0
as you mentioned. I was a part of this discussion: https://github.com/microsoft/mssql-jdbc/issues/1944
I didn't see the error in the Logstash instance logs. However I added the msiTokenCacheTtl=0
setting to the Logstash instance as well but it didn't help to resolve the connection hanging issue. You may see the following log entry in the file. The entry confirms the property is set to 0:
3/27/2023, 7:08:21.402 PM com.microsoft.sqlserver.jdbc.internals.SQLServerDriver DEBUG Property:msiTokenCacheTtl Value:0
I'll send you a custom 11.2.1.jre17 driver with additional logging to help shed some light on this issue when I'm able to. Will it be possible for you to run the custom driver jar in your environment (doesn't go against any policies etc..)? I want to add some more logging around the code block I shared earlier. Although, I still have some doubts about it being the readPacket
call in the while loop that's causing it (b/c of EOM
termination). I'll keep you posted, issue is still on my radar.
@tkyc I found the following statement in documentation of connection properties:
Query execution hangs and doesn't throw an exception if the TCP connection to the server is silently dropped.
So, I think it may be the case. I'm thinking about trying to set queryTimeout
and cancelQueryTimeout
properties to test this. Could you explain how to set the properties properly and how they are related?
I just wonder why default values of queryTimeout
and cancelQueryTimeout
are to wait indefinitely? Won't such settings hide the hanging connections?
@alromos Good point, I didn't think of the timeout settings.
queryTimeout
+ cancelQueryTimeout
are in seconds eg. jdbc:sqlserver://...;queryTimeout=<seconds>;cancelQueryTimeout=<seconds>;
queryTimeout
is the time in seconds the driver will wait between sending an execute command to the server and the server returning data before cancelling the command.
cancelQueryTimeout
is the time in seconds the driver will wait for an acknowledgement of the queryTimeout
cancellation from the server before forcefully terminating/closing the connection.
So, total time before the driver times out or closes the connection is queryTimeout
+ cancelQueryTimeout
.
I just wonder why default values of queryTimeout and cancelQueryTimeout are to wait indefinitely? Won't such settings hide the hanging connections?
That's legacy behaviour and yes you're right about it potentially causing hangs. cancelQueryTimeout
defaults to infinite because queryTimeout
defaults to infinite. And, since cancelQueryTimeout
results in a disconnected connection (which is pretty disruptive) it was likely better to wait for the server to respond eventually, so a late clean cancel instead of a disconnect.
@alromos Sorry for the long hiatus, just jumping back again on this. Did the queryTimeout and cancelQueryTimeout help?
@tkyc
I was verifying it all this time.
Unfortunately setting just queryTimeout
and both queryTimeput
and cancelQueryTimeout
didn't help to resolve the issue.
Connection is still hanging at some point and I don't see any errors related to the timeout settings.
@tkyc Hi! Are there any updates on this? Did you have a chance to prepare a custom driver version with additional logging in place?
@alromos sorry my attention has been divided, I'll have the driver for you sometime today.
@alromos The additional logging I've added can be found here. I've set the logging level for these logs to INFO
as the higher logging levels are very verbose and we're only looking in a specific area.
@tkyc
Thank you! I've successfully integrated the driver you provided.
I will provide the logs when I manage to reproduce the issue next time.
Special thanks for setting logging level to INFO
:)
@alromos Np, again sorry for the delay. Looking forward to the logs to see if it really hangs in the loop.
@tkyc See the additional logs attached. I've also included logs of a successful execution preceding the hanging one. Looks like you were right and connection hangs in the while loop. terminated-pipeline-case-4.xlsx
@alromos I've added more logging to confirm the EOM
from the packet, could you swap out the driver with the one attached to this comment? If there really is no EOM
to terminate on, I'd expect to see my new logging to be repeated. In the meantime, I'll need to read up on the packet parsing of the driver code.
Also, could you try setting socketTimeout
in your connection string? By default, socketTimeout
is infinite.
After getting the new logs from debug-logging-driver-p2.zip
, could you also provide logs from a run with the 12.2.0 driver with FINEST
logging enabled and also while setting socketTimeout
? I just realized in the initial FINEST
logging that was shared prior seems like it's missing FINE
logs (maybe filtered out?), so for this new FINEST
logs could you make sure FINE
logs are included as well (in addition to the other log levels eg. finer, info...).
@alromos Did the socketTimeout
setting make a difference?
@tkyc I still need some additional time to test all your suggestions. I have only one environment to test this and it may take up to 7-8 days to reproduce the issue. Here is my current plan:
11.2.1-p2
driver + no socketTimeout
- IN PROGRESS11.2.1-p2
driver + socketTimeout
- PENDING12.2.0
driver + socketTimeout
- PENDINGI'm waiting for the issue to reproduce with the first setup since May 4.
BTW, is it possible to have customized version of 12.2.0
driver as well? I'm asking this because verbose (FINEST) driver logging produces big amount of log entries and significantly affects our Azure costs.
Actually, since we have some logging in the endless while loop in 11.2.1-p2
I think it also may produce a lot of logs when the issue happens and will cost us a lot.
@alromos
We think your issue may be fixed by setting socketTimeout
to a non-zero value as that would terminate the read so the connection will drop and trigger the driver to reconnect (idle connection resiliency feature). The latest 12.2.0 driver has the latest fixes please move to testing scenario #3 (12.2.0 driver + socketTimeout) and see if that resolves your issue?
@tkyc I'm attaching logs of the first setup in case you want to investigate the issue further: terminated-pipeline-case-5.xlsx As you may see the logs that you've added are not repeated and the connection hangs after the first header packet read:
ConnectionID:121914 ClientConnectionId: 1dc99b90-a827-46a7-a7ce-7ad5d7dee9a4 github#2093 - Reading packet header...
@lilgreenbird
I'm going to follow your suggestion with 12.2.0
driver version and socketTimeout
setting and let you know if it works. However I'm not going to enable FINEST
logging because of the reasons mentioned in my previous comment.
I only have a question: is there any recommended value for socketTimeout
setting? I don't see any recommendations in the official documentation. It would be nice to have some note about this.
In the meanwhile I'm going to set it to 30000 (30 seconds) as recommended in the issue comment:
socket read timeout during connection creation when a user-assigned managed identity is used
@tkyc @lilgreenbird
I have an update related to 12.2.0
driver version.
It looks like maven central repository does not contain jre17
release of the version:
So, I'm going to test socketTimeout
setting with 12.3.0.jre17-preview
version.
@alromos Thanks for the logs. 30000 should be acceptable for the socketTimeout. Moving forward, drivers labeled jre11 are compatible/tested with JDKs 11+. So, 12.2.0.jre11 would be compatible with JDK 17.
BTW, is it possible to have customized version of 12.2.0 driver as well?
I'll prepare a customized 12.2.0 driver sometime today.
Logging changes can be seen here.
@alromos
In the recent versions we are only release jre8 and jre11 jars. The jre11 jar supports all jre11+ including 17. We accidentally released jre17 and jre20 in the 12.3.0 as well that should not have been there and is not necessary, sorry about the confusion.
@alromos any news? I'm hoping the socketTimeout will help resolve the issue.
@tkyc @lilgreenbird
Thank you for the jre17
label clarification!
Talking about the issue.
Currently I'm running application with 12.3.0.jre17-preview
version of your driver and socketTimeout
setting in test environment. Application have been running with the setup since May 12, I don't see any hangs so far.
Also I tried to enable socketTimeout
setting for our current setup with 11.2.1.jre17
driver version in higher envs. The fix was applied on May 18 and I haven't seen hangs since then. In addition, during the timeline an exception related to the setting was raised:
com.microsoft.sqlserver.jdbc.SQLServerException: Read timed out
However it was thrown in slightly different place I expected to see it. See the stack trace in the attachment: read-timeout-stack-trace.txt
So, it looks like the changes you suggested fixed the issue we faced. I would like to monitor it for some additional time (2-3 weeks) to make sure we won't face the issue again.
Outstanding questions from my side:
12.x.x
driver update the hanging code was completely re-written and now it relies on an external library (azure-identity
)? I faced a missing class issue in 12.3.0.jre17-preview
version because we were including only driver JAR in application classpath and I was forced to include com.azure:azure-identity
JAR and all its dependencies to make everything work.12.x.x
stable release is expected?@alromos
That's right, Managed Identity in 12.x.x
versions of the driver rely on azure-identity now to get the access token for authenticating. So, from 12.x.x
onwards you'll need to explicitly include azure-identity as a dependency.
That should happen after our next preview. Our next preview is on June 7th, but for our next stable release I'll need to double check confirm. I'll let you know when I get the exact date.
EDIT: Stable release will be on end of July
@tkyc Thank you for the answers! I think we may close the issue as resolved. In case of any further issues or concerns I will re-open it.
@tkyc
I would like to re-open the issue.
I've identified a case when setting socketTimeout
is not sufficient for us.
Sometimes I have long-running queries which exceed socketTimeout
value and cause read timeout. As a workaround I increased the setting value to 3 minutes (was 30 seconds). However now the setting does not seems so efficient in identifying "hanging" connections since it requires 3 minutes to do it.
That's why I would like to continue the investigation we started earlier.
I tried to enable additional logging with custom 12.2.0
JAR you provided however it produces too many logs. I cannot afford having so many logs in Azure environment since it will significantly increase our costs. Is there a chance you can revise the logging and keep INFO level for the most important log statements only?
Thank you!
I was talking to the team and we revisited our timeout logic. Right now, how the driver does timeouts doesn't look right to us. It isn't right that our current socketTimeout
logic is affecting both your queries and login. I'm marking this issue as an enhancement to fix this. The general idea for the fix is to have the timeouts during login (where your hang occurs) be respected separately from the timeout that occurs when you run your long queries (so that you won't have to have 3min affecting both login and queries).
For now as a workaround, can you open connections with socketTImeout=3min
for your longer running queries? And, for all other cases have it remain as 30s? Just so you can somewhat efficiently identify hang.
@tkyc Thank you for the update! It sounds promising.
Unfortunately at the moment we are stuck with setting socketTimeout
to 3 minutes for ALL queries.
The queries are used to read data for our reports. Usually they take a small amount of time because we read only the latest data updates. However in some cases we need to update the data structure for the reports and re-read ALL the report data. In this case the query execution takes up to 3 minutes. Since the data amount is constantly growing we will eventually need to increase the socketTimeout
value in future so with the current approach it will keep slowing down identification of hangs.
Having separate timeouts for login and data reads will certainly resolve the issue. With the approach we won't need to care of keeping socketTimeout
value up to date with the data volume (we will just remove the setting).
Hi @alromos,
I believe this issue is resolved with the changes to socketTimeout
for release 12.6.0. With these changes, socketTimeout
is adjusted based on the status of the connection. So if loginTimeout
is reached, as in the case of the above mentioned hanging connections, the connection is closed, regardless of how large socketTimeout
is. But if login is successful, socketTimeout
can still be set to a high value to account for long-running queries. If you're able to, we encourage you to check out the 12.6.1 stable release, and see if these recent changes address your scenario.
I'm closing this issue as, as stated above, I think the recent changes address the preceding issue. We can reopen this if it turns out not to be the case.
Driver version
11.2.1.jre17
SQL Server version
Microsoft SQL Azure (RTM) - 12.0.2000.8 Jan 12 2023 05:25:39 Copyright (C) 2022 Microsoft Corporation
Client Operating System
Ubuntu 20.04.5 LTS
JAVA/JVM version
openjdk version "17.0.5" 2022-10-18 OpenJDK Runtime Environment Temurin-17.0.5+8 (build 17.0.5+8) OpenJDK 64-Bit Server VM Temurin-17.0.5+8 (build 17.0.5+8, mixed mode, sharing)
Problem description
At some point of time any individual SQLServerConnection may hang after logging the following statement:
ConnectionID:{traceID} ClientConnectionId: {clientConnectionId} Sending federated authentication token.
The connections don't fail with any errors and don't recover. Only application restart helps to resolve the issue.Expected behavior
SQLServerConnection should successfully log the following statements and execute a corresponding SQL query.
Actual behavior
SQLServerConnection hangs without any further notice after the following log statement: