microsoft / mssql-jdbc

The Microsoft JDBC Driver for SQL Server is a Type 4 JDBC driver that provides database connectivity with SQL Server through the standard JDBC application program interfaces (APIs).
MIT License

SQLServerConnection hangs after logging the following statement: "Sending federated authentication token." #2093

Closed alromos closed 1 month ago

alromos commented 1 year ago

Driver version

11.2.1.jre17

SQL Server version

Microsoft SQL Azure (RTM) - 12.0.2000.8 Jan 12 2023 05:25:39 Copyright (C) 2022 Microsoft Corporation

Client Operating System

Ubuntu 20.04.5 LTS

JAVA/JVM version

openjdk version "17.0.5" 2022-10-18 OpenJDK Runtime Environment Temurin-17.0.5+8 (build 17.0.5+8) OpenJDK 64-Bit Server VM Temurin-17.0.5+8 (build 17.0.5+8, mixed mode, sharing)

Problem description

At some point in time, any individual SQLServerConnection may hang after logging the following statement: ConnectionID:{traceID} ClientConnectionId: {clientConnectionId} Sending federated authentication token. The connections don't fail with any errors and don't recover; only an application restart resolves the issue.

Expected behavior

SQLServerConnection should successfully log the following statements and execute the corresponding SQL query.

ConnectionID:12180 ClientConnectionId: 9e5151cc-d906-4533-8cca-9ad30cf46deb Sending federated authentication token.
ConnectionID:12180 ClientConnectionId: 9e5151cc-d906-4533-8cca-9ad30cf46deb Received feature extension acknowledgement for Data Classification.
ConnectionID:12180 ClientConnectionId: 9e5151cc-d906-4533-8cca-9ad30cf46deb Received feature extension acknowledgement for UTF8 support.
ConnectionID:12180 ClientConnectionId: 9e5151cc-d906-4533-8cca-9ad30cf46deb Network packet size is 8000 bytes
ConnectionID:12180 ClientConnectionId: 9e5151cc-d906-4533-8cca-9ad30cf46deb Received feature extension acknowledgement for AE.
ConnectionID:12180 ClientConnectionId: 9e5151cc-d906-4533-8cca-9ad30cf46deb Received feature extension acknowledgement for Azure SQL DNS Caching.
ConnectionID:12180 ClientConnectionId: 9e5151cc-d906-4533-8cca-9ad30cf46deb Received feature extension acknowledgement for Idle Connection Resiliency.
ConnectionID:12180 ClientConnectionId: 9e5151cc-d906-4533-8cca-9ad30cf46deb Received feature extension acknowledgement for federated authentication.
ConnectionID:12180 ClientConnectionId: 9e5151cc-d906-4533-8cca-9ad30cf46deb End of connect

Actual behavior

SQLServerConnection hangs without any further notice after the following log statement:

ConnectionID:12217 ClientConnectionId: fb3d38e1-d673-4c95-a552-5be541556a3f Sending federated authentication token.
Jeffery-Wasty commented 1 year ago

Hi @alromos,

Can we get some more information on what you're doing when the hang happens, preferably a repro app if you're able to consistently reproduce it? Additionally, any extra logging you can provide will help us as well. Instructions for logging with the JDBC driver can be found here: https://learn.microsoft.com/en-us/sql/connect/jdbc/tracing-driver-operation?view=sql-server-ver16
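
For reference, the driver logs through java.util.logging, so a minimal programmatic setup is sketched below (the linked page describes the equivalent logging.properties approach; the log file name here is a placeholder):

    import java.util.logging.FileHandler;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import java.util.logging.SimpleFormatter;

    public class JdbcTraceSetup {
        public static void main(String[] args) throws Exception {
            // The driver logs under this package namespace; FINEST captures everything.
            Logger driverLogger = Logger.getLogger("com.microsoft.sqlserver.jdbc");
            driverLogger.setLevel(Level.FINEST);

            // The handler must also allow FINEST through, or records are filtered out.
            FileHandler handler = new FileHandler("jdbc-trace.log"); // placeholder path
            handler.setFormatter(new SimpleFormatter());
            handler.setLevel(Level.FINEST);
            driverLogger.addHandler(handler);
        }
    }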

alromos commented 1 year ago

Hi @Jeffery-Wasty Here are some details about our application: the mssql-jdbc driver is used inside multiple Logstash pipelines by the logstash-integration-jdbc plugin to read data from the database. Under the hood, the driver is accessed by Sequel: The Database Toolkit for Ruby. In the case mentioned in the original description, com.microsoft.sqlserver.jdbc.SQLServerConnection is accessed after Sequel calls com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery.

The issue reproduces sporadically; it may take a day or a week to reproduce it. Also, it looks like it reproduces only when the ActiveDirectoryMSI authentication method is used by the driver: I haven't managed to reproduce the issue with a local database (hosted in Docker), but it is constantly reproduced in all non-local environments where the mentioned authentication method is used.

Here is a link to a repository with a simplified setup I used to try to reproduce the issue locally: logstash-input-jdbc-scheduler-issue-test. I hope it's helpful.

Could you let me know which exact log categories I should enable in order to troubleshoot the issue? I'm not able to enable ALL logs, since that would create a mess in a non-local environment, so I need to specify particular log categories. At the moment I have only com.microsoft.sqlserver.jdbc.internals.SQLServerConnection logs enabled at FINEST level. As far as I understand from the logs posted in the original description, the issue happens inside the sendFedAuthToken method of the com.microsoft.sqlserver.jdbc.SQLServerConnection class, between these lines:

if (connectionlogger.isLoggable(Level.FINER)) {
    connectionlogger.fine(toString() + " Sending federated authentication token.");
}

and

TDSParser.parse(tdsReader, tdsTokenHandler);
tkyc commented 1 year ago

Hi @alromos, having logging set to FINEST would be good. There should be more logs following Sending federated authentication token., and they might hint at exactly where it's happening.

alromos commented 1 year ago

@Jeffery-Wasty @tkyc JFYI I've enabled the additional logging you requested and am now waiting for the issue to reproduce. Will keep you posted.

alromos commented 1 year ago

@Jeffery-Wasty @tkyc It looks like there are no additional logs after Sending federated authentication token., even when I set the FINEST log level for the whole com.microsoft.sqlserver.jdbc package. The issue has been reproduced 2 times and the result is the same. Please see the attached logs in CSV format. terminated-pipeline-case-1.csv terminated-pipeline-case-2.csv

UPDATE: Please ignore the information above. I had an incorrect Application Insights configuration all this time, and only FINER logs were actually captured. I will repeat the log collection and keep you posted.

alromos commented 1 year ago

Here are the requested logs. Unfortunately, the logs are ordered by timestamps with millisecond precision, so the actual log order may differ from the one in the file. terminated-pipeline-case-3.xlsx

Looking at the code, I have a feeling this is the last line reached: TDSCommand@7b1f60ae (FedAuth): Reading response...

Do you have any ideas about what could be wrong?

tkyc commented 1 year ago

Interesting... right after that Reading response... log message, there's this block of code:

            if (isAdaptive) {
                tdsReader.readPacket();
            } else {
                while (tdsReader.readPacket());
            }

isAdaptive should be false, which means it hits that while loop. I'm starting to think it might not be breaking out of that while loop. I'll need to investigate some more if that's the case.
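
To illustrate the concern (a hypothetical sketch, not the driver's actual parsing code): a read loop that terminates only on an end-of-message (EOM) status flag will block forever inside a socket read if the server silently drops the TCP connection and no socket timeout is configured.

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Hypothetical sketch of an EOM-terminated packet read loop.
    final class PacketReader {
        private static final int EOM_FLAG = 0x01; // hypothetical status bit

        private final DataInputStream in;

        PacketReader(InputStream stream) {
            this.in = new DataInputStream(stream);
        }

        // Returns false once the packet carrying the EOM flag has been read,
        // so callers can drain a response with: while (reader.readPacket());
        boolean readPacket() throws IOException {
            byte[] header = new byte[8];
            in.readFully(header); // blocking read: hangs here on a dead connection
            int payloadLength = Math.max(0, (((header[2] & 0xFF) << 8) | (header[3] & 0xFF)) - 8);
            in.readFully(new byte[payloadLength]); // drain the payload
            return (header[1] & EOM_FLAG) == 0;
        }
    }

Note the two failure modes would look different in the logs: a loop that genuinely never breaks out would keep emitting per-packet logging, while a silently dropped connection would block inside the first readFully with no further output.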

tkyc commented 1 year ago

I also noticed the token expiration seems a bit off in the logs. Are you using Azure Functions with the driver, by chance? Recently a hotfix was made to correct the incorrect token expiration when using the driver with Azure Functions. v11.2.3 should have this fix, but adding msiTokenCacheTtl=0 to your connection string is a workaround for the driver version you're using.
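
For illustration, the workaround is applied on the connection URL; a minimal sketch (server and database names are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class MsiTokenWorkaround {
        public static void main(String[] args) throws SQLException {
            // msiTokenCacheTtl=0 disables the driver-side MSI token cache, so a
            // fresh token is acquired instead of a cached one with a bad expiry.
            String url = "jdbc:sqlserver://myserver.database.windows.net:1433;" // placeholder host
                    + "database=mydb;"                                          // placeholder database
                    + "authentication=ActiveDirectoryMSI;"
                    + "msiTokenCacheTtl=0;";
            try (Connection conn = DriverManager.getConnection(url)) {
                System.out.println("Connected: " + conn.getMetaData().getURL());
            }
        }
    }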

I'm wondering if this incorrect token expiration is causing the issue, because readPacket should definitely terminate: a packet has an EOM flag, which we read. Have you ever seen a token expiration error? Something like Login failed for user '<token-identified principal>'. Token is expired. ClientConnectionId:2787d61f-5f69-476c-8df7-ce83dbb3ec57? It's really strange if you haven't.

alromos commented 1 year ago

@tkyc The driver is used by a Spring Boot application and a Logstash instance (Java + JRuby). The logs I posted above belong to the Logstash instance. I faced the token expiration issue in the Spring Boot application, and it was fixed by adding msiTokenCacheTtl=0 as you mentioned. I was a part of this discussion: https://github.com/microsoft/mssql-jdbc/issues/1944

I didn't see the error in the Logstash instance logs. However, I added the msiTokenCacheTtl=0 setting to the Logstash instance as well, but it didn't help resolve the connection hanging issue. You may see the following log entry in the file; it confirms the property is set to 0:

3/27/2023, 7:08:21.402 PM   com.microsoft.sqlserver.jdbc.internals.SQLServerDriver  DEBUG   Property:msiTokenCacheTtl Value:0
tkyc commented 1 year ago

I'll send you a custom 11.2.1.jre17 driver with additional logging to help shed some light on this issue when I'm able to. Will it be possible for you to run the custom driver jar in your environment (i.e., it doesn't go against any policies, etc.)? I want to add some more logging around the code block I shared earlier, although I still have some doubts about it being the readPacket call in the while loop that's causing it (because of EOM termination). I'll keep you posted; the issue is still on my radar.

alromos commented 1 year ago

@tkyc I found the following statement in the documentation of connection properties:

Query execution hangs and doesn't throw an exception if the TCP connection to the server is silently dropped.

So, I think that may be the case here. I'm thinking about setting the queryTimeout and cancelQueryTimeout properties to test this. Could you explain how to set the properties properly and how they are related? I also wonder why the default values of queryTimeout and cancelQueryTimeout are to wait indefinitely. Won't such settings hide hanging connections?

tkyc commented 1 year ago

@alromos Good point, I didn't think of the timeout settings.

queryTimeout + cancelQueryTimeout are in seconds, e.g. jdbc:sqlserver://...;queryTimeout=<seconds>;cancelQueryTimeout=<seconds>;

queryTimeout is the time in seconds the driver will wait between sending an execute command to the server and the server returning data before cancelling the command.

cancelQueryTimeout is the time in seconds the driver will wait for an acknowledgement of the queryTimeout cancellation from the server before forcefully terminating/closing the connection.

So, total time before the driver times out or closes the connection is queryTimeout + cancelQueryTimeout.
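
Put together, a minimal sketch using the data source API (placeholder server and database; both values in seconds):

    import com.microsoft.sqlserver.jdbc.SQLServerDataSource;

    public class QueryTimeoutExample {
        public static void main(String[] args) {
            SQLServerDataSource ds = new SQLServerDataSource();
            ds.setServerName("myserver.database.windows.net"); // placeholder
            ds.setDatabaseName("mydb");                        // placeholder
            ds.setQueryTimeout(30);       // seconds to wait for results before cancelling
            ds.setCancelQueryTimeout(10); // seconds to wait for the cancel acknowledgement
            // Worst case before the driver cancels or closes the connection: 30 + 10 = 40s.
        }
    }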

I just wonder why default values of queryTimeout and cancelQueryTimeout are to wait indefinitely? Won't such settings hide the hanging connections?

That's legacy behaviour, and yes, you're right about it potentially causing hangs. cancelQueryTimeout defaults to infinite because queryTimeout defaults to infinite. And since a cancelQueryTimeout expiry results in a disconnected connection (which is pretty disruptive), it was likely considered better to wait for the server to respond eventually: a late clean cancel instead of a disconnect.

tkyc commented 1 year ago

@alromos Sorry for the long hiatus, just jumping back again on this. Did the queryTimeout and cancelQueryTimeout help?

alromos commented 1 year ago

@tkyc I was verifying this all this time. Unfortunately, setting just queryTimeout, and then both queryTimeout and cancelQueryTimeout, didn't help resolve the issue. The connection still hangs at some point, and I don't see any errors related to the timeout settings.

alromos commented 1 year ago

@tkyc Hi! Are there any updates on this? Did you have a chance to prepare a custom driver version with additional logging in place?

tkyc commented 1 year ago

@alromos sorry, my attention has been divided; I'll have the driver for you sometime today.

tkyc commented 1 year ago

@alromos The additional logging I've added can be found here. I've set the logging level for these logs to INFO, as the higher logging levels are very verbose and we're only looking at a specific area.

debug-logging-driver.zip

alromos commented 1 year ago

@tkyc Thank you! I've successfully integrated the driver you provided. I will provide the logs the next time I manage to reproduce the issue. Special thanks for setting the logging level to INFO :)

tkyc commented 1 year ago

@alromos Np, again sorry for the delay. Looking forward to the logs to see if it really hangs in the loop.

alromos commented 1 year ago

@tkyc See the additional logs attached. I've also included logs of a successful execution preceding the hanging one. Looks like you were right and the connection hangs in the while loop. terminated-pipeline-case-4.xlsx

tkyc commented 1 year ago

@alromos I've added more logging to confirm the EOM from the packet; could you swap out the driver for the one attached to this comment? If there really is no EOM to terminate on, I'd expect to see my new logging repeated. In the meantime, I'll need to read up on the packet parsing in the driver code.

Also, could you try setting socketTimeout in your connection string? By default, socketTimeout is infinite.
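
For clarity, socketTimeout is specified in milliseconds, unlike queryTimeout and cancelQueryTimeout above, which are in seconds. A sketch with placeholder connection details:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class SocketTimeoutExample {
        public static void main(String[] args) throws SQLException {
            // socketTimeout=30000 makes any socket read that stalls for more than
            // 30 seconds fail, instead of blocking indefinitely (the default).
            String url = "jdbc:sqlserver://myserver.database.windows.net:1433;" // placeholder host
                    + "database=mydb;"                                          // placeholder database
                    + "socketTimeout=30000;";
            try (Connection conn = DriverManager.getConnection(url)) {
                System.out.println("Connected");
            }
        }
    }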

debug-logging-driver-p2.zip

tkyc commented 1 year ago

After getting the new logs from debug-logging-driver-p2.zip, could you also provide logs from a run with the 12.2.0 driver, with FINEST logging enabled and socketTimeout set? I just realized the initial FINEST logging shared earlier seems to be missing FINE logs (maybe filtered out?), so for these new FINEST logs could you make sure FINE logs are included as well (in addition to the other log levels, e.g. FINER, INFO...)?

tkyc commented 1 year ago

@alromos Did the socketTimeout setting make a difference?

alromos commented 1 year ago

@tkyc I still need some additional time to test all your suggestions. I have only one environment in which to test this, and it may take up to 7-8 days to reproduce the issue. Here is my current plan:

  1. 11.2.1-p2 driver + no socketTimeout - IN PROGRESS
  2. 11.2.1-p2 driver + socketTimeout - PENDING
  3. 12.2.0 driver + socketTimeout - PENDING

I've been waiting for the issue to reproduce with the first setup since May 4.

BTW, is it possible to have a customized version of the 12.2.0 driver as well? I'm asking because verbose (FINEST) driver logging produces a large number of log entries and significantly affects our Azure costs. Also, since we have some logging in the endless while loop in 11.2.1-p2, I think it may produce a lot of logs when the issue happens, which will cost us a lot.

lilgreenbird commented 1 year ago

@alromos

We think your issue may be fixed by setting socketTimeout to a non-zero value, as that would terminate the read, so the connection will drop and trigger the driver to reconnect (the idle connection resiliency feature). The latest 12.2.0 driver has the latest fixes; please move to testing scenario #3 (12.2.0 driver + socketTimeout) and see if that resolves your issue.

alromos commented 1 year ago

@tkyc I'm attaching logs of the first setup in case you want to investigate the issue further: terminated-pipeline-case-5.xlsx As you may see, the log statements you added are not repeated, and the connection hangs after the first header packet read:

ConnectionID:121914 ClientConnectionId: 1dc99b90-a827-46a7-a7ce-7ad5d7dee9a4 github#2093 - Reading packet header...

@lilgreenbird I'm going to follow your suggestion with the 12.2.0 driver version and the socketTimeout setting and let you know if it works. However, I'm not going to enable FINEST logging, for the reasons mentioned in my previous comment. I only have one question: is there any recommended value for the socketTimeout setting? I don't see any recommendations in the official documentation; it would be nice to have a note about this. In the meantime, I'm going to set it to 30000 (30 seconds), as recommended in this issue comment: socket read timeout during connection creation when a user-assigned managed identity is used

alromos commented 1 year ago

@tkyc @lilgreenbird I have an update related to the 12.2.0 driver version. It looks like the Maven Central repository does not contain a jre17 release of that version. So, I'm going to test the socketTimeout setting with the 12.3.0.jre17-preview version.

tkyc commented 1 year ago

@alromos Thanks for the logs. 30000 should be acceptable for socketTimeout. Moving forward, drivers labeled jre11 are compatible with, and tested against, JDK 11+. So, 12.2.0.jre11 is compatible with JDK 17.

BTW, is it possible to have customized version of 12.2.0 driver as well?

I'll prepare a customized 12.2.0 driver sometime today.

Logging changes can be seen here.

v12.2.0-custom-logging.zip

lilgreenbird commented 1 year ago

@alromos

In recent versions we only release jre8 and jre11 jars. The jre11 jar supports all JREs 11+, including 17. We accidentally released jre17 and jre20 jars in 12.3.0 as well; those should not have been there and are not necessary. Sorry about the confusion.

tkyc commented 1 year ago

@alromos any news? I'm hoping the socketTimeout will help resolve the issue.

alromos commented 1 year ago

@tkyc @lilgreenbird Thank you for the jre17 label clarification!

Talking about the issue: I'm currently running the application with the 12.3.0.jre17-preview version of your driver and the socketTimeout setting in the test environment. The application has been running with this setup since May 12, and I don't see any hangs so far. I also tried enabling the socketTimeout setting for our current setup with the 11.2.1.jre17 driver version in higher environments. The fix was applied on May 18, and I haven't seen hangs since then. In addition, during this period an exception related to the setting was raised:

com.microsoft.sqlserver.jdbc.SQLServerException: Read timed out

However, it was thrown in a slightly different place than I expected. See the stack trace in the attachment: read-timeout-stack-trace.txt

So, it looks like the changes you suggested fixed the issue we faced. I would like to monitor it for some additional time (2-3 weeks) to make sure we won't face the issue again.

Outstanding questions from my side:

  1. Is it correct that, as part of the 12.x.x driver update, the hanging code path was completely rewritten and now relies on an external library (azure-identity)? I faced a missing-class issue with the 12.3.0.jre17-preview version because we were including only the driver JAR on the application classpath, and I was forced to include the com.azure:azure-identity JAR and all its dependencies to make everything work.
  2. When is the next 12.x.x stable release expected?
tkyc commented 1 year ago

@alromos

  1. That's right; Managed Identity in 12.x.x versions of the driver now relies on azure-identity to get the access token for authentication. So, from 12.x.x onwards you'll need to explicitly include azure-identity as a dependency.

  2. That should happen after our next preview. Our next preview is on June 7th, but for our next stable release I'll need to double-check to confirm. I'll let you know when I get the exact date.

EDIT: The stable release will be at the end of July.

alromos commented 1 year ago

@tkyc Thank you for the answers! I think we may close the issue as resolved. In case of any further issues or concerns I will re-open it.

alromos commented 11 months ago

@tkyc I would like to re-open the issue. I've identified a case where setting socketTimeout is not sufficient for us. Sometimes I have long-running queries which exceed the socketTimeout value and cause a read timeout. As a workaround, I increased the setting value to 3 minutes (it was 30 seconds). However, the setting now seems less effective at identifying "hanging" connections, since it takes 3 minutes to do so. That's why I would like to continue the investigation we started earlier.

I tried to enable additional logging with the custom 12.2.0 JAR you provided; however, it produces too many logs. I cannot afford having so many logs in the Azure environment, since it will significantly increase our costs. Is there a chance you can revise the logging and keep the INFO level for only the most important log statements? Thank you!

tkyc commented 11 months ago

I was talking to the team, and we revisited our timeout logic. Right now, how the driver does timeouts doesn't look right to us: it isn't right that our current socketTimeout logic affects both your queries and login. I'm marking this issue as an enhancement to fix this. The general idea for the fix is to have the timeout during login (where your hang occurs) be respected separately from the timeout that applies when you run your long queries (so you won't have 3 minutes affecting both login and queries).

For now, as a workaround, can you open connections with socketTimeout=3min for your longer-running queries, and for all other cases have it remain at 30s? Just so you can somewhat efficiently identify hangs.
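
A sketch of what that split could look like, using two separately configured data sources (names and values are placeholders):

    import com.microsoft.sqlserver.jdbc.SQLServerDataSource;

    public class SplitSocketTimeouts {
        // Default connections: fail fast so hung connections surface within ~30s.
        static SQLServerDataSource defaultDataSource() {
            SQLServerDataSource ds = new SQLServerDataSource();
            ds.setServerName("myserver.database.windows.net"); // placeholder
            ds.setDatabaseName("mydb");                        // placeholder
            ds.setSocketTimeout(30_000);                       // milliseconds
            return ds;
        }

        // Reporting connections: allow the long full re-reads up to 3 minutes.
        static SQLServerDataSource longQueryDataSource() {
            SQLServerDataSource ds = new SQLServerDataSource();
            ds.setServerName("myserver.database.windows.net"); // placeholder
            ds.setDatabaseName("mydb");                        // placeholder
            ds.setSocketTimeout(180_000);                      // milliseconds
            return ds;
        }
    }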

alromos commented 11 months ago

@tkyc Thank you for the update! It sounds promising.

Unfortunately, at the moment we are stuck with setting socketTimeout to 3 minutes for ALL queries. The queries are used to read data for our reports. Usually they take a small amount of time, because we read only the latest data updates. However, in some cases we need to update the data structure for the reports and re-read ALL the report data; in that case a query execution takes up to 3 minutes. Since the data volume is constantly growing, we will eventually need to increase the socketTimeout value, so with the current approach it will keep slowing down the identification of hangs.

Having separate timeouts for login and data reads would certainly resolve the issue. With that approach we won't need to worry about keeping the socketTimeout value in line with the data volume (we will just remove the setting).

Jeffery-Wasty commented 2 months ago

Hi @alromos,

I believe this issue is resolved with the changes to socketTimeout in release 12.6.0. With these changes, socketTimeout is adjusted based on the status of the connection: if loginTimeout is reached, as in the case of the hanging connections mentioned above, the connection is closed, regardless of how large socketTimeout is; but if login is successful, socketTimeout can still be set to a high value to account for long-running queries. If you're able to, we encourage you to check out the 12.6.1 stable release and see whether these recent changes address your scenario.
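
Under that model, a configuration like the following sketch (placeholder values) bounds login separately from long-running reads:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class SeparateLoginTimeout {
        public static void main(String[] args) throws SQLException {
            // loginTimeout is in seconds and bounds the connect/login phase, so a
            // hung login fails quickly even though socketTimeout (milliseconds)
            // stays large enough for 3-minute reporting queries.
            String url = "jdbc:sqlserver://myserver.database.windows.net:1433;" // placeholder host
                    + "database=mydb;"                                          // placeholder database
                    + "loginTimeout=30;"
                    + "socketTimeout=180000;";
            try (Connection conn = DriverManager.getConnection(url)) {
                System.out.println("Connected");
            }
        }
    }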

Jeffery-Wasty commented 2 months ago

I'm closing this issue since, as stated above, I think the recent changes address it. We can reopen it if that turns out not to be the case.