rails-sqlserver / tiny_tds

TinyTDS - Simple and fast FreeTDS bindings for Ruby using DB-Library.

Looping on "Adaptive Server connection timed out" (Azure SQL) #138

Closed augustj closed 10 years ago

augustj commented 10 years ago

Running FreeTDS 0.91 with the latest tiny_tds on Azure SQL. Normal operation is perfect. However, when the app has been idle (~5-10 minutes), the database connection goes bad. The db service sends back => 20003 "Adaptive Server connection timed out". Then FreeTDS/TinyTDS go into a loop with the server, responding with INT_CONTINUE; the server responds again with the 20003 error. They go back and forth until the unicorn master is fed up and reaps the process.

I have no timeouts set in the freetds.conf file. It is empty except to activate the logging. I am running on Ubuntu 12.04, Rails 3.2.
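For reference, the client-side knobs live on TinyTds::Client itself. A minimal sketch (connection details below are made up, not from this setup) showing the two timeouts that matter here - :login_timeout for the handshake and :timeout, the per-query wait that expires with error 20003:

require 'tiny_tds'

# All values below are illustrative.
client = TinyTds::Client.new(
  username:      'rails_user',
  password:      'secret',
  host:          'example.database.windows.net',
  database:      'rails_production',
  azure:         true,   # flag for Azure SQL logins
  login_timeout: 10,     # seconds allowed for the TDS login handshake
  timeout:       10      # seconds each query may wait before FreeTDS reports 20003
)

result = client.execute('SELECT 1 AS alive')
result.each { |row| puts row.inspect }
client.close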

Here is the FreeTDS log for a failed attempt:

21:17:27.063192 48918 (dblib.c:4880):dbdead(0x2e5ecb0) [alive]
21:17:27.072366 48918 (dblib.c:1312):dbcmd(0x2e5ecb0, EXEC sp_executesql N'SELECT TOP (1) [users].* FROM [users] WHERE [users].[id] = @0', N'@0 int', @0 = 1)
21:17:27.072434 48918 (dblib.c:1319):dbcmd() bufsz = 148
21:17:27.072464 48918 (dblib.c:5882):dbfreebuf(0x2e5ecb0)
21:17:27.072512 48918 (dblib.c:6862):dbsqlsend(0x2e5ecb0)
21:17:27.072540 48918 (mem.c:615):tds_free_all_results()
21:17:27.072567 48918 (util.c:156):Changed query state from IDLE to QUERYING
21:17:27.072597 48918 (write.c:140):tds_put_string converting 102 bytes of "EXEC sp_executesql N'SELECT TOP (1) [users].* FROM [users] WHERE [users].[id] = @0', N'@0 int', @0 = 1"
21:17:27.072882 48918 (write.c:168):tds_put_string wrote 204 bytes
21:17:27.072911 48918 (util.c:156):Changed query state from QUERYING to PENDING
21:17:27.072938 48918 (net.c:741):Sending packet
21:17:27.073066 48918 (net.c:1240):in tds_push_func
21:17:27.073208 48918 (dblib.c:5780):dbgetuserdata(0x2e5ecb0)
21:17:27.073251 48918 (dblib.c:5780):dbgetuserdata(0x2e5ecb0)
21:17:27.073278 48918 (dblib.c:5780):dbgetuserdata(0x2e5ecb0)
21:17:27.073304 48918 (dblib.c:5780):dbgetuserdata(0x2e5ecb0)
21:17:27.073331 48918 (dblib.c:4639):dbsqlok(0x2e5ecb0)
21:17:27.073357 48918 (dblib.c:4669):dbsqlok() not done, calling tds_process_tokens()
21:17:27.073382 48918 (token.c:540):tds_process_tokens(0x5794180, 0x7fff43cd8bc8, 0x7fff43cd8bcc, 0x6914)
21:17:27.073446 48918 (util.c:156):Changed query state from PENDING to READING
21:17:27.073476 48918 (net.c:1199):in tds_pull_func
21:17:37.129602 48918 (util.c:331):tdserror(0x46232c0, 0x5794180, 20003, 0)
21:17:37.129670 48918 (dblib.c:7929):dbperror(0x2e5ecb0, 20003, 0)
21:17:37.129689 48918 (dblib.c:7981):20003: "Adaptive Server connection timed out"
21:17:37.129697 48918 (dblib.c:5780):dbgetuserdata(0x2e5ecb0)
21:17:37.129718 48918 (dblib.c:4880):dbdead(0x2e5ecb0) [alive]
21:17:37.129726 48918 (dblib.c:3196):dbcancel(0x2e5ecb0)
21:17:37.129733 48918 (query.c:2155):tds_send_cancel: not in_cancel and not idle
21:17:37.129740 48918 (query.c:2164):tds_send_cancel: sending cancel packet
21:17:37.129747 48918 (net.c:741):Sending packet
21:17:37.129790 48918 (net.c:1240):in tds_push_func
21:17:37.129813 48918 (dblib.c:8002):"Adaptive Server connection timed out", client returns 1 (INT_CONTINUE)
21:17:37.129834 48918 (util.c:361):tdserror: client library returned TDS_INT_CONTINUE(1)
21:17:37.129842 48918 (util.c:384):tdserror: returning TDS_INT_CONTINUE(1)
21:17:47.076371 48918 (util.c:331):tdserror(0x46232c0, 0x5794180, 20003, 4)
21:17:47.076413 48918 (dblib.c:7929):dbperror(0x2e5ecb0, 20003, 4)
21:17:47.076427 48918 (dblib.c:7981):20003: "Adaptive Server connection timed out"
21:17:47.076437 48918 (dblib.c:5780):dbgetuserdata(0x2e5ecb0)
21:17:47.076446 48918 (dblib.c:8002):"Adaptive Server connection timed out", client returns 1 (INT_CONTINUE)
21:17:47.076455 48918 (util.c:361):tdserror: client library returned TDS_INT_CONTINUE(1)
21:17:47.076463 48918 (util.c:384):tdserror: returning TDS_INT_CONTINUE(1)
21:17:47.076489 48918 (dblib.c:5780):dbgetuserdata(0x2e5ecb0)
21:17:47.076501 48918 (dblib.c:3196):dbcancel(0x2e5ecb0)
21:17:47.076510 48918 (query.c:2155):tds_send_cancel: in_cancel and not idle
21:17:47.180191 48918 (dblib.c:5780):dbgetuserdata(0x2e5ecb0)
21:17:47.180249 48918 (dblib.c:3196):dbcancel(0x2e5ecb0)
21:17:47.180263 48918 (query.c:2155):tds_send_cancel: in_cancel and not idle
21:17:57.126580 48918 (util.c:331):tdserror(0x46232c0, 0x5794180, 20003, 4)
21:17:57.126689 48918 (dblib.c:7929):dbperror(0x2e5ecb0, 20003, 4)
21:17:57.126726 48918 (dblib.c:7981):20003: "Adaptive Server connection timed out"
21:17:57.126760 48918 (dblib.c:5780):dbgetuserdata(0x2e5ecb0)
21:17:57.126793 48918 (dblib.c:8002):"Adaptive Server connection timed out", client returns 1 (INT_CONTINUE)
21:17:57.126826 48918 (util.c:361):tdserror: client library returned TDS_INT_CONTINUE(1)
21:17:57.126985 48918 (util.c:384):tdserror: returning TDS_INT_CONTINUE(1)

metaskills commented 10 years ago

Have you tried the latest master branch?

augustj commented 10 years ago

Yes - I have tried against master, as well as the fork with the pull request at https://github.com/veracross/tiny_tds from wbond that fixes some timeout issues.


metaskills commented 10 years ago

Yes - I have tried against master. As well as the fork w/ pull request on...

Did you mean the work under #136, which I have not yet merged? Just wanted to be sure. If so, that was my next recommendation. I have heard about this on the FreeTDS list before too; could this be an issue with them? I have also heard about it on standard SQL Server over a general network or VPN. Those discussions went towards keep-alive settings. Maybe give this issue a read over and let me know what you think?

https://github.com/rails-sqlserver/tiny_tds/issues/5

augustj commented 10 years ago

Yes, I have tried that (#136) as well. Unfortunately, the connection is also closed during active use - not just during inactivity. I will look into the FreeTDS list and see if I find anything. ... I didn't see anything in the FreeTDS archives, so I posted to them. I'll let you know what I hear.

sontek commented 10 years ago

@augustj Do you have a small test case that could easily reproduce this? I'm seeing this issue in prod as well.

augustj commented 10 years ago

As far as a test case goes, I was never able to get a reproducible one - we ended up forking tiny_tds and putting in some hardcoded behavior to stop retrying on timeouts. We then forked activerecord-retry to look for that specific error and retry. It was a terrible solution - one that left a bad taste in my mouth. That project was put on hold for other reasons, so I never pursued it. My conclusion was that it wasn't necessarily a tiny_tds issue, but wackiness in Azure SQL. Sorry I don't have better news.
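For anyone curious what that kind of hack looks like, here is a rough illustration only (not the actual fork; with_azure_retry is a made-up helper): rescue the timeout error, reconnect, and retry a bounded number of times.

# Illustrative only -- not the fork mentioned above.
def with_azure_retry(attempts = 3)
  tries = 0
  begin
    yield
  rescue ActiveRecord::StatementInvalid, TinyTds::Error => e
    raise unless e.message =~ /Adaptive Server connection timed out/
    tries += 1
    raise if tries >= attempts
    ActiveRecord::Base.connection.reconnect!  # drop the dead socket, open a fresh one
    retry
  end
end

# with_azure_retry { User.first }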

I know people are using this gem in Azure - I wonder if they are using 'real' SQL Server instances instead of Azure SQL.

metaskills commented 10 years ago

@augustj Sorry to hear about that. If it is something in TinyTDS, would love to get it fixed.

augustj commented 10 years ago

I am not sure it is a tiny_tds issue. If I remember correctly, Azure SQL reports a timeout - but it is effectively a dead connection.

sontek commented 10 years ago

I've dug into this a little today. I don't believe it's any problem with Azure SQL, but for some reason, a dead connection from any SQL Server is reported as a timeout. You need to actually check DEADPROC in the err_handler and then return INT_EXIT instead.

augustj commented 10 years ago

That makes sense - although I confess that my knowledge of SQL Server connections is scant. With that extra piece of info about DEADPROC, it seems like there could be a concrete test case - and a solution. From my reading of the MS docs about Azure SQL, they say that it is basically flaky and could cut out at any time - and they provide code/solutions in .NET to make catching/retrying easy.


augustj commented 10 years ago

Our other issue was that a connection from the pool would only appear dead once we tried to use it. That was why we ended up using the activerecord-retry gem. If there were a way to check the connection on checkout from the pool, that would help. I know that is often frowned upon, but Azure SQL connections die so often that they should be checked every time.
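A minimal sketch of that checkout idea (assuming Rails 4 pool internals and Ruby 2.1+; VerifiedCheckout is a made-up module and this is untested): wrap ConnectionPool#checkout and call the adapter's verify!, which reconnects unless active? reports the connection as alive. The caveat, noted further down this thread, is that active? can keep returning true for a dead TDS socket, so a real round-trip query may be needed instead.

# Illustrative sketch -- verify each connection as it leaves the pool.
module VerifiedCheckout
  def checkout(*args)
    conn = super
    conn.verify!   # reconnect! unless the adapter believes the connection is active
    conn
  end
end

ActiveRecord::ConnectionAdapters::ConnectionPool.prepend(VerifiedCheckout)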


Altonymous commented 10 years ago

We also seem to be suffering from this. Any updates?

Altonymous commented 10 years ago

@augustj, any chance you can post what you did? We are still suffering from this and struggling to find a solution.

augustj commented 10 years ago

@Altonymous - my two hacks still did not result in a usable system. The project was shelved for other reasons. However, if the project were to continue, we would have needed to find a real solution. My hacks only improved the experience from unusable to merely excruciating. I'd recommend pursuing @sontek's hypothesis.

Altonymous commented 10 years ago

I checked out a copy of the activerecord-sqlserver-adapter, and it looks like it's getting to SQLServerAdapter#active?; when it calls @connection.active?, that always returns true, even when the connection is not active.
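One way around that (hypothetical monkey patch, untested, assuming the dblib/TinyTDS mode where @connection is the raw TinyTds::Client) would be to make active? do a real round-trip instead of trusting the client's liveness flag:

# Illustrative only -- force active? to hit the server.
module StrictActiveCheck
  def active?
    return false unless @connection
    @connection.execute('SELECT 1').do   # raises TinyTds::Error on a dead socket
    true
  rescue TinyTds::Error
    false
  end
end

ActiveRecord::ConnectionAdapters::SQLServerAdapter.prepend(StrictActiveCheck)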

I have to run my kid to soccer practice, otherwise I'd try to dig into the tiny_tds gem to see why that is. Perhaps someone more experienced with tiny_tds would be able to take it from here. If not, I'll be back at it after soccer practice.

Altonymous commented 10 years ago

For anyone else having this problem who just needs a work-around for the time being, I created an initializer to keep the connection alive...

# db_keep_alive.rb
# Background thread: check out a pool connection and run a trivial query every
# 30 seconds so the Azure SQL connection never sits idle long enough to drop.
thr = Thread.new {
  loop do
    ActiveRecord::Base.connection_pool.with_connection do |conn|
      User.first
    end

    sleep 30
  end
}

# Stop the keep-alive thread when the process exits.
at_exit do
  thr.kill
end

It's ugly, but it works.

pd40 commented 10 years ago

I am seeing this issue with Azure SQL + FreeTDS 7.1 + tiny_tds 0.6.2 (built using master on April 23 2014), which includes #136, running on an Azure-hosted SUSE instance. To reproduce, just leave an Azure SQL connection idle for a few minutes using ActiveRecord, then make a call.

I see the same messages @augustj reported; the debug messages in freetds.log show 20003: "Adaptive Server connection timed out".

@sontek's suggestion about returning INT_EXIT on SYBETIME (20003) is probably a good one. At least, it might be the only option for Azure SQL. Could SYBETIME be recovered from with a normal non-Azure connection?

#define SYBEFCON        20002   /* SQL Server connection failed. */
#define SYBETIME        20003   /* SQL Server connection timed out. */

pd40 commented 10 years ago

Taking out the logic for handling SYBETIME in the query-timeout case and forcing INT_EXIT in client.c improves the behavior for this Azure SQL case.

case SYBETIME: {
  // Existing userdata->continue logic taken out
  return_value = INT_EXIT;
  break;
}
This changes the behavior to the following in freetds.log:

dblib.c:4669:dbsqlok() not done, calling tds_process_tokens()
token.c:540:tds_process_tokens(0x5a39ee0, 0x7fc8086ba144, 0x7fc8086ba140, 0x6914)
util.c:156:Changed query state from PENDING to READING
net.c:1199:in tds_pull_func
util.c:331:tdserror(0x4b67d50, 0x5a39ee0, 20003, 0)
dblib.c:7929:dbperror(0x45806c0, 20003, 0)
dblib.c:7981:20003: "Adaptive Server connection timed out"
dblib.c:5780:dbgetuserdata(0x45806c0)
dblib.c:8002:"Adaptive Server connection timed out", client returns 0 (INT_EXIT)
dblib.c:8050:FreeTDS: db-lib: exiting because client error handler returned INT_EXIT for msgno 20003

But there is still a timeout delay before ActiveRecord connects. It is a much shorter delay than 5-10 minutes, but still not great. If we knew the connection was always going to do this on Azure SQL, then it might be a valid change for the Azure SQL case.

pd40 commented 10 years ago

Created pull request #158 with a change in behavior for SYBETIME (20003) if the connection uses Azure SQL. I tested this using an existing Rails Azure SQL application, but have not tested it with rake yet.

pd40 commented 10 years ago

Created pull request #159, which is less disruptive to the process and allows the Rails server to reconnect and continue. I have no access to a SQL Server test environment at the moment, so the change has not been tested with rake yet.

wbond commented 10 years ago

I just made a release at https://github.com/veracross/tiny_tds/releases/tag/0.6.2-beta1 which should include a fix for this. Can you guys try it out and let me know how it works?

pd40 commented 10 years ago

Thanks Will. I will hopefully have an environment to test it out today. Stay tuned...

wbond commented 10 years ago

This should be resolved by the 0.6.2 release that I just pushed up to rubygems.
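For anyone following along, picking up the fix is just a Gemfile bump (the constraint shown is illustrative):

# Gemfile
gem 'tiny_tds', '>= 0.6.2'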

lethunder commented 10 years ago

I'm actually not able to install the pre-beta gem; I get this message: LoadError: cannot load such file -- tiny_tds

lethunder commented 10 years ago

I've installed the pre-beta version, but I'm still facing timeout issues.

wbond commented 10 years ago

@lethunder The pre-beta is no longer necessary. Version 0.6.2 was pushed to rubygems around May 19th.

lethunder commented 10 years ago

I've tested both; I still have timeout issues :(

wbond commented 10 years ago

@lethunder Sounds like you should open a new issue and include info such as:

  1. Steps to reproduce
  2. FreeTDS dump of connection
  3. What version of SQL Server
  4. What version of tiny_tds

With that info, someone may be able to help you. It may, however, come down to diving in and figuring it out yourself.

D1plo1d commented 9 years ago

I'm still having timeout issues after 0.6.2 here as well. Here's my info:

Web Server

freetds (freetds-dev package on ubuntu 12.04.4 LTS): 0.91-1
tiny_tds: 0.6.2
ubuntu: 12.04.4 LTS
ruby: 2.1.2p95 (2014-05-08 revision 45877) [x86_64-linux]

SQL Server

windows: Microsoft Windows Server 2012 R2
sql server: SQL Server Standard 2012 Service Pack 1

wbond commented 9 years ago

@D1plo1d This is a closed issue. I'd probably recommend opening a new one with a FreeTDS dump.

Then you'll just have to wait for someone to spend the time to dig through it all.

petervandenabeele commented 9 years ago

Hi,

Sorry to post on a closed issue (but this seems relevant to what we see).

A few days ago, we saw a series of 15 timeouts during a 2-minute period.

During that period, a typical log entry (production was at the INFO log level, so we do not have the DEBUG-level data) looks like the following.

All the SELECT calls do work (they are not shown in the production INFO log); the first INSERT fails:

I, [2014-10-09T09:50:45.233755 #27243]  INFO -- : Started PUT "/api/aanvragen/2052" for x.x.x.x at 2014-10-09 09:50:45 +0200
I, [2014-10-09T09:50:45.244516 #27243]  INFO -- : Processing by Api::V1::AanvragenController#update as JSON
I, [2014-10-09T09:50:45.244811 #27243]  INFO -- :   Parameters: {"payload"=> ...
...
(many GET calls here, not shown in production log)
...

E, [2014-10-09T09:51:00.284821 #27243] ERROR -- : TinyTds::Error: Adaptive Server connection timed out: EXEC sp_executesql N'INSERT INTO [api_calls] ([centrum_id], [created_at], [entity], [method], [payload], [updated_at], [uri]) VALUES (@0, @1, @2, @3, @4, @5, @6); SELECT CAST(SCOPE_IDENTITY() AS bigint) AS Ident', N'@0 int, @1 datetime, @2 nvarchar(255), @3 nvarchar(255), @4 nvarchar(max), @5 datetime, @6 nvarchar(255)', @0 = 73, @1 = '2014-10-09T07:50:45.261',
I, [2014-10-09T09:51:00.298191 #27243]  INFO -- : CONNECTION RETRY: ActiveRecord::ConnectionAdapters::SQLServerAdapter retry #0.
I, [2014-10-09T09:51:00.304308 #27243]  INFO -- : Completed 500 Internal Server Error in 15059ms
F, [2014-10-09T09:51:00.328699 #27243] FATAL -- : 

The info I have is:

$ dpkg -l '*freetds*' | grep ^ii
ii  freetds-bin                       0.82-7                                  FreeTDS command-line utilities
ii  freetds-common                    0.82-7                                  configuration files for FreeTDS SQL client libraries
ii  freetds-dev                       0.82-7                                  MS SQL and Sybase client library (static libs and headers)

GIT
  remote: https://github.com/veracross/tiny_tds.git
  revision: 937c352d7a760c4b65e5cc581b299e9f63132da3
  specs:
    tiny_tds (0.6.2)

SQL Server 2012 (I have no further details at this time).

So, there is a first call that waits 15 seconds for a timeout. After that, a CONNECTION RETRY is attempted, and then 50 ms later the system dies.

We recently upgraded from an old Rails 3.2 set-up to Rails 4. But it is also only recently that we started running serious production volumes through the API and watching the production log in detail for these issues.

Any hints would be greatly appreciated.

Are there specific signs that we should search for in the SQL Server logs to determine if this is the issue we are facing?