rails-sqlserver / tiny_tds

TinyTDS - Simple and fast FreeTDS bindings for Ruby using DB-Library.

Improve handling of network related timeouts #481

Closed: bvogelzang closed this 3 years ago

bvogelzang commented 3 years ago

Summary

In cases of network failure while a command batch is running, configured timeouts were not being respected. This meant that the time elapsed before a TinyTds::Error was raised could extend well beyond the configured timeout (#445).

The reason is that dbcancel does not work under these conditions. When the timeout is reached, tinytds_err_handler is called, we return INT_TIMEOUT, and we attempt to cancel the command batch using dbcancel. That call fails, which means we continue to wait for the command batch to return even as subsequent timeout errors are reported.

Ideally, to fix this we would detect whether dbcancel failed and return INT_CANCEL in those circumstances. Unfortunately, we can't do that because dbcancel always returns SUCCEED.

As an alternative, when a timeout occurs we set a flag and use it to return INT_CANCEL on the next timeout error we encounter. This allows the pending command batch to return so that nogvl_cleanup can raise the Ruby exception. It works well, except that two timeout periods must elapse before INT_CANCEL is sent: one reported timeout error to set the flag and another to act on it. To get around this we use dbsetinterrupt to check the flag periodically while waiting on a read from the server, return INT_CANCEL sooner, and raise the timeout exception.

This shouldn't have any effect on normal timeout conditions, because dbcancel actually succeeds there and the normal flow runs before the interrupt can be called or handled. This is good, because in those situations we still want the db process to remain in working condition.
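
Seen from Ruby, the behaviour this change targets looks roughly like the sketch below (a hedged illustration, not code from this PR; connection details and timings are borrowed from the scratch scripts later in this thread). A network failure mid-query should now surface as a TinyTds::Error shortly after the configured timeout rather than well beyond it.

require 'tiny_tds'

client = TinyTds::Client.new(
  host: 'localhost',
  username: 'sa',
  password: 'Testing123@@',
  timeout: 5
)

started = Time.now
begin
  # Drop the network (pfctl, docker pause, etc.) while this statement is waiting.
  client.execute("waitfor delay '00:01:00'").to_a
rescue TinyTds::Error => e
  # With this change the error should arrive near the configured timeout
  # (plus up to the interrupt polling interval), rather than well beyond it.
  puts "raised after #{(Time.now - started).round(1)}s: #{e.message}"
end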

Fixes #445.

Notes

aharpervc commented 3 years ago

This looks like a great contribution, very nice.

  • I can't come up with a good way to write a test for this scenario. I would love suggestions on how we might approach this.

How did you test it during development?

  • Returning INT_CANCEL means that the db process will be killed. It's not ideal that after some timeout errors the connection will still be usable, while for others it will not. This is how things behaved prior to this PR, but I wanted to call it out. Is this worth adding as a note in the readme?

It seems like a really good idea to write this down somewhere. I'm also a little unclear on when a timeout will cause a connection to die and when it won't.

  • In my testing, the interrupt was called once per second. This means the timeout exception will be raised 1 second after hitting the configured timeout. I have no idea what the interrupt interval is in other environments so it is possible that this could vary.

Do you know who's responsible for defining that one-second period? Is that TinyTDS? FreeTDS? Someone else?

  • Other solutions I considered were trying to kill the thread spawned by rb_thread_call_without_gvl and raising the error directly from the error handler using rb_thread_call_with_gvl, but I couldn't get either one working because I couldn't differentiate between timeouts caused by normal db processing and those caused by network-related issues.

I think your current solution seems reasonable. What do you mean "normal db processing" though?

bvogelzang commented 3 years ago

How did you test it during development?

I had a scratch script that would use pfctl to drop the connection.

IO.write('/etc/pf.conf', File.open('/etc/pf.conf') { |f| f.read.gsub(/#block drop/, "block drop") })
system('pfctl -f /etc/pf.conf > /dev/null 2>&1')

Something like that isn't going to be cross-platform and requires a bit of manual setup. If we can come up with a reliable set of commands to drop/restore a network connection across platforms then we can likely write a test for this case. I doubt it's desirable to drop the network across the whole system just to run a test, though, so ideally it would affect only the test host.

It seems like a really good idea to write this down somewhere. I am also a little unclear when a timeout will cause a connection to die and when it won't.

I'll take an initial stab at drafting something for the readme.

Do you know who's responsible for defining that one second period? Is that tiny tds? freetds? someone else?

Initially I couldn't find it, but it looks like FreeTDS defines this at 1 second: https://github.com/FreeTDS/freetds/blob/cfa161498673d91a963d173561cbc99835da1cfa/src/tds/net.c#L622. That means it should be consistent across environments.

What do you mean "normal db processing" though?

I mean cases where the timeout is triggered while waiting for results from the db and the network connection is up, e.g. the timeout is configured for 1 second but a waitfor delay '00:00:02' causes a timeout.
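
For illustration, a "normal" timeout of that kind looks something like this minimal sketch (credentials are placeholders); because dbcancel succeeds here, the connection should remain usable afterwards:

require 'tiny_tds'

client = TinyTds::Client.new(host: 'localhost', username: 'sa',
                             password: 'Testing123@@', timeout: 1)
begin
  client.execute("waitfor delay '00:00:02'").to_a
rescue TinyTds::Error => e
  puts e.message
end
puts client.dead? # expected false: this kind of timeout leaves the db process working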

aharpervc commented 3 years ago

If we can come up with a reliable set of commands to drop/restore a network connection across platforms then we can likely write a test for this case.

SQL Server is running in a container, right? Does pausing the container after creating a connection have the same effect? Or maybe we can do something like this to pause the docker network temporarily: https://stackoverflow.com/a/56218103

We can also conditionally skip the test based on env vars, etc. I think it would be nice to have something even if it's silly and shells out to docker.
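
A rough sketch of what that could look like (not the test that ended up in the suite; the container name, env var guard, and credentials are placeholders):

require 'minitest/autorun'
require 'tiny_tds'

class NetworkTimeoutTest < Minitest::Test
  def test_raises_timeout_when_container_is_paused
    skip 'set TINYTDS_PAUSE_CONTAINER_TESTS to run' unless ENV['TINYTDS_PAUSE_CONTAINER_TESTS']

    client = TinyTds::Client.new(host: 'localhost', username: 'sa',
                                 password: 'Testing123@@', timeout: 2)
    # Simulate a network failure by freezing the server mid-query.
    system('docker pause sqlserver')
    assert_raises(TinyTds::Error) do
      client.execute("waitfor delay '00:00:10'").to_a
    end
  ensure
    system('docker unpause sqlserver')
  end
end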

bvogelzang commented 3 years ago

Great suggestion. Pausing the container does have a similar effect. I've gone ahead and refactored an existing test we were previously skipping altogether. If you have a better suggestion for the skip logic I'd be happy to incorporate it.

bvogelzang commented 3 years ago

These changes are now available to test in 2.1.4.pre

aharpervc commented 3 years ago

These changes are now available to test in 2.1.4.pre

This is probably fine for what it is; however, we still have some timeout-related problems. For example, when I run this scratch script and kill SQL Server, the main thread appears to hang forever instead of raising a timeout error.

  1. run a sql server container
  2. run this scratch script
  3. immediately stop the sql server container
  4. the counter thread keeps running while the main thread hangs instead of raising a timeout

require 'tiny_tds'

def w(*args)
  puts [Thread.current.object_id, *args].join ' '
end

t = Thread.new do
  i = 0
  loop do
    i += 1
    w i
    sleep(1)
  end
end

c = TinyTds::Client.new({
  host: 'localhost',
  username: 'sa',
  password: 'Testing123@@',

  timeout: 5,
  login_timeout: 5,

  message_handler: ->(e) { w e },
})

w TinyTds::VERSION

sql = <<~SQL
create or alter procedure tinytds_TestPrintWithError
as
begin
    raiserror('raiserror', 1, 1) with nowait
    waitfor delay '00:01:00'
end
SQL

w c.execute(sql).to_a
w c.execute('exec tinytds_TestPrintWithError').to_a

t.exit

$ bundle exec ruby .\scratch.rb
23804380 1
4649200 2.1.4.pre
4649200
23804380 2
23804380 3
23804380 4
23804380 5
23804380 6
23804380 7
23804380 8
23804380 9
23804380 10
23804380 11
23804380 12
23804380 13
23804380 14
23804380 15
(...snip...)

I'm not sure whether this particular problem should be addressed by this PR, or whether we should just note that it's still a problem and tackle it separately.

sixfeetover commented 3 years ago

One suggestion for how to test this is to use ToxiProxy. It's a tool for simulating different kinds of network failure. You can boot a toxiproxy container, connect to MSSQL through it, and then kill the connection, add jitter or latency, etc.
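
For example, with the toxiproxy gem and a toxiproxy server sitting in front of SQL Server, a sketch along these lines (names, ports, and credentials are illustrative) can sever the connection mid-query:

require 'tiny_tds'
require 'toxiproxy'

# One-time proxy setup: toxiproxy listens on 21433 and forwards to SQL Server.
Toxiproxy.populate([
  { name: 'mssql', listen: 'localhost:21433', upstream: 'localhost:1433' }
])

client = TinyTds::Client.new(host: 'localhost', port: 21433, username: 'sa',
                             password: 'Testing123@@', timeout: 5)

# Take the proxy down while the query is waiting; the client should raise
# a TinyTds::Error near the configured timeout instead of hanging.
Toxiproxy[:mssql].down do
  begin
    client.execute("waitfor delay '00:01:00'").to_a
  rescue TinyTds::Error => e
    puts "timed out: #{e.message}"
  end
end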

bvogelzang commented 3 years ago

I was unable to reproduce this using your scratch script. I used ToxiProxy to test network failures (slow close, timeout, and down) as well as manually pausing/stopping my sqlserver docker container. In all instances I hit a timeout at the expected point (6 seconds).

aharpervc commented 3 years ago

These changes look good to me. We've been testing 2.1.4-pre2 in production for a week or so without any ill effect from this PR.

Can you squash/edit your commits down to whatever's appropriate for master history before merging?