ruby / timeout

Timeout provides a way to auto-terminate a potentially long-running operation if it hasn't finished in a fixed amount of time.
Other
143 stars 21 forks source link

Idea: a standard "shutdown signal" #48

Closed jjb closed 4 months ago

jjb commented 4 months ago

To time out gracefully, the inner code needs to be told stop doing work. graphql-ruby has such a timeout

This can already be done with Timeout using 2 blocks, with the inner one sending a special exception, as shown below. This doesn't solve things like "what if the timeout exception is raised inside an else block", but I think the venn diagram is such that if ShutdownRequest is caught and the code comes back, it's safe.

require 'timeout'

class ShutdownRequest < StandardError; end
class ShutdownHandledGracefully < StandardError; end

def do_work
  1000.times do |n|
    puts n
    sleep 1
  rescue ShutdownRequest
    puts "shuting down gracefully..."
    raise ShutdownHandledGracefully
  end
ensure
  puts "cleaning up resources..."
  sleep 2
  puts "...done"
end

Timeout.timeout(3, ShutdownRequest) do
  do_work
rescue ShutdownHandledGracefully
  puts "nice."
end

puts
puts

# give it 1 second to try to shutdown gracefully, then force it
begin
  Timeout.timeout(4) do
    Timeout.timeout(3, ShutdownRequest) do
      do_work
    end
  end
rescue ShutdownHandledGracefully
  puts "nice."
rescue Timeout::Error
  puts "shutdown was not graceful, need to kill the process so there aren't broken resources..."
end
➔ ruby shutdown.rb
0
1
2
shuting down gracefully...
cleaning up resources...
...done
nice.

0
1
2
shuting down gracefully...
cleaning up resources...
shutdown was not graceful, need to kill the process so there aren't broken resources...

This is often not practical because the code we want to time out is many layers of libraries we don't control, and can't accommodate our special shutdown method.

If this pattern is indeed safe, maybe the timeout gem could formalize it, by:

a. defining its own Shutdown class b. allowing specification of both "shutdown signal time" and "strict time", which would implement the double layer pattern shown above

Timeout.timeout_with_grace_period(soft: 3, hard: 5){ }

Then, libraries and application code could support this concept. For example, rails ActiveRecord::Relation#find_each could rescue ShutdownRequest, then raise ShutdownHandledGracefully. if the outer code doesn't catch that, that's good - it will roll back a transaction and work through any chain of ensures so things are cleaned up. if it all happens within the soft time, then the state is clean. If not, then the process should be killed. (as with pitchfork or rack-timeout term_on_timeout.

If there is some reluctance to code the hard/soft functionality, but the pattern is sound, maybe as a first step, the class could simply be defined. Timeout::Shutdown - this gives implementers at all levels the opportunity to start supporting the pattern, referencing the same class.

jjb commented 4 months ago

/cc @mohamedhafez @mperham @headius

jjb commented 4 months ago

I showed this to @bensheldon and he had this additional idea

  loop do
    break if Timeout.time_remaining < 1
    do_work
  end

my thought, this could be supported either in standard mode or a no-raise mode

Timeout.timeout(5) do # or, timeout_without_raise, or something
  loop do
    break if Timeout.time_remaining < 1
    do_work
  end
end
eregon commented 4 months ago

Which version of timeout are you using? In the latest version an exception is used to "interrupt" the block, and not throw anymore. The exception is either: https://github.com/ruby/timeout/blob/a7e91a5666c09854e7d9b5e066debeb00a56b344/lib/timeout.rb#L29-L30 or the one passed to Timeout.timeout.

As mentioned, Timeout::ExitException is kind of internal and an implementation detail. The idea is if you want to catch the timeout exception inside the block, then use Timeout.timeout(n, MyException) { ... }.

Taking your example, why is this not enough?

require 'timeout'

class ShutdownRequest < StandardError; end

def do_work
  1000.times do |n|
    puts n
    sleep 1
  rescue ShutdownRequest => e
    puts "shuting down gracefully..."
    raise e
  end
ensure
  puts "cleaning up resources..."
  sleep 2
  puts "...done"
end

Timeout.timeout(3, ShutdownRequest) do
  do_work
rescue ShutdownRequest
  puts "nice."
end

puts
puts

begin
    Timeout.timeout(3, ShutdownRequest) do
      do_work
    end
rescue ShutdownRequest
  puts "nice."
end

The output is:

0
1
2
shuting down gracefully...
cleaning up resources...
...done
nice.

0
1
2
shuting down gracefully...
cleaning up resources...
...done
nice.

This is often not practical because the code we want to time out is many layers of libraries we don't control, and can't accommodate our special shutdown method. If this pattern is indeed safe, maybe the timeout gem could formalize it, by: a. defining its own Shutdown class

Is the issue to do extra logic on Timeout and a far caller does Timeout.timeout(n) { ... } (so no exception class)? If so would documenting Timeout::ExitException and make it non-internal be a solution here? I think that can be considered.

eregon commented 4 months ago

If it's about having a common exception for both Timeout.timeout(n) { ... } and Timeout.timeout(n, MyException) { ... } that seems unfortunately very difficult for compatibility, because I suspect some code might already depend on the second one raising a MyException inside the block and not just outside.

jjb commented 4 months ago

version: 0.4.1

why is this not enough?

in your new code it looks like you removed ShutdownHandledGracefully - i did a bad job naming this, it's not central to the proposal, it's a stand-in for something an application might do - a better name might have been Rollback - however, an even more realistic/better example is what you did with raise e - the application code raises the "reason" for the interruption, which is indeed the shutdown/exit request. So, bringing that all together to this simplified example, leveraging Timeout::ExitException:

require 'timeout'

def do_work
  1000.times do |n|
    puts n
    sleep 1
  rescue Timeout::ExitException => e
    puts "shuting down gracefully..."
    raise e
  end
ensure
  puts "cleaning up resources..."
  sleep 2
  puts "...done"
end

Timeout.timeout(3) do
  do_work
rescue Timeout::Error
  puts "process is not in a healthy state, need to kill the process so there aren't broken resources..."
rescue Exception # this is what an error reporter would do
  puts "work was not completed, but process is in a healthy state"
end

puts
puts

# give it 1 second to try to shutdown gracefully, then force it
begin
  Timeout.timeout(4) do
    Timeout.timeout(3) do
      do_work
    end
  end
rescue Timeout::Error
  puts "process is not in a healthy state, need to kill the process so there aren't broken resources..."
rescue Exception # this is what an error reporter would do
  puts "work was not completed, but process is in a healthy state"
end

So, this is nice because:

a. it's more simple b. leverages existing timeout behavior c. has the overall behavior i want

however, here is some application code that doesn't have the behavior i want - what if other layers of application or library code catch Timeout::ExitException - now the second one gets caught, and the "hard" limit is not enforced. (i put the new behavior inside the ensure in order to make the diff between the examples smaller - more realistic would be multiple loops, each with their own rescue and cleanup code which takes time).

require 'timeout'

def do_work
  1000.times do |n|
    puts n
    sleep 1
  rescue Timeout::ExitException => e
    puts "shuting down gracefully..."
    raise e
  end
ensure
  begin
    puts "cleaning up resources..."
    sleep 2
    puts "...done"
  rescue Timeout::ExitException
    puts "rescued inside the ensure. phew! now to clean up, for 9000 seconds..."
    sleep 9000
  end
end

Timeout.timeout(3) do
  do_work
rescue Timeout::Error
  puts "process is not in a healthy state, need to kill the process so there aren't broken resources..."
rescue Exception # this is what an error reporter would do
  puts "work was not completed, but process is in a healthy state"
end

puts
puts

# give it 1 second to try to shutdown gracefully, then force it
begin
  Timeout.timeout(4) do
    Timeout.timeout(3) do
      do_work
    end
  end
rescue Timeout::Error
  puts "process is not in a healthy state, need to kill the process so there aren't broken resources..."
rescue Exception # this is what an error reporter would do
  puts "work was not completed, but process is in a healthy state"
end

would documenting Timeout::ExitException and make it non-internal be a solution here? I think that can be considered.

that sounds pretty cool and it's neat to discover i could make the more simple example above- but as shown in the 2nd example above, i think the semantics of what's going on are not sufficient. we need a standalone "please try to shutdown gracefully" signal.

If it's about having a common exception for both Timeout.timeout(n) { ... } and Timeout.timeout(n, MyException) { ... } that seems unfortunately very difficult for compatibility, because I suspect some code might already depend on the second one raising a MyException inside the block and not just outside.

IIUC, i believe your concern is not relevant to my proposal. my proposal would probably require a very different and new style of invocation. (e.g. Timeout.timeout_with_grace_period(soft: 3, hard: 5){ } from my OP). Libraries could choose to support it, just in case the outer code is using it. This new invocation would send the special Shutdown exception after soft seconds, then have behavior identical to current Timeout after hard seconds. The same api features of current Timeout could be supported with keyword arguments.

eregon commented 4 months ago

what if other layers of application or library code catch Timeout::ExitException

Then it's a clear bug of that application/library code. They should never catch Timeout::ExitException. Just like doing something blocking for a while in ensure is bad practice and can cause issues/hangs/second Ctrl+C/etc.


I read your whole post but I am still not entirely sure why you want a soft exception and a hard exception. If the hard exception is caught, that's a bug and there is nothing we can do (well, timeout used to use throw but that has other problems). If the soft exception is caught, I think that's also a bug.

I guess the soft/hard is about handling code in rescue/ensure which misbehaves and blocks for a while. But then it's not clear the hard exception wouldn't cause the same problem (i.e. the next rescue/ensure might misbehave too). And even if it did help, it causes another problem, code in some ensure/rescue was interrupted so now we have doubly-inconsistent state (and the second is worse, it means some code in ensure/rescue wasn't run). It feels to me if you want/need a hard timeout, then SIGKILL is the way to go (SIGTERM wouldn't help, it's treated as a SignalException). (there is also exit and exit! but they also depend on rescue/ensure not misbehaving, so not helping)

eregon commented 4 months ago

Regarding https://github.com/ruby/timeout/issues/48#issuecomment-2095435224 that looks like a safe way to interrupt (which might of course not work if not enough checks) but it feels something that can be easily handled without Timeout, i.e. just comparing some Process.clock_gettime(Process::CLOCK_MONOTONIC) values. Maybe something for documentation rather?

jjb commented 4 months ago

I read your whole post but I am still not entirely sure why you want a soft exception and a hard exception.

Thanks for reading and your response. Wow, I'm really not being clear 😅. Below I'll explain my goal, why I'm now realizing my idea is probably not tenable, and why maybe Ben's idea is.

I am trying to enable a concept of "safe shutdown" across the ruby ecosystem. Examples of this:

So, a loop checks on each iteration "should i keep going?".

If there were a "universal shutdown exception class" in ruby, loops could rescue it and halt work. Timeout could raise this exception in the thread, and check if the exception bubbled all the way back up. If it does, then Timeout will assume all the inner code is behaving well. As I'm writing this I'm realizing this isn't tenable for the same reason as @headius wrote about 15 years ago, the exception can be raised anywhere including inside an ensure. To support my model (or some other solution), ruby-core would need to encode special behavior in ensure, probably for ruby 4. (which is probably possible and I should start proposing that elsewhere).

Meanwhile, Ben's idea seems pretty awesome (i'll add the "soft" concept)

loop do
  break unless Timeout.soft_time_remaining > 0
  do_work
end

it feels something that can be easily handled without Timeout, i.e. just comparing some Process.clock_gettime(Process::CLOCK_MONOTONIC) values.

But what I'm trying to do is make a standard way to support a concept of "external signal to shut down" - so activerecord find_each could implement this, and then rails users could pick whatever timeout value they like from the outside.

what if other layers of application or library code catch Timeout::ExitException

Then it's a clear bug of that application/library code. They should never catch Timeout::ExitException.

I only proposed that because I thought you were suggested it 😆 when you said "would documenting Timeout::ExitException and make it non-internal be a solution here?". I agree that code should not catch this internal exception.

eregon commented 4 months ago

To support my model (or some other solution), ruby-core would need to encode special behavior in ensure, probably for ruby 4. (which is probably possible and I should start proposing that elsewhere).

There is already a ruby-core issue about this which you might find interesting: https://bugs.ruby-lang.org/issues/17849

I only proposed that because I thought you were suggested it 😆 when you said "would documenting Timeout::ExitException and make it non-internal be a solution here?". I agree that code should not catch this internal exception.

To be clear, by "catch Timeout::ExitException" I meant rescue it and not re-raise it. I think it's totally fine to rescue it if it is re-raise it after, if one needs special cleanup on timeout and that code can't just be in an ensure (well except that's it's an internal name that might change, but unlikely and we could stop documenting it as if it's going to change).

Re Timeout.soft_time_remaining > 0, how would this work if there are nested Timeout.timeout usages? That sounds rather tricky to implement (but I have not tried it). I think that functionality should be a predicate for simplicity, like Timeout.soft_timed_out?, but of course same issue regarding nesting (although then it's just a boolean, so could maybe be stored as a fiber-local and saved/restored).

jjb commented 4 months ago

Wow, thanks for pointing me to that ruby core discussion!

how would this work if there are nested Timeout.timeout usages

good point, invocations would maybe be aware of one another and do some appropriate math

@mperham in a slack discussion mentioned go Contexts

considering all of the above, i think my idea doesn't need to and probably shouldn't live in Timeout - it can stand on its own. I'll make a little POC and see what folks think.

Thanks for the discussion!