socketry / async

An awesome asynchronous event-driven reactor for Ruby.
MIT License

Properly managing interrupts (works for async v 1.31, "breaks" for async v 2.5.6) #291

Closed: viktorvsk closed this issue 6 months ago

viktorvsk commented 7 months ago

Hi! It seems to me I'm missing something basic/fundamental in Async's expected workflow, but after finding this issue https://github.com/socketry/async/issues/220 I suspect it may be some sort of bug, so I decided to ask here.

I extracted the simplest reproducible example out of my app. I'm on a Mac and constantly hit the issue there with ruby 3.2.2 and async 2.6.5. I quickly checked it on Ubuntu, but there I have ruby 3.0.2, so it installs async 1.31. With that setup (ruby 3.0.2, async 1.31) my extracted app works as expected, which is also why I assume it may be a bug in async 2+.

But maybe I'm missing something basic and you could help me understand it? Here is the gist of the simplified app. Start it with ruby app.rb, given these prerequisites:

  1. Redis on default localhost:6379
  2. async-websocket and redis gems installed

In short, the app should run an infinite loop in which it sometimes fetches new state from Redis. This state contains the configs of servers that a websocket should connect to and do some work with. On each loop tick we check which servers are active now (for simplicity, this is emulated so that there is always one real server, which belongs to me), stop those that are running but no longer active, and add the active servers to "currently running". When a server is stopped/deactivated, I want to run some cleanup, i.e. drop the connection, manage state in Redis, and stop the Async task.
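The per-tick bookkeeping described above can be sketched roughly as follows. This is only an illustration in plain Ruby: the method name reconcile and the server names are hypothetical and are not taken from the gist, and the real app would stop and start Async tasks rather than just compute lists.

```ruby
# Hypothetical helper: given the servers currently running and the servers
# that are active in the latest state, work out what to stop and what to start.
def reconcile(running, active)
  to_stop  = running - active   # running, but no longer active
  to_start = active - running   # active, but not running yet
  [to_stop, to_start]
end

# Example tick with made-up server names.
to_stop, to_start = reconcile(%w[server-a server-b], %w[server-b server-c])
puts "stop: #{to_stop.inspect}, start: #{to_start.inspect}"
```

The cleanup mentioned above (dropping the connection, updating Redis) would run for each entry in to_stop, which is where the on_close behavior comes in.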

The problem is that on_close behavior is not consistent. It happens because of this line: REDIS.ping (in the real app I do different calculations, but any call to Redis breaks the expected behavior). If I remove the call to Redis and add pure-Ruby calculations instead, everything works fine. I suspect it's the I/O calls that cause the issue.

The inconsistency looks like this: when I press CTRL+C, sometimes everything works as expected, and sometimes execution silently stops on the line with REDIS.ping. It seems to depend on WHEN you interrupt it with CTRL+C, but that's just my assumption. Maybe I'm missing some basic concepts with Sync/Async, nested tasks or global/shared variables?

Please let me know if I can share more debugging info.

UPDATE

I tested this simplified app on Ubuntu using ruby 3.2.2 (ruby runs in Docker, but that should not make any difference). With the latest async 2.6.5 I get the same behavior as on Mac: inconsistent interruptions. However, with gem "async", "~> 1" in the Gemfile everything works as expected. Here is the Gemfile.lock for the setup where it works:

GEM
  remote: https://rubygems.org/
  specs:
    async (1.31.0)
      console (~> 1.10)
      nio4r (~> 2.3)
      timers (~> 4.1)
    async-http (0.61.0)
      async (>= 1.25)
      async-io (>= 1.28)
      async-pool (>= 0.2)
      protocol-http (~> 0.25.0)
      protocol-http1 (~> 0.16.0)
      protocol-http2 (~> 0.15.0)
      traces (>= 0.10.0)
    async-io (1.38.0)
      async
    async-pool (0.4.0)
      async (>= 1.25)
    async-websocket (0.25.1)
      async-http (~> 0.54)
      async-io (~> 1.23)
      protocol-rack (~> 0.1)
      protocol-websocket (~> 0.11)
    connection_pool (2.4.1)
    console (1.23.2)
      fiber-annotation
      fiber-local
    fiber-annotation (0.2.0)
    fiber-local (1.0.0)
    nio4r (2.6.1)
    protocol-hpack (1.4.2)
    protocol-http (0.25.0)
    protocol-http1 (0.16.0)
      protocol-http (~> 0.22)
    protocol-http2 (0.15.1)
      protocol-hpack (~> 1.4)
      protocol-http (~> 0.18)
    protocol-rack (0.2.6)
      protocol-http (~> 0.23)
      rack (>= 1.0)
    protocol-websocket (0.12.1)
      protocol-http (~> 0.2)
    rack (3.0.8)
    redis (5.0.8)
      redis-client (>= 0.17.0)
    redis-client (0.18.0)
      connection_pool
    timers (4.3.5)
    traces (0.11.1)

PLATFORMS
  x86_64-linux

DEPENDENCIES
  async (~> 1)
  async-websocket
  redis

RUBY VERSION
   ruby 3.2.2p53

BUNDLED WITH
   2.4.10

ioquatix commented 7 months ago

Thanks, I’ll take a look.

ioquatix commented 6 months ago

I investigated this issue.

As it currently stands, when you press Ctrl-C, the event loop is interrupted and exits. This causes #scheduler_close to be invoked which uses #terminate to forcefully exit all tasks.

The terminate mechanism invokes stop on all tasks until they complete. Invoking a blocking operation from an ensure block in this case will yield back to the event loop which will then attempt to stop the task again.
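To make the effect concrete, here is a rough simulation using plain Ruby fibers only. It is not Async's actual implementation: StopSignal and simulate_terminate are hypothetical names, and Fiber.yield stands in for a blocking call such as REDIS.ping. It sketches why a task's cleanup can be cut short when it is stopped again while parked inside its ensure block.

```ruby
# Hypothetical stand-in for a stop exception; not the library's class.
class StopSignal < Exception; end

def simulate_terminate
  log = []
  task = Fiber.new do
    begin
      log << :working
      Fiber.yield               # normal work, parked in the "event loop"
    ensure
      log << :cleanup_started
      Fiber.yield               # stands in for blocking I/O during cleanup
      log << :cleanup_finished  # skipped if the task is stopped again first
    end
  end

  task.resume                   # run until the task parks for the first time

  attempts = 0
  while task.alive?             # "terminate": stop the task until it is finished
    attempts += 1
    begin
      task.raise(StopSignal)
    rescue StopSignal
      # the exception left the fiber, so the task is now finished
    end
  end
  [attempts, log]
end

attempts, log = simulate_terminate
puts "attempts=#{attempts} log=#{log.inspect}"
```

In this sketch the second stop lands while the fiber is parked inside its ensure block, so the line after the simulated I/O call never runs. That resembles the reported symptom of execution silently stopping at REDIS.ping.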

I made a PR to change Interrupt to invoke #stop. This will let your program gracefully shut down. It's possible that badly behaving code could cause the event loop to hang while exiting, e.g.

require 'async'

Async do
  # ... normal code ...
ensure
  sleep # forever
end

In this case, however, pressing Ctrl-C a second time will exit the ensure block. I think this is okay. In other words, either we have a robust termination, or we allow user code to indefinitely hang the event loop on exit. In practice, it will mean that SIGTERM would be used to kill the program without graceful shutdown. Which I think is okay.
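As a rough illustration of the graceful path, the following plain-fiber simulation delivers a single stop request and then keeps driving the task so that I/O in its ensure block can finish. It is a sketch under assumptions, not the code from the PR: StopSignal and simulate_graceful_stop are hypothetical names, and Fiber.yield stands in for a blocking call.

```ruby
# Hypothetical stand-in for a stop exception; not the library's class.
class StopSignal < Exception; end

def simulate_graceful_stop
  log = []
  task = Fiber.new do
    begin
      log << :working
      Fiber.yield               # normal work, parked in the "event loop"
    ensure
      log << :cleanup_started
      Fiber.yield               # stands in for blocking I/O during cleanup
      log << :cleanup_finished
    end
  end

  task.resume                   # run until the task parks for the first time

  begin
    task.raise(StopSignal)      # first Ctrl-C: ask the task to stop, once
    # Keep the "event loop" running so the cleanup I/O can complete.
    task.resume while task.alive?
  rescue StopSignal
    # the stop request left the task after its ensure block completed
  end
  log
end

puts simulate_graceful_stop.inspect
```

If the ensure block never finished (as with sleep in the example above), the resume loop in this sketch would never end, which corresponds to the hang that a second Ctrl-C is meant to break.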

viktorvsk commented 5 months ago

Thanks for a quick fix and the detailed explanation!