rails / rails

Ruby on Rails
https://rubyonrails.org
MIT License
55.65k stars 21.57k forks source link

SQLite3Adapter test using ruby master branch got `Received cancellation signal` after 30 min #49841

Open yahonda opened 10 months ago

yahonda commented 10 months ago

Steps to reproduce

Unable to reproduce it locally. Here are

https://buildkite.com/rails/rails/builds/101101#018b5834-8aac-4e35-8ca5-33f964f18914 https://buildkite.com/rails/rails/builds/101132#018b5ece-c103-4dae-91a0-3608f85b4b6e https://buildkite.com/rails/rails/builds/101140#018b6002-8f41-4153-894e-db355691ecd8 https://buildkite.com/rails/rails/builds/101324#018b783b-f914-4b93-8b94-760f04dc6806 https://buildkite.com/rails/rails/builds/101326#018b7928-a571-4f7e-907a-68f1620d168d

Expected behavior

It should finish successfully in couple of minutes like https://buildkite.com/rails/rails/builds/101342#018b7d60-4941-4bf9-8243-03b4676ad4a4

Actual behavior

It gets # Received cancellation signal, interrupting` after 30 min running.

System configuration

Rails version: main branch

Ruby version: 3.3.0p-1 (2023-10-29 revision 7f2809b0a9db2a8a4a04aeaf91db191dee383574) [x86_64-linux]  This Ruby version used for the latest one https://buildkite.com/rails/rails/builds/101342#018b7d60-4941-4bf9-8243-03b4676ad4a4

fatkodima commented 10 months ago

There was also one for PostgreSQL https://buildkite.com/rails/rails/builds/101440#018b86a9-5c82-496f-b435-ee6cb19ebf1f, probably of the same reason.

If this is not reproducible, should we debug then on CI? Something like enabling verbose tests logging (prints test names) and use timeout somewhere (or at_exit signal) to print the backtraces?

zzak commented 10 months ago

I don't think anyone going to object to using CI to debug something like this, but you might run into permission issues like to approve a build, then just ping someone on discord.

fatkodima commented 10 months ago

Seems like there were no recent errors like this, so probably this can be closed.

yahonda commented 10 months ago

Sorry for my late reply. Somehow it still reproduces recently https://buildkite.com/rails/rails/builds/101853#018bc505-469a-4fde-8fe3-e3907515c3f0 let me open this issue some more time.

zzak commented 10 months ago

Just looking at this now.

Exited with status 255 (after intercepting the agent’s termination signal, sent because the job was canceled) (soft failed)

255 apparently means the agent was terminated: https://buildkite.com/docs/agent/v3#exit-codes

I'm going to ask on their slack to see if they might have more ideas. :pray:

zzak commented 9 months ago

After looking into this issue, my conclusion is that ReaperTest#test_idle_timeout_configuration is the culprit.

Disabling this test, I can no longer reproduce the failure, but I'm not sure how to actually fix it.

Details below.

How

My theory is that this test results in the process becoming unresponsive and eventually receiving a SIGKILL from docker.

In order to reproduce this issue, I've changed the CI to run a sqlite3 tests on ruby master 25 times, which you can see here: https://github.com/zzak/buildkite-config/compare/7be65339504ccc721235bb486557bb99a284a164...1527b0359a6dc3ae37d5de8431e4a27f293a5a62

The problem is visible in about 2-3 out of 25 jobs, for example.

With the test_idle_timeout_configuration test removed, you can see the build passes and even a rebuild worked just fine.

This is using the buildkite-config sandbox for testing, if you want to reproduce this:

  1. commit to rails/rails: zzak-debug-ci branch
  2. trigger a "New Build" on the config-sandbox pipeline
  3. set the branch as "debug-ci" and click "create build"
  4. click "Review build script"
  5. hit "OK"
  6. click "Build Rails main with new config" step once it starts

Just note that if you push to this branch, that will also trigger a build on the main (Rails) pipeline, so you will want to cancel that or you will end up waiting forever for agents to boot: https://buildkite.com/rails/rails/builds?branch=zzak-debug-ci

Investigate

This test was originally added in #33652, and was refactored in #43502 to use Process.clock_gettime(Process::CLOCK_MONOTONIC).

Curiously, this test did fail sporadically when trying to reproduce this issue locally and outside of docker:

Failure:
ActiveRecord::ConnectionAdapters::ConnectionPoolFiberTest#test_idle_timeout_configuration [/Users/zzak/c
ode/rails/activerecord/test/cases/connection_pool_test.rb:244]:
Expected: 0
  Actual: 1

Since they are both testing the same thing, I think they are related.

But I'm hoping this information gives people some ideas, since I'm out of them, and also wanted to share context with how I tested this.

Below the fold more debugging steps

:writing_hand: For anyone willing to help debug this, and doesn't want to (or have access) to push to the branch on rails/rails ## Timeouts I've dialed down the timeout to 5 minutes, since this job typically finishes in under 3. Also, I've bumped the `docker-compose` buildkite plugin, as there was a similar issue that was reported in buildkite-plugins/docker-compose-buildkite-plugin#389. While this did not help with debugging the issue, I've noticed that retries are actually working with the latest version, so the problem can "self-heal". Essentially, this version of the plugin will properly call `docker stop` on the container once it exceeds the timeout window for the step, in our case 30 minutes. This will send a `SIGTERM` until a timeout eventually sending `SIGKILL`: https://docs.docker.com/engine/reference/commandline/stop/ I was able to reproduce the problem locally as well, just in normal docker and buildkite-config. However, I gave up on trying to capture the signal and `puts caller`. Instead, dropped [minitest-reporters](https://github.com/minitest-reporters/minitest-reporters) in the Gemfile to print all of the test cases while they were being executed. When using the same `SEED` lead to this test always being the last one run when the problem was reproduced. ### Setup the test environment 1. `git clone --depth=1 --branch=zzak-debug-ci https://github.com/rails/rails` 2. `cd rails` 3. `git clone https://github.com/rails/buildkite-config .buildkite` ### Required steps for Apple Silicon Skip if on x86 (only applies to arm64 / apple silicon). * Remove the `entrypoint` from `docker-compose.yml` * Remove the `curl ... await-linux-amd64 && install` from `Dockerfile` * Remove `ADD .buildkite/await-all` and `chmod /await-all` from `Dockerfile` * Make sure you KEEP `runner` in there This is until rails/buildkite-config#42 is merged. ### Build the base image From the `rails` checkout directory: ``` docker build -t bk-base --platform linux/amd64 --build-arg RUBY_IMAGE=rubylang/ruby:master-nightly-jammy -f .buildkite/Dockerfile . ``` This can take a while and the image may exceed 2gb, FYI. ### Run the container Once the image is built, I've just opened 10 windows and ran this command at the same time: ``` RUBY_YJIT_ENABLE="1" RUBY_IMAGE="rubylang/ruby:master-nightly-jammy" IMAGE_NAME=bk-base docker-compose -f .buildkite/docker-compose.yml run default runner activerecord 'rake sqlite3:test' ``` Eventually one of them will get stuck. ### Killing the process You can try `docker stop` or send any signals to try and get the `Signal.trap` we put in place, but I could never get it to hit when using docker. If you run the same test outside of docker, you can just use `kill` with the proper signal and it will work, you can catch the signal. But I was unable to reproduce the process hanging outside of docker, as noted above.

Thanks! :bow:

yahonda commented 9 months ago

Thank you for the detailed investigation. I still have no idea how to fix it.

yahonda commented 7 months ago

https://github.com/rails/rails/pull/51038 adds timeout.

rails-bot[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not been commented on for at least three months. The resources of the Rails team are limited, and so we are asking for your help. If you can still reproduce this error on the 7-2-stable branch or on main, please reply with all of the information you have about it in order to keep the issue open. Thank you for all your contributions.

yahonda commented 4 months ago

This issue still exists.

rails-bot[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not been commented on for at least three months. The resources of the Rails team are limited, and so we are asking for your help. If you can still reproduce this error on the 7-2-stable branch or on main, please reply with all of the information you have about it in order to keep the issue open. Thank you for all your contributions.

fatkodima commented 1 month ago

Still valid.