Open yahonda opened 10 months ago
There was also one for PostgreSQL https://buildkite.com/rails/rails/builds/101440#018b86a9-5c82-496f-b435-ee6cb19ebf1f, probably for the same reason.
If this is not reproducible locally, should we debug it on CI then? Something like enabling verbose test logging (which prints test names) and using a timeout somewhere (or an at_exit hook or signal handler) to print the backtraces?
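A minimal sketch of the timeout idea, assuming it would be loaded from a test helper. The names `dump_thread_backtraces` and `start_watchdog` are hypothetical and are not part of Rails or minitest; the sketch only uses `Thread.list` and `Thread#backtrace` from Ruby core.

```ruby
# Hypothetical helper: print the backtrace of every live thread.
def dump_thread_backtraces(io = $stderr)
  Thread.list.each do |thread|
    io.puts "Thread #{thread.inspect}"
    # Thread#backtrace can return nil if the thread has already finished.
    (thread.backtrace || ["<no backtrace>"]).each { |line| io.puts "  #{line}" }
  end
end

# Hypothetical watchdog: if the process is still alive after `timeout`
# seconds, dump all backtraces so the CI log shows where it is stuck.
def start_watchdog(timeout, io: $stderr)
  Thread.new do
    sleep timeout
    io.puts "Watchdog: still running after #{timeout}s"
    dump_thread_backtraces(io)
  end
end
```

Calling `start_watchdog(25 * 60)` at the top of a test run would print the backtraces shortly before a 30 minute CI timeout; the same `dump_thread_backtraces` method could also be called from a signal handler.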
I don't think anyone is going to object to using CI to debug something like this, but you might run into permission issues, like needing someone to approve a build. In that case just ping someone on Discord.
Seems like there were no recent errors like this, so probably this can be closed.
Sorry for my late reply. It still reproduced recently: https://buildkite.com/rails/rails/builds/101853#018bc505-469a-4fde-8fe3-e3907515c3f0. Let me keep this issue open a while longer.
Just looking at this now.
Exited with status 255 (after intercepting the agent’s termination signal, sent because the job was canceled) (soft failed)
255 apparently means the agent was terminated: https://buildkite.com/docs/agent/v3#exit-codes
I'm going to ask on their slack to see if they might have more ideas. :pray:
After looking into this issue, my conclusion is that ReaperTest#test_idle_timeout_configuration is the culprit.
Disabling this test, I can no longer reproduce the failure, but I'm not sure how to actually fix it.
Details below.
My theory is that this test results in the process becoming unresponsive and eventually receiving a SIGKILL from docker.
In order to reproduce this issue, I've changed the CI to run the sqlite3 tests on ruby master 25 times, which you can see here: https://github.com/zzak/buildkite-config/compare/7be65339504ccc721235bb486557bb99a284a164...1527b0359a6dc3ae37d5de8431e4a27f293a5a62
The problem is visible in about 2-3 out of 25 jobs, for example.
With the test_idle_timeout_configuration test removed, you can see the build passes and even a rebuild worked just fine.
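The run-it-many-times approach above can be approximated outside Buildkite with a small loop. This is only a sketch: `run_repeatedly` is a hypothetical helper, and the command and timeout you pass in are assumptions to adjust for your own setup.

```ruby
require "timeout"

# Hypothetical helper: run `command` (an argv array) `runs` times and
# count how many runs passed, failed, or had to be killed for hanging.
def run_repeatedly(command, runs:, timeout:)
  results = Hash.new(0)
  runs.times do
    pid = Process.spawn(*command)
    begin
      _, status = Timeout.timeout(timeout) { Process.wait2(pid) }
      results[status.success? ? :passed : :failed] += 1
    rescue Timeout::Error
      # The run exceeded the time limit: kill it and reap the child.
      Process.kill("KILL", pid)
      Process.wait(pid)
      results[:hung] += 1
    end
  end
  results
end
```

For example, `run_repeatedly(%w[bundle exec rake test:sqlite3], runs: 25, timeout: 600)` from the activerecord directory would give a rough count of hung runs, assuming that rake task is the one you want to exercise.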
This is using the buildkite-config sandbox for testing, if you want to reproduce this:
Just note that if you push to this branch, that will also trigger a build on the main (Rails) pipeline, so you will want to cancel that or you will end up waiting forever for agents to boot: https://buildkite.com/rails/rails/builds?branch=zzak-debug-ci
This test was originally added in #33652, and was refactored in #43502 to use Process.clock_gettime(Process::CLOCK_MONOTONIC).
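For context, here is a rough sketch of how idle time can be measured with the monotonic clock. `IdleTracker` is a hypothetical class written for illustration and is not the Rails connection pool code; the clock is injectable so a test can advance time without sleeping.

```ruby
# Hypothetical illustration: track how long something has been idle
# using a monotonic clock, which is unaffected by wall-clock changes.
class IdleTracker
  MONOTONIC = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  def initialize(clock: MONOTONIC)
    @clock = clock
    @idle_since = @clock.call
  end

  # Mark the object as used right now.
  def touch
    @idle_since = @clock.call
  end

  # Seconds elapsed since the last use.
  def seconds_idle
    @clock.call - @idle_since
  end

  def idle?(timeout)
    seconds_idle >= timeout
  end
end
```

With a fake clock (a lambda returning a variable the test controls), a test can assert on `idle?` deterministically, which avoids depending on real elapsed time.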
Curiously, this test did fail sporadically when trying to reproduce this issue locally and outside of docker:
Failure:
ActiveRecord::ConnectionAdapters::ConnectionPoolFiberTest#test_idle_timeout_configuration [/Users/zzak/code/rails/activerecord/test/cases/connection_pool_test.rb:244]:
Expected: 0
Actual: 1
Since they are both testing the same thing, I think they are related.
I'm hoping this information gives people some ideas, since I'm out of them. I also wanted to share context on how I tested this.
Thanks! :bow:
Thank you for the detailed investigation. I still have no idea how to fix it.
https://github.com/rails/rails/pull/51038 adds a timeout.
This issue has been automatically marked as stale because it has not been commented on for at least three months.
The resources of the Rails team are limited, and so we are asking for your help.
If you can still reproduce this error on the 7-2-stable branch or on main, please reply with all of the information you have about it in order to keep the issue open.
Thank you for all your contributions.
This issue still exists.
Another CI job that ran into the 30 minute limit: https://buildkite.com/rails/rails/builds/107264#018f7da2-875a-437c-aa9e-d8a3b114135e
With the change from https://github.com/rails/rails/pull/51038, the job now gets SIGABRT: https://buildkite.com/rails/rails/builds/107261#018f7d9c-3b15-4f7a-9f53-9dfaa6e38dfa/1182-1191
Still valid.
Steps to reproduce
Unable to reproduce it locally. Here are some affected CI builds:
https://buildkite.com/rails/rails/builds/101101#018b5834-8aac-4e35-8ca5-33f964f18914
https://buildkite.com/rails/rails/builds/101132#018b5ece-c103-4dae-91a0-3608f85b4b6e
https://buildkite.com/rails/rails/builds/101140#018b6002-8f41-4153-894e-db355691ecd8
https://buildkite.com/rails/rails/builds/101324#018b783b-f914-4b93-8b94-760f04dc6806
https://buildkite.com/rails/rails/builds/101326#018b7928-a571-4f7e-907a-68f1620d168d
Expected behavior
It should finish successfully in a couple of minutes, like https://buildkite.com/rails/rails/builds/101342#018b7d60-4941-4bf9-8243-03b4676ad4a4
Actual behavior
It gets `# Received cancellation signal, interrupting` after running for 30 minutes.
System configuration
Rails version: main branch
Ruby version: 3.3.0p-1 (2023-10-29 revision 7f2809b0a9db2a8a4a04aeaf91db191dee383574) [x86_64-linux]. This Ruby version was used for the latest one: https://buildkite.com/rails/rails/builds/101342#018b7d60-4941-4bf9-8243-03b4676ad4a4