@eregon @andrykonchin The specs added by TruffleRuby will need to be fixed. What should we do about the Thread#wakeup spec that passes for me locally on macOS but fails on CI macOS?
Re 1, anything relying on sleep is extremely brittle and will likely cause transients. So if there is a way to write it without sleep, that may be fine, but otherwise such a transient spec cannot be added to ruby/spec (as it would fail in CI too often). If it passes reliably on Linux you could guard it with platform_is :linux do ... end.
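For instance, a rough sketch (the example body is just illustrative, not the exact spec):

# Only run the timing-sensitive example where it has been passing reliably.
platform_is :linux do
  it "sleeps for at least the requested time" do
    t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    100.times { sleep 0.0001 }
    (Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0).should >= 0.01
  end
end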
One idea to make it more reliable would be to measure how long sleep(0.001) takes; if it takes less than 1ms then stop and call it good, otherwise retry a number of times.
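Reading that as "retry the measurement and only fail if it never lands within the expected bound", something like this untested sketch (the 0.01 bound and the retry count are placeholders, not from the spec):

expected_max = 0.01                 # placeholder bound
elapsed = nil
5.times do
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  sleep 0.001
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
  break if elapsed < expected_max   # good enough, stop retrying
end
elapsed.should < expected_max       # only fails if every attempt was too slow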
BTW, https://github.com/jruby/jruby/pull/8332/files#diff-6d95ecaf7e0384b05112f05fb26b61332b3dfd92d664a6202a34b9e48ec4ebdb seems unrelated to Thread#wakeup and seems to be about Kernel#sleep instead (the spec seems to be in the wrong file).
So far the person syncing the specs also fixed the failures, so for 2 and 3 I guess they are trivial fixes, like adding .dup before .force_encoding for 2.
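That is, the usual pattern when frozen string literals are enabled (illustrative, not the exact spec code):

# force_encoding mutates the receiver and raises FrozenError on a frozen
# literal, so dup first to get a mutable copy.
"abc".dup.force_encoding(Encoding::BINARY)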
For 3, run bin/rubocop -a, and for the rest add an exclude in .rubocop_todo.yml or disable the cop entirely in .rubocop.yml (when RuboCop gets in the way, it's better to disable the cop than fight it).
As a note https://github.com/ruby/spec/pull/1211 disables Lint/RescueException so that's no longer a concern.
Fixed issues with the core/encoding/compatible_spec.rb file.
@eregon:
anything relying on sleep is extremely brittle and will likely cause transients
It doesn't appear to be transient, though. It consistently passes on Windows and Linux and consistently fails on MacOS.
And it's hard to spec sleep without relying on sleep.
seems unrelated to Thread#wakeup and seems about Kernel#sleep instead
Agreed, I will move it.
If it passes reliably on Linux you could guard it
I will do this for now and continue investigating the CI MacOS failures separately.
So far the person syncing the specs also fixed the failures
I think it would be better going forward if the person that introduced a breaking change corrects that change.
@andrykonchin Thank you for correcting those issues!
It doesn't appear to be transient, though. It consistently passes on Windows and Linux and consistently fails on MacOS.
As I suspected, it fails in TruffleRuby's CI on Linux, both on CRuby and TruffleRuby. I think we should just remove the spec if it's not made more stable (e.g. with the idea in https://github.com/ruby/spec/pull/1210#issuecomment-2459184776). I'll quarantine it now in truffleruby's copy.
And it has also caused failures in CRuby (https://github.com/ruby/ruby/commit/7c893c8d6bbd1c84b3081d95aac3f8d69f8c20fb). So, yeah, any transient spec relying on sleep timing will be removed or quarantined; in my long-term experience with those, it's almost guaranteed to fail. If it fails on one platform due to timing, it's very likely to fail on other platforms too.
Have you considered the possibility that it might actually just be broken in CRuby and TruffleRuby? It has been passing in JRuby's CI consistently for several weeks.
Regardless, it is behavior that was important enough for @ioquatix to file issues with both TruffleRuby and JRuby. Perhaps he can weigh in on how to verify the behavior in a more reliable way.
I should also point out that in the PR where I am investigating the failure, CRuby isn't just off a little bit; it's off by an order of magnitude (8-35x from expected). Something is causing a 0.1ms sleep to take way longer on Darwin. This is not just an inconvenience for testing; I think it exposes a real problem.
man nanosleep on Linux makes it pretty clear there are no guarantees (e.g. "nanosleep() suspends the execution of the calling thread until either at least the time specified in *req has elapsed, or ..."), and on a busy system sleep calls take longer than what's asked.
Also I guess JRuby uses java.lang.Thread#sleep(long, int) like TruffleRuby does.
Anyway, feel free to try to improve this spec or check CRuby's sleep implementation. The issue is that adding a transiently failing spec causes havoc everywhere.
And I'm saying it wasn't transiently failing when I merged it over. It was consistently passing on Linux and Windows and consistently failing on Darwin by a wide margin. I went with your suggestion to leave it enabled only on the passing platforms and opened a draft PR to investigate further.
I don't care if it is quarantined for now but I believe it is wrong to dismiss the failures as being due solely to unreliable sleep times. Being off by an order of magnitude is not just unreliable, it's broken.
IIRC, the issue was that sleep(x) for some values of x < 1ms (again, IIRC) was the same as sleep(0) on JRuby and TruffleRuby due to the internal implementation truncating values. If used in a loop, this behaviour can cause the loop to execute far more rapidly than expected. I think the expectation is that sleep(x) sleeps for >= a wall clock time of x.
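For illustration, the truncation effect would look roughly like this (hypothetical arithmetic, not the actual JRuby/TruffleRuby code):

# If the backend converts the argument to whole milliseconds, a
# sub-millisecond sleep rounds down to zero and returns immediately.
seconds = 0.0001
millis  = (seconds * 1000).to_i   # => 0, so the underlying sleep(0) is a no-op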
If your sleep(x) is sleeping for less than x, or consistently more than 10x, I think the implementation may have issues. It's true that an over-subscribed system may put a chunky factor on sleep(x), so it's not good to rely on sleep(x) <= f*x where f is some factor like 2 or 3. In practice, as has been discussed and shown by tests time and time again, it's extremely likely to be problematic (i.e. "flaky tests").
(actual_duration > 0.01).should == true # 100 * 0.0001 => 0.01
(actual_duration < 1).should == true
The first assertion seems totally reasonable to me, although technically it could be >= if we are pedantic.
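E.g. with mspec's operator matchers that could read something like:

actual_duration.should >= 0.01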
The second assertion is frustratingly problematic. On an oversubscribed system, sleep could be out by a large factor (10-100x), especially on GitHub runners, and especially the macOS runners, which seem to be far more over-subscribed in practice, leading to a lot of scheduling issues.
The only reasonable way I've found to deal with this is to try to detect how over-subscribed the system is and compensate for it when dealing with these kinds of tests. The other option is to simply remove assertions on upper bounds. Trying to add a multiplier of 100x is still bound to be flaky in some edge cases (bell curve etc.).
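A rough sketch of that calibration idea (untested; all thresholds here are placeholders):

# Measure how long a known sleep takes right now to estimate how loaded the
# machine is, then scale the upper bound by that factor.
t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
sleep 0.01
load_factor = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0) / 0.01

t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
100.times { sleep 0.0001 }
actual_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0

actual_duration.should >= 0.01               # the lower bound stays fixed
actual_duration.should < 0.1 * load_factor   # scale the upper bound by measured load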
As a counterpoint, on a normal under-subscribed system (not overloaded), sleep(x) should be within some reasonable factor of x, maybe at most 10%. It's also true that there are sometimes bugs in the implementation that cause that factor to be large (invalid scheduling, problematic implementation, etc.). Therefore, this kind of test will catch those kinds of problems. So I do believe that there is value in such tests; it's just extremely hard to express them without being flaky.
Bear in mind that adding a 100x factor may also hide bugs... In other words, the assertion, when run on an under-subscribed system was better for detecting issues. It's an extremely tricky issue. Maybe we need to differentiate between "hard failure" (assertion) and some kind of informational "something looks wrong".
(Sorry, I'm not sure this really helps much; I've been here multiple times with async's scheduler and I understand the pain and frustration.)
@ioquatix thank you for the input! I agree that the excessively long sleep times on MRI seem like a problem to be investigated. For now, I have simply removed the < 0.03 check, since the purpose of this spec and fix was to check that sleeps were long enough, not whether they are too long.
I won't have time to look further into why MRI failed the 0.03 check but I think somebody should.
@ioquatix that makes a lot of sense and matches my experiences as well. @headius removed the upper bound in https://github.com/ruby/spec/pull/1212/files so that should no longer fail transiently based on how oversubscribed the system is.
@headius Did you get any conflicts when rebasing the specs? Anything non-trivial? Sometimes they are not so easy to resolve, and typically one needs to look at the changes to that file on the two branches to see what each commit meant to do.
the excessively long sleep times on MRI seem like they're a problem to be investigated
Is this when running locally, e.g. on an under-subscribed system, or when running in CI, where we can expect it to be "problematic"?
FWIW, in https://github.com/oracle/truffleruby/issues/2716#issuecomment-1235607492 I made some measurements. It looks like sub-millisecond sleeps are very imprecise on all Ruby implementations, most likely because the OS doesn't give many guarantees there.
Did you get some conflicts when rebasing specs? Anything non-trivial?
@eregon There were a few small conflicts, I think both TruffleRuby and CRuby added the same specs. Nothing that was difficult to resolve.
Is this when running locally, e.g. under-subscribed system, or when running in CI, where we can expect it to be "problematic"?
@ioquatix Locally for me on macOS, and in CI on Linux and Windows, it passed consistently. CI on macOS consistently failed, with 100 sleeps of 0.1ms taking anywhere from 80ms to 350ms.
Apparently TruffleRuby also had issues.
JRuby passes it consistently in CI on all platforms we test (mostly Linux).
It looks like sub-millisecond sleeps are very imprecise
Perhaps I'm looking at it wrong, but it seems like only TruffleRuby is having difficulties there. JRuby, now fixed, always has a quantum in the range of 0.003 to 0.005 on my M1 Mac.
It's also hard to say it's the sleep that's imprecise in your testing. If it warms up does it stabilize for TR?
CRuby is always around 0.003 except when it GCs (looping over the example 10k times).
It's the same for all Rubies; in theory they should all print 0.10... but they don't (this is on a pretty idle Linux desktop):
$ ruby -v -rbenchmark -e '3.times { p Benchmark.realtime { 1000.times { sleep 0.0001 } } }'
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
0.15385151699956623
0.1538250559988228
0.1538307459995849
$ ruby -v -rbenchmark -e '3.times { p Benchmark.realtime { 1000.times { sleep 0.0001 } } }'
jruby 9.4.9.0-SNAPSHOT (3.1.4) 2024-11-04 746b84aa02 OpenJDK 64-Bit Server VM 17.0.12+7 on 17.0.12+7 +jit [x86_64-linux]
0.16409351599941147
0.15885490400069102
0.15931887199985795
$ ruby -v -rbenchmark -e '3.times { p Benchmark.realtime { 1000.times { sleep 0.0001 } } }'
truffleruby 24.1.1, like ruby 3.2.4, Oracle GraalVM Native [x86_64-linux]
0.16984848700121802
0.16197076400021615
0.16043854000054125
$ ruby -v -rbenchmark -e '3.times { p Benchmark.realtime { 1000.times { sleep 0.0001 } } }'
truffleruby 24.1.1, like ruby 3.2.4, Oracle GraalVM JVM [x86_64-linux]
0.16442835199995898
0.16147058500064304
0.15863075099878188
Using strace -T (plus --follow-forks on JRuby), they all do:
futex(0x822210, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000016>
futex(0x823260, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=11789, tv_nsec=488181439}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out) <0.000148>
futex(0x822210, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000015>
futex(0x823260, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=11789, tv_nsec=488404312}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out) <0.000148>
futex(0x822210, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000015>
futex(0x823260, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=11789, tv_nsec=488626766}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out) <0.000149>
futex(0x822210, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000016>
futex(0x823260, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=11789, tv_nsec=488848929}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out) <0.000149>
Notice how each sleep is too long by about 50% (0.00005s).
My take on this is that we are not using real-time operating systems, so sleep() sleeps for at least as long as requested, with pretty much no guaranteed upper bound under load or when the sleep amount is small (e.g. below 1ms).
Nice analysis! It's great to see the original problem I had (sleep(small x) = no-op) gone and all platforms behaving very similarly.
There are three kinds of failures here that were not caught by the script:
failures under the --enable-frozen-string-literal flag (hundreds of these, all failing the same way).