ruby / spec

The Ruby Spec Suite aka ruby/spec
MIT License
600 stars 389 forks source link

Spec update 20241105 #1210

Closed headius closed 2 weeks ago

headius commented 2 weeks ago

There are three kinds of failures here that were not caught by the script:

  1. Thread#wakeup spec added by JRuby in https://github.com/jruby/jruby/pull/8332. Passes for me locally on macOS, fails on CI macOS... not sure how to reconcile:
    1)
    Thread#wakeup sleeps with nanosecond precision FAILED
    Expected false == true
    to be truthy but was false
    /Users/runner/work/spec/spec/core/thread/wakeup_spec.rb:17:in `block (2 levels) in <top (required)>'
    /Users/runner/work/spec/spec/core/thread/wakeup_spec.rb:5:in `<top (required)>'
  2. Encoding compatibility matrix specs added by @andrykonchin from TruffleRuby are not compatible with the --enable-frozen-string-literal flag (hundreds of these all fail the same way).
    1)
    Encoding.compatible? String, String compatibility matrix returns US-ASCII for "" in US-ASCII and "" in US-ASCII ERROR
    FrozenError: can't modify frozen String: ""
    /home/runner/work/spec/spec/core/encoding/compatible_spec.rb:534:in `force_encoding'
    /home/runner/work/spec/spec/core/encoding/compatible_spec.rb:534:in `block (4 levels) in <top (required)>'
    /home/runner/work/spec/spec/core/encoding/compatible_spec.rb:7:in `<top (required)>'
  3. Rubocop failures from specs added by @andrykonchin from TruffleRuby:
    core/encoding/compatible_spec.rb:171:1: C: [Correctable] Style/BlockComments: Do not use block comments.
    =begin ...
    ^^^^^^
    core/module/name_spec.rb:151:1: C: [Correctable] Layout/TrailingWhitespace: Trailing whitespace detected.
    core/module/name_spec.rb:155:1: C: [Correctable] Layout/TrailingWhitespace: Trailing whitespace detected.
    optional/capi/kernel_spec.rb:94:9: W: Lint/RescueException: Avoid rescuing the Exception class. Perhaps you meant to 
    rescue StandardError?
           rescue Exception ...
           ^^^^^^^^^^^^^^^^
headius commented 2 weeks ago

@eregon @andrykonchin The specs added by TruffleRuby will need to be fixed. What should we do about the Thread#wakeup spec that passes for me locally on macOS but fails on CI macOS?

eregon commented 2 weeks ago

Re 1, anything relying on sleep is extremely brittle and will likely cause transients. So if there is a way to write it without sleep that may be fine, but otherwise such a transient spec cannot be added to ruby/spec (as it would fail in CI too often). If it passes reliably on Linux you could guard it with platform_is :linux do.

One idea to make it more reliable would be to measure how long sleep(0.001) takes, if it takes less than 1ms then stop and good and otherwise retry a number of times.

BTW, https://github.com/jruby/jruby/pull/8332/files#diff-6d95ecaf7e0384b05112f05fb26b61332b3dfd92d664a6202a34b9e48ec4ebdb seems unrelated to Thread#wakeup and seems about Kernel#sleep instead (the spec seems in the wrong file).

eregon commented 2 weeks ago

So far the person syncing the specs also fixed the failures, so for 2 and 3 I guess they are trivial fixes like adding .dup before .force_encoding for 2. For 3, run bin/rubocop -a and for the rest add an exclude in .rubocop_todo.yml or disable the cop entirely in .rubocop.yml (when RuboCop gets in the way better disable the cop than fight it).

eregon commented 2 weeks ago

As a note https://github.com/ruby/spec/pull/1211 disables Lint/RescueException so that's no longer a concern.

andrykonchin commented 2 weeks ago

Fixed issues with core/encoding/compatible_spec.rb file.

headius commented 2 weeks ago

@eregon:

anything relying on sleep is extremely brittle and will likely cause transients

It doesn't appear to be transient, though. It consistently passes on Windows and Linux and consistently fails on MacOS.

And it's hard to spec sleep without relying on sleep.

seems unrelated to Thread#wakeup and seems about Kernel#sleep instead

Agreed, I will move it.

If it passes reliably on Linux you could guard it

I will do this for now and continue investigating the CI MacOS failures separately.

So far the person syncing the specs also fixed the failures

I think it would be better going forward if the person that introduced a breaking change corrects that change.

@andrykonchin Thank you for correcting those issues!

eregon commented 2 weeks ago

It doesn't appear to be transient, though. It consistently passes on Windows and Linux and consistently fails on MacOS.

As I suspected, it fails in TruffleRuby's CI on Linux, both on CRuby and TruffleRuby. I think we should just remove the spec if it's not made more stable (e.g. with the idea in https://github.com/ruby/spec/pull/1210#issuecomment-2459184776). I'll quarantine it now in truffleruby's copy.

eregon commented 2 weeks ago

And it also has caused failures in CRuby: https://github.com/ruby/ruby/commit/7c893c8d6bbd1c84b3081d95aac3f8d69f8c20fb So, yeah, any transient spec relying on sleep timing will be removed or quarantined, it's almost guaranteed to fail in my long-term experience with those. If it fails on one platform due to timing it's very likely to fail on other platforms too.

headius commented 2 weeks ago

Have you considered the possibility that it might actually just be broken in CRuby and TruffleRuby? It has been passing in JRuby's CI consistently for several weeks.

Regardless, it is behavior that was important enough for @ioquatix to file issues with both TruffleRuby and JRuby. Perhaps he can weigh in on how to verify the behavior in a more reliable way.

headius commented 2 weeks ago

I should also point out that in the pr where I am investigating the failure, CRuby isn't just off a little bit. It's off by an order of magnitude (8-35x off from expected). Something is causing a 0.1ms sleep to take way longer on Darwin. This is not just an inconvenience for testing. I think it exposes a real problem.

eregon commented 2 weeks ago

man nanosleep on Linux makes it pretty clear there are no guarantees (e.g. nanosleep() suspends the execution of the calling thread until either at least the time specified in *req has elapsed, or ...) and on a busy system sleep calls take longer than what's asked. Also I guess JRuby uses java.lang.Thread#sleep(long, int) like TruffleRuby does.

Anyway, feel free to try to improve this spec or check CRuby's sleep impl. The issue is adding a transiently failing spec is causing havoc everywhere.

headius commented 2 weeks ago

And I'm saying it wasn't transiently failing when I merged it over. It was consistently passing on Linux and Windows and consistently failing on Darwin by a wide margin. I went with your suggestion to leave it enabled only on the passing platforms and opened a draft PR to investigate further.

I don't care if it is quarantined for now but I believe it is wrong to dismiss the failures as being due solely to unreliable sleep times. Being off by an order of magnitude is not just unreliable, it's broken.

ioquatix commented 2 weeks ago

IIRC, the issue was sleep(x) for some values of x < 1ms (again, IIRC), was the same as sleep(0) on JRuby and TruffleRuby due to the internal implementation truncating values. If used in a loop, this behaviour can cause the loop to execute far more rapidly than expected. I think the expectation is that sleep(x) sleeps for >= a wall clock time of x.

If your sleep(x) is sleeping for less than x, or consistently more than 10x, I think the implementation may have issues. It's true that a over-subscribed system may put a chunky factor on sleep(x), so it's not good to rely in sleep(x) <= fx where f is some factor like 2 or 3. In practice, as has been discussed and shown by tests time and time again, it's extremely likely to be problematic (i.e. "flaky tests")

      (actual_duration > 0.01).should == true # 100 * 0.0001 => 0.01
      (actual_duration < 1).should == true

The first assertion seems totally reasonable to me, although technically it could be >= if we are pedantic.

The second assertion is frustratingly problematic. On an oversubscribed system, sleep could be out by a large factor 10-100x, especially on GitHub runners, and especially the macOS runners, which seem to be far more over-subscribed in practice, leading to a lot of scheduling issues.

The only reasonable way I've found to deal with this is to try and detect how over-subscribed the system is and try to compensate for it, when dealing with these kinds of tests. The other option is to simply remove assertions on upper bounds. Trying to add a multiplier of 100x is still bound to be flaky on some edge cases (bell curve etc).

As a counter point, on a normal under-subscribed system (not overloaded), sleep(x) should be within some reasonable factor of x, maybe at most 10%. It's also true, there are sometimes bugs in the implementation that cause that factor to be large (invalid scheduling, problematic implementation, etc). Therefore, this kind of test will catch those kinds of problems. So I do believe that there is value in such tests, just that it's extremely hard to express without being flaky.

Bear in mind that adding a 100x factor may also hide bugs... In other words, the assertion, when run on an under-subscribed system was better for detecting issues. It's an extremely tricky issue. Maybe we need to differentiate between "hard failure" (assertion) and some kind of informational "something looks wrong".

(Sorry, I'm not sure this really helps much, I've been here multiple times with async's scheduler and I understand the pain and frustration).

headius commented 2 weeks ago

@ioquatix thank you for the input! I agree that the excessively long sleep times on MRI seem like they're a problem to be investigated. For now, I have simply removed the < 0.03 check since the purpose of this spec and fix was to check that sleeps were long enough, not whether they are too long.

I won't have time to look further into why MRI failed the 0.03 check but I think somebody should.

eregon commented 2 weeks ago

@ioquatix that makes a lot of sense and matches my experiences as well. @headius removed the upper bound in https://github.com/ruby/spec/pull/1212/files so that should no longer fail transiently based on how oversubscribed is the system.


@headius Did you get some conflicts when rebasing specs? Anything non-trivial? Sometimes it can be not so easy to resolve and typically one needs to look at the changes to that file on 2 branches to see what each commit meant to do.

ioquatix commented 2 weeks ago

the excessively long sleep times on MRI seem like they're a problem to be investigated

Is this when running locally, e.g. under-subscribed system, or when running in CI, where we can expect it to be "problematic"?

eregon commented 2 weeks ago

FWIW in https://github.com/oracle/truffleruby/issues/2716#issuecomment-1235607492 I made some measurements. It looks like sub-millisecond sleeps are very imprecise, on all Ruby implementations, most likely because the OS doesn't give much guarantees there.

headius commented 2 weeks ago

Did you get some conflicts when rebasing specs? Anything non-trivial?

@eregon There were a few small conflicts, I think both TruffleRuby and CRuby added the same specs. Nothing that was difficult to resolve.

Is this when running locally, e.g. under-subscribed system, or when running in CI, where we can expect it to be "problematic"?

@ioquatix Locally for me on MacOS and in CI on Linux and Windows it passed consistently. CI on MacOS consistently failed, with 100 sleeps of 0.1ms taking anywhere from 80ms to 350ms.

Apparently TruffleRuby also had issues.

JRuby passes it consistently in CI on all platforms we test (mostly Linux).

headius commented 2 weeks ago

It looks like sub-millisecond sleeps are very imprecise

Perhaps I'm looking at it wrong, but it seems like only TruffleRuby is having difficulties there. JRuby, now fixed, always has a quantum under in the range of 0.003 to 0.005 on my M1 Mac.

It's also hard to say it's the sleep that's imprecise in your testing. If it warms up does it stabilize for TR?

CRuby is always around 0.003 except when it GCs (looping over the example 10k times).

eregon commented 2 weeks ago

It's the same for all Rubies, in theory they should all print 0.10... but they don't (this is on a pretty idle Linux desktop):

$ ruby -v -rbenchmark -e '3.times { p Benchmark.realtime { 1000.times { sleep 0.0001 } } }'
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
0.15385151699956623
0.1538250559988228
0.1538307459995849

$ ruby -v -rbenchmark -e '3.times { p Benchmark.realtime { 1000.times { sleep 0.0001 } } }'
jruby 9.4.9.0-SNAPSHOT (3.1.4) 2024-11-04 746b84aa02 OpenJDK 64-Bit Server VM 17.0.12+7 on 17.0.12+7 +jit [x86_64-linux]
0.16409351599941147
0.15885490400069102
0.15931887199985795

$ ruby -v -rbenchmark -e '3.times { p Benchmark.realtime { 1000.times { sleep 0.0001 } } }'
truffleruby 24.1.1, like ruby 3.2.4, Oracle GraalVM Native [x86_64-linux]
0.16984848700121802
0.16197076400021615
0.16043854000054125

$ ruby -v -rbenchmark -e '3.times { p Benchmark.realtime { 1000.times { sleep 0.0001 } } }'
truffleruby 24.1.1, like ruby 3.2.4, Oracle GraalVM JVM [x86_64-linux]
0.16442835199995898
0.16147058500064304
0.15863075099878188
eregon commented 2 weeks ago

Using strace -T (+ --follow-forks on JRuby), they all do:

futex(0x822210, FUTEX_WAKE_PRIVATE, 1)  = 0 <0.000016>
futex(0x823260, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=11789, tv_nsec=488181439}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out) <0.000148>
futex(0x822210, FUTEX_WAKE_PRIVATE, 1)  = 0 <0.000015>
futex(0x823260, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=11789, tv_nsec=488404312}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out) <0.000148>
futex(0x822210, FUTEX_WAKE_PRIVATE, 1)  = 0 <0.000015>
futex(0x823260, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=11789, tv_nsec=488626766}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out) <0.000149>
futex(0x822210, FUTEX_WAKE_PRIVATE, 1)  = 0 <0.000016>
futex(0x823260, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=11789, tv_nsec=488848929}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out) <0.000149>

Notice how each sleep is too long by about 50% / 0.00005s.

eregon commented 2 weeks ago

My take on this is we are not using real-time operating systems, so sleep() is at least as much as passed, with pretty much no guaranteed upper bound under load or when the sleep amount is small (e.g. below 1ms).

ioquatix commented 2 weeks ago

Nice analysis! It's great to see the original problem I had (sleep(small x) = no op) gone and all platforms behaving very similarly.