Open hsbt opened 4 months ago
@hsbt That's strange... is it only that one test? How easy is it to reproduce? Is it failing ~5% of the time or 95% or? I haven't noticed it failing in the CI workflow. I'm only running on Linux, and haven't seen it myself. I wonder if it's unique to Apple Silicon or a specific version of OpenSSL or something else that's different from macos-latest
in GH actions.
I can see one thing that's unique about that test. For all of the tests that create an SSL connection:
starttls_test
imaps_test
run_fake_server_in_thread
(or with_fake_server
, which delegates to run_fake_server_in_thread
)This test is one of the four that uses imaps_test
. And of those four, this is one of only two that expects a successful connection. And this is the only one that verifies the server's cert. So, if this is the only test that's flaky, the problem is probably in imaps_test
.
At first, I couldn't see anything significant between how the server is configured for imaps_test
vs any of the others. We're using the exact same SSL fixture files (ca_path
, key
, and cert
) in all three places, and we seem to be creating the SSLContext identically. But, after looking more closely, I see that imaps_test
is using OpenSSL::SSL::SSLServer.new(tcp_server, ssl_ctx)
and the other two are using OpenSSL::SSL::SSLSocket.new(tcp_socket, ssl_ctx)
.
Looking at SSLServer
, it seems like there's no difference between what it's doing and what FakeServer is doing with SSLSocket
. The only difference I can discern is that SSLServer#initialize
contains the following:
unless ctx.session_id_context
# see #6137 - session id may not exceed 32 bytes
prng = ::Random.new($0.hash)
session_id = prng.bytes(16).unpack1('H*')
@ctx.session_id_context = session_id
end
I have no idea why session_id_context
would matter for this (it probably doesn't), but it's the only difference I could see.
We could experiment with simply converting imaps_test
to use SSLSocket
directly, or even convert those four tests over to run_fake_server_in_thread
. But I'm still baffled and curious about why this would make a difference.
Should we ask about this over at https://github.com/ruby/openssl/issues?
I ran make test-bundled-gems
on ruby/ruby ~10 times with no luck. I'm also running on Linux.
<#<OpenSSL::SSL::SSLError: SSL_read: tlsv1 alert unknown ca (SSL alert number 48)>>
This OpenSSL::SSL::SSLError
on the client side seems weird. The error message says that it received the "unknown_ca" alert. It is usually sent when the server doesn't trust the client certificate, but I don't see any test cases in net-imap that use a client certificate.
session_id_context
I believe it's irrelevant. The seesion ID context is only used for TLS session resumption, which net-imap doesn't use.
That's strange... is it only that one test? How easy is it to reproduce? Is it failing ~5% of the time or 95% or?
I see this fail once every day.
I haven't noticed it failing in the CI workflow.
Yes, I also didn't see this fail our GitHub Actions. It may be caused by my development environment.
I got the following failure today.
Failure: test: Convert deprecated usessl (= true) and verify (= false), with warning(DeprecatedClientOptionsTest::InitializeTests):
exceptions on 1 threads:
#<Thread:0x00000001450976a8 /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/test/net/imap/fake_server/test_helper.rb:10 dead>:
/Users/hsbt/Documents/github.com/ruby/ruby/lib/timeout.rb:40:in 'Timeout::Error.handle_timeout': execution expired (Timeout::Error)
from /Users/hsbt/Documents/github.com/ruby/ruby/lib/timeout.rb:192:in 'Timeout.timeout'
from /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/test/net/imap/fake_server.rb:73:in 'Net::IMAP::FakeServer#run'
from /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/test/net/imap/fake_server/test_helper.rb:11:in 'block (2 levels) in Net::IMAP::FakeServer::TestHelper#run_fake_server_in_thread'
/Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/test/net/imap/fake_server.rb:74:in 'TCPServer#accept': execution expired (Timeout::ExitException)
from /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/test/net/imap/fake_server.rb:74:in 'block in Net::IMAP::FakeServer#run'
from /Users/hsbt/Documents/github.com/ruby/ruby/lib/timeout.rb:183:in 'block in Timeout.timeout'
from /Users/hsbt/Documents/github.com/ruby/ruby/lib/timeout.rb:38:in 'Timeout::Error.handle_timeout'
from /Users/hsbt/Documents/github.com/ruby/ruby/lib/timeout.rb:192:in 'Timeout.timeout'
from /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/test/net/imap/fake_server.rb:73:in 'Net::IMAP::FakeServer#run'
from /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/test/net/imap/fake_server/test_helper.rb:11:in 'block (2 levels) in Net::IMAP::FakeServer::TestHelper#run_fake_server_in_thread'
@hsbt 😦 This is still very confusing, and I still don't know if I've ever seen it. Is this still flaky, or does it happen every time?
The failure you just posted is a different error on the surface. Also, this is the server thread's timeout, and it would be very useful to know where the client thread failed, too. The server's timeout is probably caused by whatever happened in the client thread. On the other hand, run_fake_server_in_thread
calls server&.shutdown
from a different thread, and there's no mutex inside FakeServer#shutdown
, so maybe we're triggering a race condition there?
But it seems reasonable to guess that it has the same root cause. And both failures you've posted are in tests that use SSL connections. And, unlike the first failure you reported, this time it's coming from run_fake_server_in_thread
. So, assuming its the same root cause, we can rule out any strange differences between the different SSL server socket code.
It seems like it must somehow be related to your local openssl library configuration or version, but I'm not sure how to diagnose that. To start, maybe you can post your versions?
%i[VERSION OPENSSL_LIBRARY_VERSION OPENSSL_VERSION].to_h { [_1, OpenSSL.const_get(_1)] }
I assume that you have not had this sort of issue with the openssl
gem's tests. I'd guess that they are more careful about constraining the OpenSSL environment and config. So perhaps I'll take a look at those and maybe we can update net-imap
's tests to behave similarly? But if there's some issue there, e.g in how we've set up our test CA and cert or , I don't know why it would make flaky tests rather than just always failing.
I sometimes got the following failure with
ruby/ruby
andmake test-bundled-gems
.My environment is macOS 14.5 with Apple Silicon and using OpenSSL 3 library provided from homebrew.