Closed headius closed 10 years ago
See also jruby/jruby#1080
I believe this issue is fixed by rubygems/rubygems-aws@4e0ee9e74b4b0f9c57b57930b44bd64ad288c22e! Please report any new failures you witness, as they are likely to be a different problem than this one.
My $0.02: increasing the timeout is only a workaround on the rubygems-aws side when fetching from the S3 bucket. Fetching 'https://rubygems.org/gems/dicks-0.03.gem' gives a 302 to 'https://s3.amazonaws.com/production.s3.rubygems.org/gems/dicks-0.03.gem', and if you run the same reproduction code against the S3 bucket directly, the issue is still reproducible. This handshake interruption looks to be in the SSL implementation itself (not sure yet whether client- and/or server-side).
As far as I can tell, this is S3's DDoS protection kicking in, not an SSL failure:
Errno::ECONNREFUSED: Connection refused - Connection refused
initialize at org/jruby/ext/socket/RubyTCPSocket.java:126
@vbatts Can you elaborate on "if you run the same reproduction code against the direct s3 bucket, the issue is still reproducible"? I have not been able to reproduce this issue fetching directly from S3...only fetching from RG.org.
I'll do some testing today to see if we still see issues fetching gems.
I did some testing again today. The fetches from rg.org fail very quickly with my above reproduction (Ubuntu Linux, Oracle JDK, JRuby 1.7.5) and the fetches from S3 do not fail at all (or at least run longer than I was willing to wait without failure).
That is so strange. When I tested on my local machine, I would consistently get an ECONNREFUSED
after a minute or two if I was aimed at S3. If I point it directly at rubygems.org, it works for as long as I have been willing to let it run so far.
(Just for the record, I have been running it for almost 20 minutes straight, pointed at rubygems.org, without any errors at all. I'm going to give up now and give the poor people at this coworking space their bandwidth back, but if there's another way to reproduce it, I'm very interested.)
It may only happen with the Oracle JDK (not OpenJDK) because it installs permissions for higher crypto strength. I'm going to test that theory now, but in general I've had best luck reproducing with Oracle JDK.
I was testing with recent openjdk7. I will have to run my tests again against s3, but I did reproduce it there.
I found this issue is still valid while using JRuby 1.7.5 and RubyGems via SSL (https).
This issue still persists as of today. Why is this issue closed?
Confirmed occurring today on JRuby 1.7.5 + Oracle JVM 1.7.0_45-b18 (on OSX)
The particular thing causing this exact issue was discovered and (we think) resolved. If you are still experiencing issues, please open a new ticket with reproduction steps.
The particular thing causing this exact issue was not resolved and the command from the OP still works to reproduce this error.
I ran this for more than 20 minutes and I'm unable to reproduce.
We have noticed that it seems to occur more with Oracle JDK instead of plain OpenJDK but have no strong evidence. I will try to reproduce again myself.
When I tried to reproduce before the change, I saw this error in less than 5 minutes every time. After the fix, I ran the command for over 40 minutes with no errors for the first time. That is what makes me think that there may be something new or different going on with the remaining issues.
I hit this pretty frequently with day to day gem installation on JRuby. http://build.logstash.net/view/logstash/job/logstash-daily/18/console is one such example.
@jordansissel The error in your log file ("Gem::RemoteFetcher::FetchError: Received fatal alert: bad_record_mac") is not the error that I was able to reproduce using @headius' example script before we made the configuration change referenced in this ticket. It is also not an error I am able to reproduce. Why do you believe it is the issue discussed in this ticket? Do you have any other information about it, or steps for reproducing it?
@indirect Apologies for the confusion. I mostly found this ticket while googling for the error message. As for why I believe my issue is related:
How my issue is discussed (if at all, at your discretion), is not important to me. I am happy if you wish to close this and have my issue discussed in a separate ticket. In the meantime, I am doing a few things to try and work around this until a proper resolution can be identified.
Further to my comment above, using @headius' sample code to reproduce this issue, I was able to do about 200 requests before it failed with: "OpenSSL::SSL::SSLError: Received fatal alert: bad_record_mac" (JRuby 1.7.8)
Blah. So either the attempted fix has regressed, or your network is able to surface the problem while mine is not. Thanks for the additional information!
I will attach pcaps for two tcp/ssl sessions as well; one successful, one failing.
Cherrypicked tcp sessions from a tcpdump using wireshark:
There are also Java properties you can pass (-J-D when passed through JRuby) that dump out debug info for the connection. I'd do it myself but I'm in flight right now on a slow connection.
http://docs.oracle.com/javase/6/docs/technotes/guides/security/jsse/JSSERefGuide.html#Debug
I modified @headius' original code to be single-threaded, to make it easier to read the output with debugging on, using jruby -J-Djavax.net.debug=ssl,ssl:handshake:verbose
http://semicomplete.com/files/rubygems615/java-ssl-debug.txt (22mb)
And a trimmed-down version of the full file above, with what I believe is the full debug output for the last (and final, failing) connection attempt: http://semicomplete.com/files/rubygems615/java-ssl-debug-last-request.txt (68kb)
FWIW, the threading in the original code was just to speed up the process of reproducing the error. A single-threaded version fails just the same, but takes longer.
After a bit more searching on the internet, it sounds like other projects that have encountered this bad_record_mac attribute it to SSLv2 being the default, and that using SSLv3 is the fix. open-uri uses OpenSSL::SSL::SSLContext for https:// connections, and I only see a setter for ssl_version, so I am not certain what version it defaults to. Using the following code to handle redirects and force SSLv3, I have let this run for an extended time without receiving the error. @jordansissel, could you run this from the system you are able to reproduce from and see whether setting SSLv3 makes any difference?
require 'net/https'

def fetch(uri_str, limit = 10)
  # You should choose a better exception.
  raise ArgumentError, 'too many HTTP redirects' if limit == 0

  uri = URI.parse(uri_str)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.ssl_version = 'SSLv3'
  http.verify_mode = OpenSSL::SSL::VERIFY_NONE
  request = Net::HTTP::Get.new(uri.request_uri)
  response = http.request(request)

  case response
  when Net::HTTPSuccess then
    response
  when Net::HTTPRedirection then
    location = response['location']
    # warn "redirected to #{location}"
    fetch(location, limit - 1)
  else
    response.value
  end
end

Thread.abort_on_exception = true
10.times {
  Thread.new {
    loop {
      puts fetch('https://rubygems.org/gems/dicks-0.03.gem')
    }
  }
}
sleep
@vbatts In the pcap I posted, the 'bad' handshake is initiated by the client using TLS 1.0, not SSLv2.
@vbatts but to answer your question completely, your sample code bombed out as well:
...
#<Net::HTTPOK:0xa4fb7c>
#<Net::HTTPOK:0x1b9311da>
#<Net::HTTPOK:0x54790d5a>
OpenSSL::SSL::SSLError: Received fatal alert: bad_record_mac
This error seems to encompass many possible handshake failures. The timeout issue is one; picking the wrong protocol is another. With the timeout issue fixed, it could indeed be a failure to choose SSLv3 now, but I would expect it to fail more consistently were that the case.
@headius best I can tell (from my pcap) the protocol used on both sides is TLS 1.0
I can confirm the PR mentioned above (https://github.com/logstash/logstash/pull/819) works around this problem for me. I have observed it logging failures but the failure is caught and my code simply retries a fetch to rubygems and is happy.
Bundler has also added retry code in the latest minor (currently a release candidate). I would love to figure out why we’re seeing this error so consistently (albeit rarely), though. :\
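The retry workaround described above can be sketched as follows. This is a minimal illustration, not the actual Bundler or logstash code; the method name and attempt count are made up:

```ruby
# Hypothetical sketch of the retry workaround: re-run a block a few times
# when a transient SSL failure (such as bad_record_mac) surfaces.
require 'openssl'

def with_ssl_retries(attempts = 3)
  tries = 0
  begin
    yield
  rescue OpenSSL::SSL::SSLError
    tries += 1
    retry if tries < attempts
    raise
  end
end
```

A real implementation would likely also pause between attempts rather than retrying immediately.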
I think I'm seeing a similar issue. I'm using JRuby+Fog to upload some files into an S3 bucket. As soon as a file exceeds some given size, I see the same failures. Files are generated with
dd if=/dev/zero of=./100K bs=1024 count=100
Uploading via HTTPS
haduken:debug gkuchta$ ruby --version
jruby 1.7.8 (1.9.3p392) 2013-12-02 0ce429e on Java HotSpot(TM) 64-Bit Server VM 1.7.0_17-b02 [darwin-x86_64]
haduken:debug gkuchta$ ruby foo.rb
trying to upload /Users/gkuchta/debug/data/100K... success!
trying to upload /Users/gkuchta/debug/data/1024K... Failed with: Broken pipe (IOError)
trying to upload /Users/gkuchta/debug/data/200K... Failed with: Received fatal alert: bad_record_mac (IOError)
trying to upload /Users/gkuchta/debug/data/300K... Failed with: Received fatal alert: bad_record_mac (IOError)
trying to upload /Users/gkuchta/debug/data/400K... Failed with: Received fatal alert: bad_record_mac (IOError)
trying to upload /Users/gkuchta/debug/data/500K...
^C Failed with: Bad file descriptor - Bad file descriptor
At the time of writing, an upload of any file < 128K succeeds; any file >= 128K will fail. That said, the returned exception seems to vary.
haduken:debug gkuchta$ ruby foo.rb
trying to upload /Users/gkuchta/debug/data/125K... success!
trying to upload /Users/gkuchta/debug/data/126K... success!
trying to upload /Users/gkuchta/debug/data/127K... success!
trying to upload /Users/gkuchta/debug/data/128K... Failed with: Expected(200) <=> Actual(400 Bad Request)
Well, to be more specific, any file > 1024*128-(512+64+28) bytes will fail:
haduken:debug gkuchta$ ruby foo.rb
trying to upload /Users/gkuchta/debug/data/130467bytes... success!
trying to upload /Users/gkuchta/debug/data/130468bytes... success!
trying to upload /Users/gkuchta/debug/data/130469bytes... Failed with: Expected(200) <=> Actual(400 Bad Request)
Uploading via plain HTTP, on the other hand:
haduken:debug gkuchta$ ruby foo.rb
trying to upload /Users/gkuchta/debug/data/100K... success!
trying to upload /Users/gkuchta/debug/data/1024K... success!
trying to upload /Users/gkuchta/debug/data/200K... success!
trying to upload /Users/gkuchta/debug/data/300K... success!
trying to upload /Users/gkuchta/debug/data/400K... success!
trying to upload /Users/gkuchta/debug/data/500K... success!
trying to upload /Users/gkuchta/debug/data/600K... success!
trying to upload /Users/gkuchta/debug/data/700K... success!
trying to upload /Users/gkuchta/debug/data/800K... success!
trying to upload /Users/gkuchta/debug/data/900K... success!
Both the HTTP and HTTPS cases work under MRI.
Also, from what I understand, S3's response of "Failed with: Expected(200) <=> Actual(400 Bad Request)" indicates that it was left waiting, as the amount of transferred data was less than it was expecting.
foo.rb:
require 'rubygems'
require 'fog'

connection = Fog::Storage.new({
  :provider => 'AWS',
  :aws_access_key_id => 'XXX',
  :aws_secret_access_key => 'XXX',
  :region => 'us-west-2',
  :scheme => 'https' # or 'http'
})

directory = connection.directories.new(:key => 'XXX')
files = Dir.glob("/Users/gkuchta/debug/data/*")

files.each do |file|
  print "trying to upload #{file}..."
  begin
    upload = directory.files.create(
      :key => file.split('/').last,
      :body => File.open(file),
      :public => false
    )
    print " success!\n"
  rescue Exception => e
    print " Failed with: #{e.message}\n"
  end
end
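For anyone wanting to probe the size boundary without dd, here is a small Ruby sketch that writes zero-filled files at arbitrary byte sizes (the helper name is mine, not from the script above):

```ruby
# Hypothetical helper mirroring the dd commands above: write zero-filled
# probe files at sizes bracketing the observed ~128K boundary.
require 'tmpdir'

def make_probe_files(dir, sizes)
  sizes.map do |bytes|
    path = File.join(dir, "#{bytes}bytes")
    File.binwrite(path, "\0" * bytes)
    path
  end
end
```

Feeding the resulting paths to the upload loop in foo.rb should reproduce the success/failure pattern around 130468 bytes.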
I'm seeing the same issue as @gkuchta: uploading a file to S3 over SSL fails with 'bad_record_mac' if the file is over a certain size threshold, in my case around 200K. I was scratching my head about this for quite some time until I read @gkuchta's comment and tried reducing the file size below 100K, at which point it started working without a hitch.
I can only reproduce this with JRuby, so it seems the problem may lie with JRuby itself, or even with the specific JRE version.
I'm using jruby 1.7.9 (1.9.3p392) 2013-12-06 87b108a on Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18 [darwin-x86_64].
@simonask While probably not terribly helpful, as you're on OSX, the OpenJDK doesn't seem to exhibit this behavior. FWIW, I tried a few different hotspot JVM versions, all of which failed.
@headius This issue has been affecting me extensively in JRuby 1.7.8 on Oracle JDK 7u45, on both OSX and Ubuntu 12.04. It is not specific to RubyGems; in fact it impacts SSL connections with the Keen IO Ruby API as well as the Stripe Ruby API. Some SSL web service APIs (Google Maps, for example) seem to be unaffected. It appears that #1331 could be the same thing as well; I have not yet experienced the error while uploading to AWS, but then again my upload volume is fairly low.
I've spent a lot of time debugging this, and my javax.net.debug output looks a lot like that of @jordansissel.
I am still trying to determine if this is actually a JRuby issue or is a JVM issue, but that is difficult since there is a fair amount of complexity in the jruby-openssl bridge. I have been unable to recreate it so far outside of JRuby.
My observations are:
- This issue is not specific to SSLv3, TLSv1, or TLSv1.1; I can recreate it using all of them.
- Connections from my pure Java API clients tend to reuse SSL sessions, whereas my JRuby API clients recreate everything for each request.
- My pure Java API clients use different ciphers. For example, I connect to Stripe with TLS_ECDHE_RSA_WITH_RC4_128_SHA using the Java API, but with TLS_DHE_RSA_WITH_AES_128_CBC_SHA when using the Ruby API.
- In jruby-openssl, the CipherStrings mapping class seems to disable a large number of the cipher suites available to Java in general, as they are not mapped. For example, it seems like the JRuby SSLEngine has disabled all elliptic curve ciphers. This might just be the default config, however.
I think this is perhaps a more significant issue than previously thought, but like I said it might actually be a JDK problem. Not sure yet..
Oh, one more thing: to recreate this issue, I make 1000 duplicate SSL requests in a loop, and it always fails intermittently at some point during that sequence. It's not just my local network either, as this also occurs in my production environment.
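To compare cipher suites between clients (per the observations above), the list Ruby's OpenSSL layer will offer can be dumped directly from an SSLContext. This works on MRI as well as JRuby, and the output differing between the two is rather the point:

```ruby
# Print the cipher suites the local OpenSSL/jruby-openssl layer offers,
# for comparison against what a pure-Java client negotiates.
require 'openssl'

ctx = OpenSSL::SSL::SSLContext.new
# Each entry is [name, version, bits, alg_bits]; keep just the names.
cipher_names = ctx.ciphers.map { |name, _version, _bits, _alg_bits| name }
puts cipher_names
```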
I've been hunting this bug too, @kgx. I'm not sure if the actual bug is completely client-side or server-side. I've only hit it when using JRuby SSL to rubygems.org.
I'm having trouble finding a local test case. Were you able to reproduce it using servers leveraging Keen IO Ruby API or Stripe Ruby API?
jruby -ropen-uri -e "Thread.abort_on_exception = true ; 10.times { Thread.new { loop { open('https://rubygems.org/').read; p :ok } } }; sleep"
I was able to reproduce the failing cases using an nginx-1.0.15-5.el6.x86_64 SSL server on an old CentOS 6.2 image with openssl-1.0.0-20.el6_2.5.x86_64
Since this seems like more of a jruby issue than a rubygems.org issue at the moment, i'm going to continue my comments @ https://github.com/jruby/jruby/issues/1331
@gmanfunky I agree this is more of a jruby issue. When I posted a few weeks back I mistakenly thought this issue was in the jruby repo. Continuing comments at jruby/jruby#1331 as well...
As far as I was able to determine, and as far as I'm still seeing, this is only an issue with the OracleJDK. I have had no issue with OpenJDK. Are you folks seeing that, too?
Indeed I have noticed that too. Perhaps we should lean on someone from Oracle related to the SSL subsystem?
@headius @gkuchta I have also been unable to recreate this issue using OpenJDK 7u55-2.4.7-1ubuntu1~0.12.04.2.
Thankfully changing JDKs is a viable workaround for server environments where OpenJDK is available.
I just ran across this bug via a completely different route - Braintree's payment gateway - and they said to upgrade to Oracle JDK 1.7u51 or later. We upgraded to 1.7u67 and the problem seems to have gone away. I don't know what the "bottom" version of Java was that introduced this (although I found reports of similar problems dating back to Java 5 and Java 6...).
@seancorfield Did they reference a bug number? I was able to reproduce this consistently only on certain Oracle/OpenJDK builds of Java 7. I'd like to know what they fixed, to see if we can introduce a workaround.
@headius Unfortunately not. This is the only public information I could find from them: https://status.braintreepayments.com/incidents/bd0sr4jpbt3b
I followed up with their support/tech folks and didn't get any more details :(
The only JDK bug I can find that seems related is this https://bugs.openjdk.java.net/browse/JDK-8030806 (which was closed as "Cannot Reproduce"). It mentions SSLv2 vs SSLv3 in handshaking, as well as the bad_record_mac intermittent failures. Bugs relating to that exception have been logged against a number of projects but I wasn't able to find anything useful and concrete in terms of fixes (other than upgrade the JDK and, perhaps, force the initial handshake to SSLv3?).
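For experimenting with the protocol side from the JVM rather than from Ruby, the standard https.protocols system property is one knob to try. An assumption worth stating: that property governs connections made through Java's HttpsURLConnection, so the jruby-openssl stack may well ignore it, and the script name below is purely illustrative:

```shell
# Assumption: https.protocols affects only Java's HttpsURLConnection
# protocol selection; whether jruby-openssl honors it is unverified.
jruby -J-Dhttps.protocols=SSLv3 repro.rb
```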
@seancorfield Yeah I think I've seen that JDK bug before, when investigating this issue.
I've pinged Braintree on Twitter to see if someone there might be able to get us more information.
Some time in the past few months (or weeks, unsure) we started to get reports from JRuby users that they were having intermittent failures fetching gems from RubyGems.org. The issue manifests itself as a low-level SSL exception of "bad_record_mac" or "bad record mac", usually during the first connection handshake to fetch a gem.
I have been able to reproduce this on JRuby versions as far back as JRuby 1.7.2, which was released last winter. Here's my command line:
It takes dozens or hundreds of requests to fail, but eventually fails on every version of JRuby I have tried.
If I switch to http:// it works fine.
If I upload the gem to s3 and fetch from there, it works fine.
See also #613 which may be related.
Notes from my exploration of the problem: