rubygems / rubygems.org

The Ruby community's gem hosting service.
https://rubygems.org
MIT License
2.32k stars 917 forks source link

Connect handshake intermittent failures during past few months #615

Closed headius closed 10 years ago

headius commented 11 years ago

Some time in the past few months (or weeks, unsure) we started to get reports from JRuby users that they were having intermittent failures fetching gems from RubyGems.org. The issue manifests itself as a low-level SSL exception of "bad_record_mac" or "bad record mac", usually during the first connection handshake to fetch a gem.

I have been able to reproduce this on JRuby versions as far back as JRuby 1.7.2, which was released last winter. Here's my command line:

jruby -ropen-uri -e "Thread.abort_on_exception = true ; 10.times { Thread.new { loop { open('https://rubygems.org/gems/dicks-0.03.gem').read; p :ok } } }; sleep"

It takes dozens or hundreds of requests to fail, but eventually fails on every version of JRuby I have tried.

If I switch to http:// it works fine.

If I upload the gem to s3 and fetch from there, it works fine.

See also #613 which may be related.

Notes from my exploration of the problem:

headius commented 11 years ago

See also jruby/jruby#1080

indirect commented 11 years ago

I believe this issue is fixed by rubygems/rubygems-aws@4e0ee9e74b4b0f9c57b57930b44bd64ad288c22e! Please report any new failures you witness, as they are likely to be a different problem than this one.

vbatts commented 10 years ago

My $0.02, but increasing the timeout is only going to be a work around for using that rubygem-aws when fetching from the s3 bucket. Because for testing this fetch from 'https://rubygems.org/gems/dicks-0.03.gem', which gives a 302 to 'https://s3.amazonaws.com/production.s3.rubygems.org/gems/dicks-0.03.gem', if you run the same reproduction code against the direct s3 bucket, the issue is still reproducible. This handshake interruption looks to be more in the ssl implementation (not sure yet whether client and/or server side)

indirect commented 10 years ago

As far as I can tell, this is S3's DDoS protection kicking in, not an SSL failure:

Errno::ECONNREFUSED: Connection refused - Connection refused
   initialize at org/jruby/ext/socket/RubyTCPSocket.java:126
headius commented 10 years ago

@vbatts Can you elaborate on "if you run the same reproduction code against the direct s3 bucket, the issue is still reproducible"? I have not been able to reproduce this issue fetching directly from S3...only fetching from RG.org.

I'll do some testing today to see if we still see issues fetching gems.

headius commented 10 years ago

I did some testing again today. The fetches from rg.org fail very quickly with my above reproduction (Ubuntu Linux, Oracle JDK, JRuby 1.7.5) and the fetches from S3 do not fail at all (or at least run longer than I was willing to wait without failure).

indirect commented 10 years ago

That is so strange. When I tested on my local machine, I would consistently get an ECONNREFUSED after a minute or two if I was aimed at S3. If I point it directly at rubygems.org, it works for as long as I have been willing to let it run so far.

indirect commented 10 years ago

(Just for the record, I have been running it for almost 20 minutes straight, pointed at rubygems.org, without any errors at all. I'm going to give up now and give the poor people at this coworking space their bandwidth back, but if there's another way to reproduce it, I'm very interested.)

headius commented 10 years ago

It may only happen with the Oracle JDK (not OpenJDK) because it installs permissions for higher crypto strength. I'm going to test that theory now, but in general I've had best luck reproducing with Oracle JDK.

vbatts commented 10 years ago

I was testing with recent openjdk7. I will have to run my tests again against s3, but I did reproduce it there. On Oct 28, 2013 3:48 PM, "Charles Oliver Nutter" notifications@github.com wrote:

It may only happen with the Oracle JDK (not OpenJDK) because it installs permissions for higher crypto strength. I'm going to test that theory now, but in general I've had best luck reproducing with Oracle JDK.

— Reply to this email directly or view it on GitHubhttps://github.com/rubygems/rubygems.org/issues/615#issuecomment-27248788 .

swistaczek commented 10 years ago

I found this issue is still valid while using jRuby 1.7.5 and rubygems via SSL (https).

hollow commented 10 years ago

this issue still persists as of today. why is this issue closed?

jordansissel commented 10 years ago

Confirmed occurring today on JRuby 1.7.5 + Oracle JVM 1.7.0_45-b18 (on OSX)

indirect commented 10 years ago

The particular thing causing this exact issue was discovered and (we think) resolved. If you are still experiencing issues, please open a new ticket with reproduction steps.

hollow commented 10 years ago

The particular thing causing this exact issue was not resolved and the command from the OP still works to reproduce this error.

dwradcliffe commented 10 years ago

I ran this for more than 20 minutes and I'm unable to reproduce.

headius commented 10 years ago

We have noticed that it seems to occur more with Oracle JDK instead of plain OpenJDK but have no strong evidence. I will try to reproduce again myself.

indirect commented 10 years ago

When I tried to reproduce before the change, I saw this error in less than 5 minutes every time. After the fix, I ran the command for over 40 minutes with no errors for the first time. That is what makes me think that there may be something new or different going on with the remaining issues.

jordansissel commented 10 years ago

I hit this pretty frequently with day to day gem installation on JRuby. http://build.logstash.net/view/logstash/job/logstash-daily/18/console is one such example.

indirect commented 10 years ago

@jordansissel The error in your log file ("Gem::RemoteFetcher::FetchError: Received fatal alert: bad_record_mac") is not the error that I was able to reproduce using @headius' example script before we made the configuration change referenced in this ticket. It is also not an error I am able to reproduce. Why do you believe it is the issue discussed in this ticket? Do you have any other information about it, or steps for reproducing it?

jordansissel commented 10 years ago

@indirect Apologies for the confusion. I mostly found this ticket while googling for the error message. As for why do I believe my issue is related:

How my issue is discussed (if at all, at your discretion), is not important to me. I am happy if you wish to close this and have my issue discussed in a separate ticket. In the meantime, I am doing a few things to try and work around this until a proper resolution can be identified.

jordansissel commented 10 years ago

Further to my comment above, using @headius sample code to reproduce this issue, I was able to do about 200 requests before it failed with this: "OpenSSL::SSL::SSLError: Received fatal alert: bad_record_mac" (Jruby 1.7.8)

indirect commented 10 years ago

Blah. So either the attempted fix has regressed, or your network is able to surface the problem while mine is not. Thanks for the additional information!

jordansissel commented 10 years ago

I will attach pcaps for two tcp/ssl sessions as well; one successful one failing.

jordansissel commented 10 years ago

Cherrypicked tcp sessions from a tcpdump using wireshark:

headius commented 10 years ago

There are also Java properties you can pass (-J-D when passed through JRuby) that dump out debug info for the connection. I'd do it myself but I'm in flight right now on a slow connection.

http://docs.oracle.com/javase/6/docs/technotes/guides/security/jsse/JSSERefGuide.html#Debug

jordansissel commented 10 years ago

I modified @headius original code to be single-threaded to make it easier to read the output with debugging on. Using jruby -J-Djavax.net.debug=ssl,ssl:handshake:verbose

http://semicomplete.com/files/rubygems615/java-ssl-debug.txt (22mb)

And a trimmed-down version of the full file above with what I believe is the full debug output for the last (and final, failling) connection attempt : http://semicomplete.com/files/rubygems615/java-ssl-debug-last-request.txt (68kb)

headius commented 10 years ago

FWIW, the threading in the original code was just to speed up the process of reproducing the error. A single-threaded version fails just the same, but takes longer.

vbatts commented 10 years ago

After a bit more searching on the internet, it sounds like other projects that have encountered this bad_record_mac attribute it to SSLv2 being default, and that using SSLv3 is the fix. open-uri uses OpenSSL::SSL::SSLContext for https:// connections, and I only see a setter for ssl_version, so I am not quite certain what version it is defaulting to. Using the following code to handle redirects, and force sslv3, I have let this run for an extended time and have not received the error. @jordansissel could you run this from the system you are able to reproduce from, and see whether setting SSLv3 makes any difference?

require 'net/https'

def fetch(uri_str, limit = 10) 
  # You should choose a better exception.
  raise ArgumentError, 'too many HTTP redirects' if limit == 0

  uri = URI.parse(uri_str)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.ssl_version = 'SSLv3'
  http.verify_mode = OpenSSL::SSL::VERIFY_NONE

  request = Net::HTTP::Get.new(uri.request_uri)
  response = http.request(request)

  case response
  when Net::HTTPSuccess then
    response
  when Net::HTTPRedirection then
    location = response['location']
    #warn "redirected to #{location}"
    fetch(location, limit - 1)
  else
    response.value
  end 
end

Thread.abort_on_exception = true
10.times {
  Thread.new {
    loop {
      puts fetch('https://rubygems.org/gems/dicks-0.03.gem')
    }   
  }
}
sleep
jordansissel commented 10 years ago

@vbatts In the pcap I posted, the 'bad' handshake advertises is initiated by the client using TLS 1.0, not SSL v2

jordansissel commented 10 years ago

@vbatts but to answer your question completely, your sample code bombed out as well:

...
#<Net::HTTPOK:0xa4fb7c>
#<Net::HTTPOK:0x1b9311da>
#<Net::HTTPOK:0x54790d5a>
OpenSSL::SSL::SSLError: Received fatal alert: bad_record_mac
headius commented 10 years ago

This error seems to encompass many possible failures of handshaking. The timeout issue is one, picking the wrong protocol is another. With the timeout issue fixed, it could indeed be failure to choose SSLv3 now, bit I would expect to see it fail more consistently were that the case.

jordansissel commented 10 years ago

@headius best I can tell (from my pcap) the protocol used on both sides is TLS 1.0

jordansissel commented 10 years ago

I can confirm the PR mentioned above (https://github.com/logstash/logstash/pull/819) works around this problem for me. I have observed it logging failures but the failure is caught and my code simply retries a fetch to rubygems and is happy.

indirect commented 10 years ago

Bundler has also added retry code in the latest minor (currently a release candidate). I would love to figure out why we’re seeing this error so consistently (albeit rarely), though. :\

On Dec 3, 2013, at 12:08 AM, Jordan Sissel notifications@github.com wrote:

I can confirm the PR mentioned above (logstash/logstash#819) works around this problem for me. I have observed it logging failures but the failure is caught and my code simply retries a fetch to rubygems and is happy.

— Reply to this email directly or view it on GitHub.

gkuchta commented 10 years ago

I think I'm seeing a similar issue. I'm using JRuby+Fog to upload some files into an S3 bucket. As soon as a file exceeds some given size, I see the same failures. Files are generated with

dd if=/dev/zero of=./100K bs=1024 count=100

Uploading via HTTPS

haduken:debug gkuchta$ ruby --version
jruby 1.7.8 (1.9.3p392) 2013-12-02 0ce429e on Java HotSpot(TM) 64-Bit Server VM 1.7.0_17-b02 [darwin-x86_64]

haduken:debug gkuchta$ ruby foo.rb
trying to upload /Users/gkuchta/debug/data/100K... success!
trying to upload /Users/gkuchta/debug/data/1024K... Failed with: Broken pipe (IOError)
trying to upload /Users/gkuchta/debug/data/200K... Failed with: Received fatal alert: bad_record_mac (IOError)
trying to upload /Users/gkuchta/debug/data/300K... Failed with: Received fatal alert: bad_record_mac (IOError)
trying to upload /Users/gkuchta/debug/data/400K... Failed with: Received fatal alert: bad_record_mac (IOError)
trying to upload /Users/gkuchta/debug/data/500K...

^C Failed with: Bad file descriptor - Bad file descriptor

At the time of writing, an upload of any file < 128K succeeds; any file >= 128K will fail. That said, the returned exception seems to vary.

haduken:debug gkuchta$ ruby foo.rb
trying to upload /Users/gkuchta/debug/data/125K... success!
trying to upload /Users/gkuchta/debug/data/126K... success!
trying to upload /Users/gkuchta/debug/data/127K... success!
trying to upload /Users/gkuchta/debug/data/128K... Failed with: Expected(200) <=> Actual(400 Bad Request)

Well, to be more specific, any file > 1024*128-(512+64+28) bytes will fail:

haduken:debug gkuchta$ ruby foo.rb
trying to upload /Users/gkuchta/debug/data/130467bytes... success!
trying to upload /Users/gkuchta/debug/data/130468bytes... success!
trying to upload /Users/gkuchta/debug/data/130469bytes... Failed with: Expected(200) <=> Actual(400 Bad Request)

Uploading via plain HTTP, on the other hand:

haduken:debug gkuchta$ ruby foo.rb
trying to upload /Users/gkuchta/debug/data/100K... success!
trying to upload /Users/gkuchta/debug/data/1024K... success!
trying to upload /Users/gkuchta/debug/data/200K... success!
trying to upload /Users/gkuchta/debug/data/300K... success!
trying to upload /Users/gkuchta/debug/data/400K... success!
trying to upload /Users/gkuchta/debug/data/500K... success!
trying to upload /Users/gkuchta/debug/data/600K... success!
trying to upload /Users/gkuchta/debug/data/700K... success!
trying to upload /Users/gkuchta/debug/data/800K... success!
trying to upload /Users/gkuchta/debug/data/900K... success!

Both the HTTP and HTTPS cases work under MRI.

Also, from what I understand, S3's response of "Failed with: Expected(200) <=> Actual(400 Bad Request)" indicates that it was left waiting, as the amount of transfer'd data was less than it was expecting.

foo.rb:

require 'rubygems'
require 'fog'

connection = Fog::Storage.new({
  :provider                 => 'AWS',
  :aws_access_key_id        => 'XXX',
  :aws_secret_access_key    => 'XXX',
  :region                   => 'us-west-2',
  :scheme                   => 'https' # or 'http'
})

directory = connection.directories.new(:key => 'XXX')

files = Dir.glob("/Users/gkuchta/debug/data/*")
files.each do |file|
  print "trying to upload #{file}..."
  begin
    upload = directory.files.create(
      :key    => file.split('/').last,
      :body   => File.open(file),
      :public => false
    )
    print " success!\n"
  rescue Exception => e
    print " Failed with: #{e.message}\n"
  end
end
simonask commented 10 years ago

I'm seeing the same issue as @gkuchta — uploading a file with SSL to S3 goes wrong with a 'bad_error_mac' if the file is over a certain size threshold, in my case around 200K. I was scratching my head about this for quite some time until I read @gkuchta's comment and tried reducing the file size below 100K, and it started working without a hitch.

I can only reproduce this with JRuby, so it seems that the problem may lie with them, or even the specific JRE version.

I'm using jruby 1.7.9 (1.9.3p392) 2013-12-06 87b108a on Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18 [darwin-x86_64].

gkuchta commented 10 years ago

@simonask While probably not terribly helpful, as you're on OSX, the OpenJDK doesn't seem to exhibit this behavior. FWIW, I tried a few different hotspot JVM versions, all of which failed.

kgx commented 10 years ago

@headius This issue has been affecting me extensively in JRuby 1.7.8 on Oracle JDK 7 u 45 on both OSX and Ubuntu 12.04. It is not specific to RubyGems; in fact it impacts SSL connections with the Keen IO Ruby API as well as the Stripe Ruby API. There are some SSL web service APIs (Google Maps, for example) that seem to be unaffected. It appears that #1331 could be the same thing as well, but I have not experienced the error yet while uploading to AWS, but then again my upload volume is fairly low.

I've spent a lot of time debugging this, and my javax.net.debug output looks a lot like that of @jordansissel.

I am still trying to determine if this is actually a JRuby issue or is a JVM issue, but that is difficult since there is a fair amount of complexity in the jruby-openssl bridge. I have been unable to recreate it so far outside of JRuby.

My observations are: --This issue is not specific to SSLv3 or TLSv1 or TLSv1.1, I can recreate it using all of them --Connections from my pure Java API clients tend to reuse SSL sessions, where my JRuby API clients recreate everything for each request --My pure Java API clients use different cyphers. For example I connect to Stripe with TLS_ECDHE_RSA_WITH_RC4_128_SHA using the Java API, but with TLS_DHE_RSA_WITH_AES_128_CBC_SHA when using the Ruby API. --In jruby-openssl, the CypherStrings mapping class seems to disable a large number of the cypher suites available to Java in general as they are not mapped out. For example, it seems like the JRuby SSLEngine has disabled all elliptical curve cyphers. This might just be the default config however..

I think this is perhaps a more significant issue than previously thought, but like I said it might actually be a JDK problem. Not sure yet..

kgx commented 10 years ago

Oh one more thing.. To recreate this issue, I make 1000 duplicate SSL requests in a loop and it always fails intermittently at some point during that sequence. It's not just my local network either , as this also occurs in my production environment.

gmanfunky commented 10 years ago

I've been hunting this bug too @kgx . I'm not sure if actual bug is completely client side or server side. I've only hit it when using jruby SSL to rubygems.org.

I'm having trouble finding a local test case. Were you able to reproduce it using servers leveraging Keen IO Ruby API or Stripe Ruby API?

jruby -ropen-uri -e "Thread.abort_on_exception = true ; 10.times { Thread.new { loop { open('https://rubygems.org/').read; p :ok } } }; sleep"
gmanfunky commented 10 years ago

I was able to reproduce the fail cases using nginx-1.0.15-5.el6.x86_64 ssl server on an old centos 6.2 image with openssl-1.0.0-20.el6_2.5.x86_64

Since this seems like more of a jruby issue than a rubygems.org issue at the moment, i'm going to continue my comments @ https://github.com/jruby/jruby/issues/1331

kgx commented 10 years ago

@gmanfunky I agree this is more of jruby issue. When I posted a few weeks back I mistakenly thought this issue was in the jruby repo. Continuing comments at jruby/jruby#1331 as well...

gkuchta commented 10 years ago

As far as I was able to determine, and as far as I'm still seeing, this is only an issue with the OracleJDK. I have had no issue with OpenJDK. Are you folks seeing that, too?

headius commented 10 years ago

Indeed I have noticed that too. Perhaps we should lean on someone from Oracle related to the SSL subsystem?

kgx commented 10 years ago

@headius @gkuchta I have also been unable to recreate this issue using OpenJDK 7u55-2.4.7-1ubuntu1~0.12.04.2.

Thankfully changing JDKs is a viable workaround for server environments where OpenJDK is available.

seancorfield commented 10 years ago

I just ran across this bug via a completely different route - Braintree's payment gateway - and they said to upgrade to Oracle JDK 1.7u51 or later. We upgraded to 1.7u67 and the problem seems to have gone away. I don't know what the "bottom" version of Java was that introduced this (although I found reports of similar problems dating back to Java 5 and Java 6...).

headius commented 10 years ago

@seancorfield Did they reference a bug number? I was able to reproduce this consistently only on certain Oracle/OpenJDK builds of Java 7. I'd like to know what they fixed, to see if we can introduce a workaround.

seancorfield commented 10 years ago

@headius Unfortunately not. This is the only public information I could find from them: https://status.braintreepayments.com/incidents/bd0sr4jpbt3b

I followed up with their support/tech folks and didn't get any more details :(

The only JDK bug I can find that seems related is this https://bugs.openjdk.java.net/browse/JDK-8030806 (which was closed as "Cannot Reproduce"). It mentions SSLv2 vs SSLv3 in handshaking, as well as the bad_record_mac intermittent failures. Bugs relating to that exception have been logged against a number of projects but I wasn't able to find anything useful and concrete in terms of fixes (other than upgrade the JDK and, perhaps, force the initial handshake to SSLv3?).

headius commented 10 years ago

@seancorfield Yeah I think I've seen that JDK bug before, when investigating this issue.

I've pinged Braintree on Twitter to see if someone there might be able to get us more information.