iconara / cql-rb

Cassandra CQL 3 binary protocol driver for Ruby
106 stars 31 forks source link

No Windows Support? "Cluster disconnect failed: Bad file descriptor" Error on Windows with Cassandra 1.2.11.2 #58

Closed SimonKaluza closed 10 years ago

SimonKaluza commented 10 years ago

The following code runs fine on Ruby 1.9.3 on Linux but raises strange errors/exceptions on Ruby 1.9.3 on Windows:

require 'cql'
require 'logger'

logger = Logger.new(STDERR)
client = Cql::Client.connect(hosts: ['my_hostname_here'], logger: logger)

The output on RHEL, which is expected is:

[root@localhost cql-test]# ruby cql.rb
D, [2013-12-04T17:01:28.693543 #8190] DEBUG -- : Connecting to node at my_hostname_here:9042
I, [2013-12-04T17:01:28.705406 #8190]  INFO -- : Connected to node 70fb116d-fafd-495b-9c2b-112415d0991f at accfs:9042 in data center Solr
D, [2013-12-04T17:01:28.705548 #8190] DEBUG -- : Looking for additional nodes
D, [2013-12-04T17:01:28.707796 #8190] DEBUG -- : No additional nodes found
I, [2013-12-04T17:01:28.708128 #8190]  INFO -- : Cluster connection complete
W, [2013-12-04T17:01:28.708937 #8190]  WARN -- : Connection to node 70fb116d-fafd-495b-9c2b-112415d0991f at accfs:9042 in data center Solr unexpectedly closed

The broken output on Windows is:

D, [2013-12-04T18:39:32.868677 #16252] DEBUG -- : Connecting to node at my_hostname_here:9042
D, [2013-12-04T18:39:32.888677 #16252] DEBUG -- : Looking for additional nodes
E, [2013-12-04T18:39:32.888677 #16252] ERROR -- : Failed connecting to cluster: exception object expected
E, [2013-12-04T18:39:32.888677 #16252] ERROR -- : Cluster disconnect failed: Bad file descriptor
C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/client/connection_helper.rb:29:in `raise': exception object expected (TypeError)
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/client/connection_helper.rb:29:in `block in connect'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:76:in `try'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:146:in `block in map'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:383:in `call'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:383:in `block in resolve'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:382:in `each'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:382:in `resolve'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:33:in `fulfill'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:52:in `block in observe'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:383:in `call'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:383:in `block in resolve'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:382:in `each'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:382:in `resolve'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:33:in `fulfill'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:52:in `block in observe'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:281:in `call'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:281:in `on_value'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:52:in `observe'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:168:in `block in flat_map'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:383:in `call'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:383:in `block in resolve'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:382:in `each'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:382:in `resolve'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:427:in `block (2 levels) in initialize'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:383:in `call'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:383:in `block in resolve'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:382:in `each'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:382:in `resolve'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:33:in `fulfill'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:76:in `try'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:197:in `block in recover'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:405:in `call'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:405:in `block in fail'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:404:in `each'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:404:in `fail'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:44:in `fail'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:164:in `block in flat_map'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:405:in `call'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:405:in `block in fail'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:404:in `each'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:404:in `fail'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:44:in `fail'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:144:in `block in map'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:405:in `call'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:405:in `block in fail'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:404:in `each'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:404:in `fail'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/future.rb:44:in `fail'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/io/connection.rb:212:in `closed!'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/io/connection.rb:76:in `close'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/io/io_reactor.rb:291:in `block in close_sockets'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/io/io_reactor.rb:289:in `each'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/io/io_reactor.rb:289:in `close_sockets'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/io/io_reactor.rb:141:in `ensure in block in start'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/io/io_reactor.rb:147:in `block in start'

I tried debugging in to see what caused this error, but the multithreaded nature of the promises/future failures made it difficult to pinpoint the error. Is there no support for Windows machines? If someone can point me toward a few relevant classes in the source code I'd work on a patch and submit it.

SimonKaluza commented 10 years ago

Does the library currently rely on POSIX-specific sockets? Just occurred to me that "Cluster disconnect failed: Bad file descriptor" smacks of POSIX standards and those types of sockets might be nonexistent in a Windows environment.

iconara commented 10 years ago

cql-rb uses the standard Ruby Socket API (Socket and IO.select basically). I don't know how those work on Windows (poorly it seems from your report).

It would require too much effort for me to fix this, I wouldn't even know where to start, but I'd be very grateful for patches. The relevant code is in the Cql::Io module, specifically Cql::Io::Connection and Cql::Io::IoLoopBody#check_sockets!.

A workaround is to use JRuby on Windows since it implements the socket API on top of Java NIO.

iconara commented 10 years ago

The stack trace from Windows show something unexpected that probably masks the real error: could you run again and print out the value of the e variable on line 29 in cql/client/connection_helper.rb? It should be an exception, but seems to be something else (which could be nil, which is unexpected and should be handled, but at least not completely weird).

SimonKaluza commented 10 years ago

Thanks for the quick feedback! I debugged all around the ConnectionHelper.connect method and the results were peculiar. It seems that even though connected_connections.empty? always returns true, there's never an exception returned from connections.first.error. e is always nil, as you had guessed. Any other thoughts on where to check/dive deeper?

iconara commented 10 years ago

The error could be anywhere in the call chain, and all of it is asynchronous. If you could trace where the error occurs (and why it ends up as nil) that would be a great help.

One thing you could do is to add the following lines of code:

With that we can hopefully determine which part of the connection sequence that is failing (and why there's no error).

You could also add p error just before https://github.com/iconara/cql-rb/blob/master/lib/cql/client/connection_helper.rb#L74

SimonKaluza commented 10 years ago

I tried inserting the 3 print debug statements you mentioned, but I had to use on_failure instead of on_error for the connected and f objects (I assume that's what you meant as those classes didn't have the on_error method)...

The only one that triggered and printed to console was the connected.on_failure event, but e was still nil unfortunately.

I then set a binding.pry debug point in the connected.on_failure block as follows:

From: C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/client/connection_helper.rb @ line 112 Cql::Client::ConnectionHelper#connect_to_host:

    107: def connect_to_host(host, keyspace)
    108:   @logger.debug('Connecting to node at %s:%d' % [host, @port])
    109:   connected = @io_reactor.connect(host, @port, @connection_timeout)
    110:   connected.on_failure do |e|
    111:       p [:connected, e]
 => 112:       binding.pry
    113:   end
    114:   connected.flat_map do |connection|
    115:     initialize_connection(connection, keyspace)
    116:   end
    117: end

And found that the stack at that the moment that debug point got triggered looks like this:

Showing all accessible frames in stack (14 in total):
--
=> #0  connect_to_host <Cql::Client::ConnectionHelper#connect_to_host(host, keyspace)>
   #1 [block]   block in fail <Cql::CompletableFuture#fail(error)>
   #2 [method]  fail <Cql::CompletableFuture#fail(error)>
   #3 [method]  fail <Cql::Promise#fail(error)>
   #4 [block]   block in map <Cql::FutureCombinators#map(&block)>
   #5 [block]   block in fail <Cql::CompletableFuture#fail(error)>
   #6 [method]  fail <Cql::CompletableFuture#fail(error)>
   #7 [method]  fail <Cql::Promise#fail(error)>
   #8 [method]  closed! <Cql::Io::Connection#closed!(cause)>
   #9 [method]  close <Cql::Io::Connection#close(cause=?)>
   #10 [block]   block in close_sockets <Cql::Io::IoLoopBody#close_sockets()>
   #11 [method]  close_sockets <Cql::Io::IoLoopBody#close_sockets()>
   #12 [block]   ensure in block in start <Cql::Io::IoReactor#start()>
   #13 [block]   block in start <Cql::Io::IoReactor#start()>

And specifically, the reason for that particular nil exception is that socket.close got called without a cause parameter as follows:

From: C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/io/io_reactor.rb @ line 291 Cql::Io::IoLoopBody#close_sockets:

    288: def close_sockets
    289:   @sockets.each do |s|
    290:     begin
 => 291:       s.close unless s.closed?
    292:     rescue
    293:       # the socket had most likely already closed due to an error
    294:     end
    295:   end
    296: end

This defaulted the cause variable of the close method to nil.

So perhaps the problem occurs even before that particular nil exception? Does the fact that the program was attempting to call close_sockets() mean it hit something fatal even earlier than the nil exception implies?

iconara commented 10 years ago

Sorry, on_failure it is I should have tested it myself.

Thanks so much for the digging, now I think I know what the real problem is. There are two ways that #close_sockets could be called: either #stop was called on the reactor object, or the reactor loop raised an exception. In this case I believe it raised an exception, because of the "Bad file descriptor" error from your original stack trace. So the whole business with the nil error was a red herring after all. I should probably make sure that it doesn't happen, though.

In other words, the real problem is what you suspected from the start, that the socket operations cause a "Bad file descriptor" error. If you want to dig deeper into what causes that and maybe fix it then you can look at the stack trace of the $! exception object here https://github.com/iconara/cql-rb/blob/master/lib/cql/io/io_reactor.rb#L145

SimonKaluza commented 10 years ago

So I dug around in the reactor loop based on your suggestion and found that the core issue was the very first call to Cql::Io::Unblocker#read that the program makes, specifically from https://github.com/iconara/cql-rb/blob/master/lib/cql/io/io_reactor.rb#L329 .

The very first time the r Unblocker object attempts to perform the @out.read_nonblock(2**16) it raises that dreaded EBADF: Bad file descriptor. According to this bug report https://www.ruby-forum.com/topic/3554151 it seems this read_nonblock method has failed on other projects on Windows... Guess it will remain a platform limitation unless we can think of a way to replace the read_nonblock call without blocking the thread -- blech!

I'm looking into doing some kind of work around for my project in the interim, but I doubt I'll be able to put much more serious time into until after the holidays. Let me know if you think of anything that might help us work around this lame Windows shortcoming.

iconara commented 10 years ago

Hmm, I wonder if it would be ok to do a #readpartial there instead of #read_nonblock. What needs to happen is that it empties the pipe's buffer without blocking. In theory #readpartial should do exactly that. You could try chaging it and see what happened. If it works I'll have to see what the downside of doing #readpartial would be.

SimonKaluza commented 10 years ago

Yep! Good call @iconara . I can confirm that altering the Unblocker#read method to use #readpartial allowed your library to seemingly work flawlessly on Windows (I was able to connect, query, and close the connection without issue).

I did not run the cql-rb unit tests yet though.

iconara commented 10 years ago

That's good. #readpartial is a bit scary to use though, it could potentially block (although it shouldn't in this particular case since we know there's something in the pipe, but those are famous last words).

iconara commented 10 years ago

There's a branch called windows_support now that checks if it's running on Windows, but not in JRuby and installs a version of the Unblocker#read method that calls #readpartial instead of #read_nonblock. Could you test it out and confirm that it works? If it does I will merge it into master and it will get released either as v1.1.3 if you want it soon, or as v1.2.0 if you can wait a bit.

I don't know how to test this properly unfortunately, I don't have access to Windows machines to run the tests on.

iconara commented 10 years ago

I'll close this now. I don't have the resources to do the testing required. If anyone finds this and want to help out feel free, but the branch I created is hopelessly out of date. I can help pointing you in the right direction.

In general I think that if you want to run Ruby on Windows you're better off using JRuby and the problem does not exist there. The underlying problem is a bug, or limitation in MRI, not a bug in cql-rb.