Closed SimonKaluza closed 10 years ago
Does the library currently rely on POSIX-specific sockets? Just occurred to me that "Cluster disconnect failed: Bad file descriptor" smacks of POSIX standards and those types of sockets might be nonexistent in a Windows environment.
cql-rb uses the standard Ruby Socket API (Socket
and IO.select
basically). I don't know how those work on Windows (poorly it seems from your report).
It would require too much effort for me to fix this, I wouldn't even know where to start, but I'd be very grateful for patches. The relevant code is in the Cql::Io
module, specifically Cql::Io::Connection
and Cql::Io::IoLoopBody#check_sockets!
.
A workaround is to use JRuby on Windows since it implements the socket API on top of Java NIO.
The stack trace from Windows show something unexpected that probably masks the real error: could you run again and print out the value of the e
variable on line 29 in cql/client/connection_helper.rb
? It should be an exception, but seems to be something else (which could be nil
, which is unexpected and should be handled, but at least not completely weird).
Thanks for the quick feedback! I debugged all around the ConnectionHelper.connect
method and the results were peculiar. It seems that even though connected_connections.empty?
always returns true, there's never an exception returned from connections.first.error
. e
is always nil
, as you had guessed. Any other thoughts on where to check/dive deeper?
The error could be anywhere in the call chain, and all of it is asynchronous. If you could trace where the error occurs (and why it ends up as nil) that would be a great help.
One thing you could do is to add the following lines of code:
connected.on_error { |e| p [:connected, e] }
just before https://github.com/iconara/cql-rb/blob/master/lib/cql/client/connection_helper.rb#L105identified.on_error { |e| p [:identified, e] }
just before https://github.com/iconara/cql-rb/blob/master/lib/cql/client/connection_helper.rb#L114f.on_error { |e| p [:execute_identify, e] }
just before https://github.com/iconara/cql-rb/blob/master/lib/cql/client/connection_helper.rb#L113With that we can hopefully determine which part of the connection sequence that is failing (and why there's no error).
You could also add p error
just before https://github.com/iconara/cql-rb/blob/master/lib/cql/client/connection_helper.rb#L74
I tried inserting the 3 print debug statements you mentioned, but I had to use on_failure
instead of on_error
for the connected
and f
objects (I assume that's what you meant as those classes didn't have the on_error
method)...
The only one that triggered and printed to console was the connected.on_failure
event, but e
was still nil
unfortunately.
I then set a binding.pry
debug point in the connected.on_failure
block as follows:
From: C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/client/connection_helper.rb @ line 112 Cql::Client::ConnectionHelper#connect_to_host:
107: def connect_to_host(host, keyspace)
108: @logger.debug('Connecting to node at %s:%d' % [host, @port])
109: connected = @io_reactor.connect(host, @port, @connection_timeout)
110: connected.on_failure do |e|
111: p [:connected, e]
=> 112: binding.pry
113: end
114: connected.flat_map do |connection|
115: initialize_connection(connection, keyspace)
116: end
117: end
And found that the stack at that the moment that debug point got triggered looks like this:
Showing all accessible frames in stack (14 in total):
--
=> #0 connect_to_host <Cql::Client::ConnectionHelper#connect_to_host(host, keyspace)>
#1 [block] block in fail <Cql::CompletableFuture#fail(error)>
#2 [method] fail <Cql::CompletableFuture#fail(error)>
#3 [method] fail <Cql::Promise#fail(error)>
#4 [block] block in map <Cql::FutureCombinators#map(&block)>
#5 [block] block in fail <Cql::CompletableFuture#fail(error)>
#6 [method] fail <Cql::CompletableFuture#fail(error)>
#7 [method] fail <Cql::Promise#fail(error)>
#8 [method] closed! <Cql::Io::Connection#closed!(cause)>
#9 [method] close <Cql::Io::Connection#close(cause=?)>
#10 [block] block in close_sockets <Cql::Io::IoLoopBody#close_sockets()>
#11 [method] close_sockets <Cql::Io::IoLoopBody#close_sockets()>
#12 [block] ensure in block in start <Cql::Io::IoReactor#start()>
#13 [block] block in start <Cql::Io::IoReactor#start()>
And specifically, the reason for that particular nil
exception is that socket.close
got called without a cause
parameter as follows:
From: C:/Ruby193/lib/ruby/gems/1.9.1/gems/cql-rb-1.1.1/lib/cql/io/io_reactor.rb @ line 291 Cql::Io::IoLoopBody#close_sockets:
288: def close_sockets
289: @sockets.each do |s|
290: begin
=> 291: s.close unless s.closed?
292: rescue
293: # the socket had most likely already closed due to an error
294: end
295: end
296: end
This defaulted the cause
variable of the close
method to nil
.
So perhaps the problem occurs even before that particular nil
exception? Does the fact that the program was attempting to call close_sockets()
mean it hit something fatal even earlier than the nil
exception implies?
Sorry, on_failure
it is I should have tested it myself.
Thanks so much for the digging, now I think I know what the real problem is. There are two ways that #close_sockets
could be called: either #stop
was called on the reactor object, or the reactor loop raised an exception. In this case I believe it raised an exception, because of the "Bad file descriptor" error from your original stack trace. So the whole business with the nil
error was a red herring after all. I should probably make sure that it doesn't happen, though.
In other words, the real problem is what you suspected from the start, that the socket operations cause a "Bad file descriptor" error. If you want to dig deeper into what causes that and maybe fix it then you can look at the stack trace of the $!
exception object here https://github.com/iconara/cql-rb/blob/master/lib/cql/io/io_reactor.rb#L145
So I dug around in the reactor loop based on your suggestion and found that the core issue was the very first call to Cql::Io::Unblocker#read
that the program makes, specifically from https://github.com/iconara/cql-rb/blob/master/lib/cql/io/io_reactor.rb#L329 .
The very first time the r
Unblocker object attempts to perform the @out.read_nonblock(2**16)
it raises that dreaded EBADF: Bad file descriptor
. According to this bug report https://www.ruby-forum.com/topic/3554151 it seems this read_nonblock method has failed on other projects on Windows... Guess it will remain a platform limitation unless we can think of a way to replace the read_nonblock call without blocking the thread -- blech!
I'm looking into doing some kind of work around for my project in the interim, but I doubt I'll be able to put much more serious time into until after the holidays. Let me know if you think of anything that might help us work around this lame Windows shortcoming.
Hmm, I wonder if it would be ok to do a #readpartial
there instead of #read_nonblock
. What needs to happen is that it empties the pipe's buffer without blocking. In theory #readpartial
should do exactly that. You could try chaging it and see what happened. If it works I'll have to see what the downside of doing #readpartial
would be.
Yep! Good call @iconara . I can confirm that altering the Unblocker#read
method to use #readpartial
allowed your library to seemingly work flawlessly on Windows (I was able to connect, query, and close the connection without issue).
I did not run the cql-rb unit tests yet though.
That's good. #readpartial
is a bit scary to use though, it could potentially block (although it shouldn't in this particular case since we know there's something in the pipe, but those are famous last words).
There's a branch called windows_support
now that checks if it's running on Windows, but not in JRuby and installs a version of the Unblocker#read
method that calls #readpartial
instead of #read_nonblock
. Could you test it out and confirm that it works? If it does I will merge it into master and it will get released either as v1.1.3 if you want it soon, or as v1.2.0 if you can wait a bit.
I don't know how to test this properly unfortunately, I don't have access to Windows machines to run the tests on.
I'll close this now. I don't have the resources to do the testing required. If anyone finds this and want to help out feel free, but the branch I created is hopelessly out of date. I can help pointing you in the right direction.
In general I think that if you want to run Ruby on Windows you're better off using JRuby and the problem does not exist there. The underlying problem is a bug, or limitation in MRI, not a bug in cql-rb.
The following code runs fine on Ruby 1.9.3 on Linux but raises strange errors/exceptions on Ruby 1.9.3 on Windows:
The output on RHEL, which is expected is:
The broken output on Windows is:
I tried debugging in to see what caused this error, but the multithreaded nature of the promises/future failures made it difficult to pinpoint the error. Is there no support for Windows machines? If someone can point me toward a few relevant classes in the source code I'd work on a patch and submit it.