thesamet / rpcz

RPC implementation for Protocol Buffers over ZeroMQ
http://code.google.com/p/rpcz/
Apache License 2.0
111 stars 42 forks source link

Synchronous RPC call sometimes does not return #9

Open dspeyer opened 9 years ago

dspeyer commented 9 years ago

I have a program based on rpcz using the synchronous stub api. Sometimes (deterministically, but following no obvious pattern) the stub simply doesn't return. The stack is:

#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007ffff7bc941b in rpcz::sync_event::wait() ()
   from /usr/local/lib/librpcz.so
#2  0x00007ffff7bc6209 in rpcz::rpc::wait() () from /usr/local/lib/librpcz.so
#3  0x0000000000420a42 in TabletServerService_Stub::Query (
    this=0x7fffe400c480, request=..., response=0x7ffff3f568c0, 
    deadline_ms=1000) at common/gen/tabletserver.rpcz.cc:286

and then a lot of frames from my code

I'm pretty sure the server is calling reply.send (I put a print statement right before it, plus it works fine with other clients). Also, if the server weren't sending, it should still return when the timeout takes place.

I put a breakpoint on rpcz::rpc_channel_impl::handle_client_response and it didn't trigger for the problematic request (though it did for others).

The scenario that causes this is the third identical request in which the client and server are the same process. It doesn't matter whether the invocations use the same stub and rpc_channel object or create fresh ones. The requests two go through fine. I can run almost the same code with different processes (still on the same host) and that goes through fine. I haven't done really extensive tests, though, so I'm far from confident that this is the only scenario that triggers the bug.

I can post my code if you want. It's about a thousand lines and pretty messy, though. I tried to replicate the bug in a simple program, but couldn't quite get the same behavior.

Are there other tests I should be running?

Any clues on how to either debug or work around this would be greatly appreciated.

dspeyer commented 9 years ago

For what it's worth, my attempt to reproduce this in a simple program:

simple.proto:

message Number {
  required int32 value = 1;
}

service simple {
  rpc Increment(Number) returns(Number);
  rpc RunTest(Number) returns(Number);
}

simple.cc:

#include <iostream>
#include <cstdlib>
#include <rpcz/rpcz.hpp>
#include "simple.pb.h"
#include "simple.rpcz.h"

rpcz::application application;

class simpleImpl : public simple {
  virtual void Increment(const Number& request, rpcz::reply<Number> reply) {
    Number response;
    response.set_value(request.value()+1);
    reply.send(response);
  }
  virtual void RunTest(const Number& request, rpcz::reply<Number> reply) {
    simple_Stub stub(application.create_rpc_channel("tcp://localhost:5555"), true);
    Number req2;
    req2.set_value(10);
    for (int i=0; i<request.value(); i++) {
      std::cerr << "incrementing 10... ";
      Number response;
      stub.Increment(req2,&response,1000);
      std::cerr << response.value() << "\n";
    }
    reply.send(req2);
  }
};

int main(int argc, char ** argv) {
  if (argc==1) {
    rpcz::server server(application);
    simpleImpl simple_service;
    server.register_service(&simple_service);
    server.bind("tcp://*:5555");
    application.run();
  } else {
    simple_Stub stub(application.create_rpc_channel("tcp://localhost:5555"), true);
    Number request, response;
    request.set_value(std::atoi(argv[1]));
    stub.RunTest(request,&response,1000);
  }
}

Commands:

protoc simple.proto --cpp_out=.
protoc simple.proto --cpp_rpcz_out=.
g++ -o simple simple.cc simple.pb.cc  simple.rpcz.cc -lrpcz -lprotobuf -lboost_system
./simple &
./simple 10

Output:

incrementing 10... 11
incrementing 10... 11
incrementing 10... 11
incrementing 10... 11
incrementing 10... terminate called after throwing an instance of 'rpcz::rpc_error'
  what():  status: DEADLINE_EXCEEDED
Aborted (core dumped)

Note that this is not the same bug, since the stub did terminate with timeout, but it may be related.

xdmiodz commented 9 years ago

Hi,

I did some investigations on the issue. There is a limit on the number of handle threads per rpcz application. It is set via connection_manager_threads variable in applciation.hpp file. By default it is 10, but you can set it whatever you need to.

With the limit set to, e.g. 100, your example works fine. But I consider this as a workaround, since it's not clear for me, why rpcz reaches the limit and why there is no timeout on the Increment call when the limit is reached

xdmiodz commented 9 years ago

Another observations.

  1. If I create separate rpc applications for the server and the two clients, I still reach the limit but able to receive DEADLINE_EXCEED on Increment call. Looks like you can't use the same rpcz application for different clients. Don't know if it's a feature or an issue
  2. If I set the limit for the server rpcz application to 20, your example works fine. Looks like rpcz application uses the handle threads in round-robin manner and doesn't care whether they are busy or not. I assume the scenario is the following:
    • The call to RunTest occupies the first handle thread (thread 0) on server and doesn't release it until the very end
    • Subsequent 9 calls of Increment uses the rest handle threads, numbered 1,2,3,.., 9.
    • On the 10th call to Increment server's rpc application should use thread 0, but this thread is still busy with call to RunTest and you get a race.
dspeyer commented 9 years ago

That solves my problem! Thank you!

It seems like a more general fix would still be a good thing, but for my project I can just set the thread count high enough.