Open dspeyer opened 9 years ago
For what it's worth, my attempt to reproduce this in a simple program:
simple.proto:
message Number {
required int32 value = 1;
}
service simple {
rpc Increment(Number) returns(Number);
rpc RunTest(Number) returns(Number);
}
simple.cc:
#include <iostream>
#include <cstdlib>
#include <rpcz/rpcz.hpp>
#include "simple.pb.h"
#include "simple.rpcz.h"
rpcz::application application;
class simpleImpl : public simple {
virtual void Increment(const Number& request, rpcz::reply<Number> reply) {
Number response;
response.set_value(request.value()+1);
reply.send(response);
}
virtual void RunTest(const Number& request, rpcz::reply<Number> reply) {
simple_Stub stub(application.create_rpc_channel("tcp://localhost:5555"), true);
Number req2;
req2.set_value(10);
for (int i=0; i<request.value(); i++) {
std::cerr << "incrementing 10... ";
Number response;
stub.Increment(req2,&response,1000);
std::cerr << response.value() << "\n";
}
reply.send(req2);
}
};
int main(int argc, char ** argv) {
if (argc==1) {
rpcz::server server(application);
simpleImpl simple_service;
server.register_service(&simple_service);
server.bind("tcp://*:5555");
application.run();
} else {
simple_Stub stub(application.create_rpc_channel("tcp://localhost:5555"), true);
Number request, response;
request.set_value(std::atoi(argv[1]));
stub.RunTest(request,&response,1000);
}
}
Commands:
protoc simple.proto --cpp_out=.
protoc simple.proto --cpp_rpcz_out=.
g++ -o simple simple.cc simple.pb.cc simple.rpcz.cc -lrpcz -lprotobuf -lboost_system
./simple &
./simple 10
Output:
incrementing 10... 11
incrementing 10... 11
incrementing 10... 11
incrementing 10... 11
incrementing 10... terminate called after throwing an instance of 'rpcz::rpc_error'
what(): status: DEADLINE_EXCEEDED
Aborted (core dumped)
Note that this is not the same bug, since the stub did terminate with timeout, but it may be related.
Hi,
I did some investigations on the issue. There is a limit on the number of handle threads per rpcz application. It is set via connection_manager_threads variable in applciation.hpp file. By default it is 10, but you can set it whatever you need to.
With the limit set to, e.g. 100, your example works fine. But I consider this as a workaround, since it's not clear for me, why rpcz reaches the limit and why there is no timeout on the Increment call when the limit is reached
Another observations.
That solves my problem! Thank you!
It seems like a more general fix would still be a good thing, but for my project I can just set the thread count high enough.
I have a program based on rpcz using the synchronous stub api. Sometimes (deterministically, but following no obvious pattern) the stub simply doesn't return. The stack is:
and then a lot of frames from my code
I'm pretty sure the server is calling reply.send (I put a print statement right before it, plus it works fine with other clients). Also, if the server weren't sending, it should still return when the timeout takes place.
I put a breakpoint on rpcz::rpc_channel_impl::handle_client_response and it didn't trigger for the problematic request (though it did for others).
The scenario that causes this is the third identical request in which the client and server are the same process. It doesn't matter whether the invocations use the same stub and rpc_channel object or create fresh ones. The requests two go through fine. I can run almost the same code with different processes (still on the same host) and that goes through fine. I haven't done really extensive tests, though, so I'm far from confident that this is the only scenario that triggers the bug.
I can post my code if you want. It's about a thousand lines and pretty messy, though. I tried to replicate the bug in a simple program, but couldn't quite get the same behavior.
Are there other tests I should be running?
Any clues on how to either debug or work around this would be greatly appreciated.