tomerfiliba / agnos

Agnos - The Agnostic RPC Framework
http://agnos.sourceforge.net

Transaction speed running C++ on Linux #45

Open JeremyBowen opened 13 years ago

JeremyBowen commented 13 years ago

I've got a potential problem with the speed of calls between client and server processes running C++ code under Linux.

The issue is very low throughput in terms of transactions per second. With C# on Windows we get tens of thousands of calls per second, whether the client and server run on the same PC (localhost) or across a network.

Under Linux, running the C++ code, I can't get anywhere near that number of transactions.

I have a simple loop that performs two RPC calls 100 times, timed with the Linux "time" command. This takes approximately 16 seconds when running the client and server on the same PC, and roughly the same across a network.

The same slow behaviour (~16 s per 200 calls) shows up in each of these combinations:

- Linux C++ client against the Windows C# server
- Windows C# client against the Linux C++ server
- Linux Python client against the Linux C++ server

It looks like the Linux C++ code may be waiting on a timeout, or something is blocking while setting up a connection. I'm guessing here, but the CPU barely notices, so it doesn't appear to be a lack of raw CPU cycles.

We've been unable to build a C++ Windows version to compare C++ against C# directly, nor have I figured out the relevant Python syntax to create a Python server under Linux.

If I can figure out how to attach some sample code to this issue I will do so; otherwise I'll post it here shortly.

tomerfiliba commented 13 years ago

hi, i assume you're related to @miksovic?

anyway, thank you very much for your effort -- no further tests are necessary. it's clear that the c++ implementation is considerably slower, and you say it doesn't spend much time in CPU, so it seems to be waiting on locks or IO. my first thought was that maybe my c++ skills were rusty and i'd used lots of copy-construction or something in that spirit, but if it's not CPU-bound, that's less likely to be the cause. it's more likely that i'm either doing the IO wrong, or that my Mutex class is very slow.

some pointers:

a good test would be running the client under strace, to see where it's waiting. i'd appreciate it if you could do that for me.
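for example, something along these lines (standard strace flags, nothing agnos-specific; -c gives a per-syscall summary, -T prints the time spent in each call):

$ strace -c ./client                     # summary: time, calls, errors per syscall
$ strace -T -e trace=network ./client    # per-call wall time of network syscalls, shown in <...>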

as for attaching files -- it would be easiest to just create a repo on github and upload your files there (that way it's revision-controlled and easy to share), or you could use http://gist.github.com (or pastebin, etc.)

and last but not least -- i'm busy over my head at the moment, so it might take me several days to get to it. any progress you guys make would be greatly appreciated (you don't have to solve the problem -- just locate it). the best way is to fork the project, commit your changes/tests, and then send me a pull request.

thanks, -tomer

JeremyBowen commented 13 years ago

Hi yes, @miksovic is primarily working on Windows and I'm working on Linux.

Thanks for your response. I'm not sure this is entirely C++ related. Maybe it's more a Linux thing. I've run strace and ltrace over both my client and server programs and don't have a huge amount of new information.

I'll post the results of my tests as separate comments:

JeremyBowen commented 13 years ago

First up is testing my Linux clients (Python and C++) against a Linux C++ server:

Python Client

$ ltrace -c -S python ./client.py
% time     seconds  usecs/call     calls function
 73.34   62.395172       87388       714 SYS_select
 26.60   22.630919    22630919         1 Py_Main
  0.02    0.014398          15       916 SYS_open
  0.01    0.007676          18       408 SYS_sendto
...
real    0m8.250s
user    0m0.172s
sys     0m0.072s

Again with strace:

$ time strace -c python ./client.py
% time     seconds  usecs/call     calls    errors syscall
 78.64    0.000081           0       408           sendto
 21.36    0.000022           0       295       248 stat
  0.00    0.000000           0       221           read
  0.00    0.000000           0        50           write
...
real    0m8.253s
user    0m0.144s
sys     0m0.084s

C++ Client

$ ltrace -c -S ./client
% time     seconds  usecs/call     calls function
 27.29   12.051400         107    112473 mcount
 21.32    9.411614       15253       617 _ZNSi4readEPcl
 10.50    4.634701        7511       617 _ZNSt15basic_streambufIcSt11char_traitsIcEE6xsgetnEPcl
 10.42    4.598853       15695       293 _ZNSt15basic_streambufIcSt11char_traitsIcEE5uflowEv
  7.53    3.323090        8144       408 _ZNSo5writeEPKcl
  5.43    2.397605       23505       102 _ZNSo5flushEv
  4.70    2.075017         919      2257 _ZNKSt15basic_streambufIcSt11char_traitsIcEE5pbaseEv
  4.67    2.061238        1260      1635 _ZNKSt9basic_iosIcSt11char_traitsIcEE3eofEv
  3.29    1.453525        4960       293 recvmsg
  3.22    1.421815        4803       296 SYS_recvmsg
...
real    0m11.073s
user    0m1.756s
sys     0m5.576s

Now strace on the C++ client:

$ time strace -c ./client
% time     seconds  usecs/call     calls    errors syscall
100.00    0.004000          19       209           recvmsg
  0.00    0.000000           0        11           read
  0.00    0.000000           0        51           write
  0.00    0.000000           0        12           open
  0.00    0.000000           0        16           close
...
real    0m8.101s
user    0m0.024s
sys     0m0.024s

JeremyBowen commented 13 years ago

Now the Linux clients against the Windows C# server process on a networked machine.

Python Client on Linux

$ time ltrace -c -S python ./client.py
% time     seconds  usecs/call     calls function
 72.59   60.406528       84602       714 SYS_select
 27.34   22.755401    22755401         1 Py_Main
  0.02    0.013669          14       916 SYS_open
  0.01    0.009206          22       408 SYS_sendto
...
real    0m20.849s
user    0m0.164s
sys     0m0.068s

$ time strace -c python ./client.py
% time     seconds  usecs/call     calls    errors syscall
 59.46    0.000022           0       916       755 open
 40.54    0.000015           0        61           munmap
  0.00    0.000000           0       221           read
  0.00    0.000000           0        50           write
...
real    0m20.630s
user    0m0.128s
sys     0m0.092s

C++ Client on Linux

$ time ltrace -c -S ./client
% time     seconds  usecs/call     calls function
 19.65   45.960395       74490       617 _ZNSi4readEPcl
 18.48   43.230169       70065       617 _ZNSt15basic_streambufIcSt11char_traitsIcEE6xsgetnEPcl
 18.47   43.195086      423481       102 _ZNSt15basic_streambufIcSt11char_traitsIcEE5uflowEv
 18.32   42.847775      420076       102 recvmsg
 18.31   42.833417      407937       105 SYS_recvmsg
  4.06    9.488322          90    104441 mcount
  2.24    5.228938       12816       408 _ZNSo5writeEPKcl
  0.16    0.377865        3704       102 _ZNSo5flushEv
...
real    0m21.221s
user    0m1.652s
sys     0m4.892s

$ time strace -c ./client
% time     seconds  usecs/call     calls    errors syscall
  -nan    0.000000           0        11           read
  -nan    0.000000           0        51           write
  -nan    0.000000           0        12           open
  -nan    0.000000           0        16           close
  -nan    0.000000           0        13           fstat
  -nan    0.000000           0        35           mmap
  -nan    0.000000           0        24           mprotect
...
real    0m20.486s
user    0m0.016s
sys     0m0.028s

JeremyBowen commented 13 years ago

Running C# -> C# on the Windows box is extremely fast by comparison, although I don't have timings (too fast to measure this way). Running the C# client on Windows against the C++ server on Linux is again slow (~16 seconds for the same test).

I'm beginning to wonder if my Linux box has some sort of problem. It's obviously not restricted to C++, as the C++, C# and Python clients all behave the same way, as far as I can tell, regardless of what language the server is implemented in.

I appreciate that you're busy and I don't expect you to solve this for me. I'll keep trying to identify the cause of the poor performance, and I'll do anything I can to locate the problem.

Cheers

tomerfiliba commented 13 years ago

thanks for the effort!

-tomer

tomerfiliba commented 13 years ago

C++ to C++, linux box, modified the C++ test to include a small performance test:

// test performance: time `cycles` synchronous RPC round-trips
boost::posix_time::ptime t0(boost::posix_time::microsec_clock::local_time());
int tmp = 0;
const int cycles = 100;
for (int i = 0; i < cycles; i++) {
    // adam is the remote-object proxy; each think() is one RPC round-trip
    double res = adam->think(188, i+1);
    if (res > 2) {
        tmp = 1; // to make sure this loop is not optimized out
    }
}
boost::posix_time::ptime t1(boost::posix_time::microsec_clock::local_time());
double interval = (t1 - t0).total_microseconds() / 1000000.0;
cout << "performing " << cycles << " cycles took " << interval << " seconds" << endl;

results in: performing 100 cycles took 8.00376 seconds -- about 80 ms per call... bummer

tomerfiliba commented 13 years ago

okay, strace seems to reveal some interesting facts. first of all, boost uses sendmsg and recvmsg instead of send and recv, but i don't suppose there's a performance difference between them. on the other hand, the last two recvmsg calls are extremely slow (the time spent in each system call is printed in angle brackets at the end of each line):

client side

sendmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\7", 4}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 4 <0.000023>
sendmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\35", 4}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 4 <0.000010>
sendmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0", 4}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 4 <0.000008>
sendmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"\1\0\r\2744\0\0\0\0\10\211.0@g\200\0\0\0\0\0@\0\0\0\0\0\0\0", 29}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 29 <0.000007>
recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\7\0\0\0\0\0@g\200\0\0\0\0\0\0\0\0\tIDL_MAGIC\0\0"..., 504}], msg_controllen=0, msg_flags=0}, 0) = 4 <0.039792>
recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\t\0\0\0\0\0@W\200\0\0\0\0\0\0\0\0\tIDL_MAGIC\0\0"..., 504}], msg_controllen=0, msg_flags=0}, 0) = 17 <0.039703>

server side

recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\7\0\0\0\0\1\0\r\2744\0\0\0\0\10\211.0@g\200\0\0\0\0\0?\360\0"..., 504}], msg_controllen=0, msg_flags=0}, 0) = 4 <0.039833>
recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\35\0\0\0\0\1\0\r\2744\0\0\0\0\10\211.0@g\200\0\0\0\0\0@\0\0"..., 504}], msg_controllen=0, msg_flags=0}, 0) = 37 <0.039705>
sendmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\7", 4}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 4 <0.000024>
sendmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\t", 4}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 4 <0.000011>
sendmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0", 4}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 4 <0.000009>
sendmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\0@W\200\0\0\0\0\0", 9}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 9 <0.000008>

both the server and the client seem to be waiting a long time (blocking on recvmsg) for incoming data... 0.04 seconds. this happens twice for each invocation, so every invocation takes roughly 0.08 sec, which adds up to 8 seconds per 100 calls.
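for what it's worth, a ~40 ms stall on a small read is often a sign of nagle's algorithm interacting with TCP delayed ACKs. a minimal sketch of the usual mitigation, assuming a raw boost::asio socket rather than the tcp::iostream agnos actually uses:

#include <boost/asio.hpp>

void disable_nagle(boost::asio::ip::tcp::socket& sock) {
    // assumes sock is already connected; makes small frames go out
    // immediately instead of waiting for the peer's delayed ACK
    sock.set_option(boost::asio::ip::tcp::no_delay(true));
}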

JeremyBowen commented 13 years ago

I don't know if you've been working on this lately, but how difficult would it be to replace the Boost socket handling with a simple socket library implementation?

I've been experimenting with some simple socket handling code, and the strace info reveals no problems in sending/receiving times (~8 seconds per 40,000 calls, compared with 8 seconds per 100 calls). A sketch of the kind of loop I mean is below.

At first glance, however, linking this into the rest of the C++ framework in libagnos would require quite a bit of work, as I'm not really familiar with the boundaries of the boost code or which of its features you're making use of.
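A minimal sketch of the round-trip loop I'm talking about (illustrative only -- the port and "ping" payload are placeholders, not the actual test code):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
    // connect to a hypothetical local echo server on port 5555
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5555);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, (sockaddr*)&addr, sizeof(addr)) != 0) {
        perror("connect");
        return 1;
    }
    char buf[32];
    for (int i = 0; i < 40000; i++) {
        // one small request/response round trip, like a single RPC call
        if (send(fd, "ping", 4, 0) != 4) break;
        if (recv(fd, buf, sizeof(buf), 0) <= 0) break;
    }
    close(fd);
    return 0;
}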

tomerfiliba commented 13 years ago

yes, i think the problem is boost's tcp::iostream, so i started a branch that uses tcp::socket instead (doing send() and recv() directly). but then i wouldn't be able to use compression (boost's zlib filter), so i'll have to resort to writing my own version of tcp::iostream. it's on my todo list.
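a minimal sketch of what such a replacement could look like -- a buffered std::streambuf over a raw tcp::socket, with nagle disabled (illustrative only, not the actual agnos code; error handling and the zlib layer are omitted):

#include <streambuf>
#include <boost/asio.hpp>

class socket_streambuf : public std::streambuf {
public:
    // assumes sock is already connected
    explicit socket_streambuf(boost::asio::ip::tcp::socket& sock) : sock_(sock) {
        // send small frames immediately instead of letting nagle hold them back
        sock_.set_option(boost::asio::ip::tcp::no_delay(true));
        setg(ibuf_, ibuf_, ibuf_);               // get area starts empty
        setp(obuf_, obuf_ + sizeof(obuf_) - 1);  // keep one slot for overflow()
    }

protected:
    int_type underflow() {
        // refill the get area; read_some throws on error/eof
        std::size_t n = sock_.read_some(boost::asio::buffer(ibuf_, sizeof(ibuf_)));
        setg(ibuf_, ibuf_, ibuf_ + n);
        return traits_type::to_int_type(*gptr());
    }

    int_type overflow(int_type ch) {
        if (!traits_type::eq_int_type(ch, traits_type::eof())) {
            *pptr() = traits_type::to_char_type(ch);
            pbump(1);
        }
        return sync() == 0 ? traits_type::not_eof(ch) : traits_type::eof();
    }

    int sync() {
        // flush the put area in a single write
        std::size_t n = static_cast<std::size_t>(pptr() - pbase());
        if (n > 0) {
            boost::asio::write(sock_, boost::asio::buffer(pbase(), n));
            setp(obuf_, obuf_ + sizeof(obuf_) - 1);
        }
        return 0;
    }

private:
    boost::asio::ip::tcp::socket& sock_;
    char ibuf_[8192];
    char obuf_[8192];
};

an std::iostream constructed over this streambuf would slot in where tcp::iostream is used now, and a zlib filtering layer could be stacked on top later.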

davidsj2 commented 12 years ago

Any updates on this issue?

tomerfiliba commented 12 years ago

no, i'm not actively maintaining agnos any more...