jcarreira / cirrus-kv

High-performance key-value store
Apache License 2.0

Throughput at 128 byte level in benchmarks low #46

Open TylerADavis opened 7 years ago

TylerADavis commented 7 years ago

Currently, we are only seeing throughput of about 20 MB/s on 128 byte puts (before the introduction of the new interface). We should be seeing speeds of about 1 GB/s.

Current speeds (MB/s, messages/s):

128 bytes: 20.7 MB/s, 162072 msg/s
4K bytes: 556.371 MB/s, 135833 msg/s
50K bytes: 2445.7 MB/s, 47767.9 msg/s
1M bytes: 4442 MB/s, 4236.22 msg/s
10M bytes: 4369.74 MB/s, 416.731 msg/s

TylerADavis commented 7 years ago

Using TCP on a Mac with the server and client on the same machine, at the 128 byte level I get 4810 messages per second and a throughput of 0.6 MB per second. This is significantly faster than the Amazon benchmark results, as well as the results I've seen on Linux, though much slower than the RDMA results above that we had also deemed "too slow".

128 bytes: 4800 msg/s, 0.6 MB/s
4k bytes: 3900 msg/s, 16 MB/s
50k bytes: 1120 msg/s, 57 MB/s

jcarreira commented 7 years ago

I got the following output running the 1_1 benchmark on my laptop.

Measuring latencies..
[INFO] - 1499606037284 returned from write async
[INFO] - 1499606037284 Client sending size:  192
[INFO] - 1499606037284 Client sending main message
[INFO] - 1499606037284 message pair sent by client
[INFO] - 1499606037285 Poll returned with status:  1
[INFO] - 1499606037285 Non read event on socket:  3
[INFO] - 1499606037285 Processing socket:  4
[INFO] - 1499606037285 Server received size from client
[INFO] - 1499606037285 Server received incoming size of  192
[INFO] - 1499606037285 Server read()ing
[INFO] - 1499606037323 Server received  192  bytes of  192
[INFO] - 1499606037323 Server received full message from client
[INFO] - 1499606037323 Server checking type of message
[INFO] - 1499606037323 Server processing write request.
[INFO] - 1499606037323 Server sent size.
[INFO] - 1499606037323 Server sent ack of size:  64
[INFO] - 1499606037323 Server done processing message from client
[INFO] - 1499606037323 Server calling poll.
[INFO] - 1499606037323 Client read  4  bytes of 4
[INFO] - 1499606037323 Size of incoming message received from server:  64
[INFO] - 1499606037323 read()ing
[INFO] - 1499606037363 Client has read  64  of  64  bytes.
[INFO] - 1499606037363 client done processing message
[INFO] - 1499606037363 Poll returned with status:  1
[INFO] - 1499606037363 client waiting for message from server
[INFO] - 1499606037363 Non read event on socket:  3
[INFO] - 1499606037363 Processing socket:  4
[INFO] - 1499606037363 Server received size from client
[INFO] - 1499606037363 Server received incoming size of  192

It seems that the read() calls take roughly 40 ms each.

TylerADavis commented 7 years ago

I found this Stack Overflow post; the solution suggests one potential issue we could be facing. I'll look into it more.

https://stackoverflow.com/questions/32274907/why-does-tcp-socket-slow-down-if-done-in-multiple-system-calls

TylerADavis commented 7 years ago

As an update, I disabled Nagle's algorithm as recommended in the Stack Overflow post and got a significant speedup: the TCP setup now sees a throughput of 4000 messages per second at the 128 byte level, vs. the previous 13. I'm now continuing to debug the segfault.
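
For reference, the change boils down to setting the TCP_NODELAY socket option on each connected socket. A minimal sketch of the idea (the helper name is illustrative, not our actual code):

// Minimal sketch: disable Nagle's algorithm on a connected TCP socket.
// Without TCP_NODELAY the kernel may hold back small writes (like a
// 4-byte size header) while waiting for an ACK, which interacts badly
// with delayed ACKs and matches the ~40 ms gaps in the logs above.
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

bool disable_nagle(int sock_fd) {
    int one = 1;
    return setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY,
                      &one, sizeof(one)) == 0;
}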

TylerADavis commented 7 years ago

Current Speeds on linux:

128 bytes: 4200 msg/s, 0.54 MB/s
4k bytes: 3300 msg/s, 13 MB/s
50k bytes: 1000 msg/s, 50 MB/s
1M bytes: 70 msg/s, 73 MB/s

jcarreira commented 6 years ago

4200 messages per second means ~238 µs per message. Seems reasonable for a TCP stack.

The throughputs are terrible though. We should be getting 1 or 2 orders of magnitude higher than this.

TylerADavis commented 6 years ago

Right now the messages are sent in two separate calls to write(): one with the size of the message, and then one with the message itself. I can try combining these two into one call, but that would require an additional memory copy. That doesn't seem like it would provide the order-of-magnitude improvement, however.
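
Another option that avoids the extra copy would be scatter/gather I/O: writev() can hand the size header and the message body to the kernel in one syscall. A rough sketch, with illustrative names rather than our actual interface:

// Hedged sketch: writev() sends the 4-byte size header and the payload
// in one syscall, with no intermediate copy into a combined buffer.
#include <sys/uio.h>
#include <cstdint>

ssize_t send_message(int sock_fd, uint32_t size, const void* payload) {
    struct iovec iov[2];
    iov[0].iov_base = &size;                       // header: message size
    iov[0].iov_len  = sizeof(size);
    iov[1].iov_base = const_cast<void*>(payload);  // body: message contents
    iov[1].iov_len  = size;
    return writev(sock_fd, iov, 2);  // may still return a short count
}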

jcarreira commented 6 years ago

I would start by computing the throughput of the second write (the one with the contents of the object being written).

TylerADavis commented 6 years ago

I used the Instruments profiling tool on macOS to profile the runtime of the throughput benchmark at the 128 byte level, and it seems that a significant amount of time is spent creating the cirrus::PosixSemaphore objects for each operation, and especially in the random number generator. I'll see what I can do to speed this up.

Weight is the total time for a symbol and all its children, and self weight is the amount of time that a specific symbol spent running.

Very little time was actually spent sending or receiving messages. Process_received() spent ~175 ms in the read syscall, and Process_send() spent about 200 ms sending.

[screenshot: Instruments profile of the throughput benchmark, 2017-07-11 12:36 PM]

TylerADavis commented 6 years ago

I profiled the tcpserver as well, and it did not have any significant performance bottlenecks. The majority of its time was spent in calls to poll(), which makes sense, as that is where it waits for I/O. It then spent ~161 ms sending and ~98 ms calling read(). None of the message-processing code (copying the message out of the flat buffer and inserting it into the map) seemed to take a significant amount of time.
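
For context, a poll()-based loop of roughly this shape (an illustrative sketch, not our actual code) spends its idle time blocked in poll(), so that result is expected:

// Illustrative poll()-based event loop: the server blocks in poll() until
// a socket is readable, so "time spent in poll" is largely idle waiting.
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <vector>

void serve(int listen_fd) {
    std::vector<pollfd> fds = {{listen_fd, POLLIN, 0}};
    while (true) {
        if (poll(fds.data(), fds.size(), -1) < 0)
            break;                                 // poll error
        if (fds[0].revents & POLLIN) {             // new connection
            int client = accept(listen_fd, nullptr, nullptr);
            if (client >= 0)
                fds.push_back({client, POLLIN, 0});
        }
        for (size_t i = 1; i < fds.size(); ++i) {
            if (fds[i].revents & POLLIN) {
                char buf[4096];
                ssize_t n = read(fds[i].fd, buf, sizeof(buf));
                if (n <= 0) {                      // peer closed or error
                    close(fds[i].fd);
                    fds[i].fd = -1;                // poll() skips negative fds
                }
                // ...otherwise parse the size header and message body...
            }
        }
    }
}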

jcarreira commented 6 years ago

How much data is being transferred? What is the throughput of the write()s and read()s? Does it match network bandwidth?

TylerADavis commented 6 years ago

I'll look into that

TylerADavis commented 6 years ago

How would you recommend going about finding this? Unfortunately, the tool I used to profile execution time has no way to report total data sent or throughput.

I attempted to use logging statements around the send and receive operations in the TCPClient, but I'm not entirely sure the results reflect the actual throughput: my understanding is that a send() call returns as soon as the data is placed into the send buffer, and a receive call only blocks if it is made before the data is in the receive buffer. Additionally, the logging timestamps have 1 ms granularity, which is significantly longer than the send operations take, so it is not possible to tell the interval between send operations.
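
One workaround for the 1 ms granularity would be timing the syscalls directly with std::chrono. A rough sketch (the names are illustrative), with the same caveat that write() may return as soon as the data reaches the send buffer:

// Hedged sketch: time one write() with sub-millisecond resolution instead
// of relying on 1 ms log timestamps. Note: this times the syscall, which
// may complete as soon as the data is copied into the kernel send buffer.
#include <chrono>
#include <unistd.h>

double write_throughput_MBps(int sock_fd, const void* buf, size_t len) {
    auto start = std::chrono::steady_clock::now();
    ssize_t sent = write(sock_fd, buf, len);
    auto stop = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(stop - start).count();
    return (sent > 0 && secs > 0.0) ? (sent / secs) / 1e6 : 0.0;  // MB/s
}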

Here are the results I found by counting the bytes per send and the number of sends/receives per second:

Receives: twelve in a period of 3 ms, i.e. four full messages received per ms, or about 250 microseconds per receive, which is in line with what you'd calculated previously. Each ack message is 68 bytes long. However, this is limited by the rate at which the server receives requests, which in turn is limited by the slow code that generates random names for the semaphores.

Sends: Twelve full messages sent in a period of 3 ms for a total of 4 per ms, the same as receives. Each of these is 196 bytes.

Server-side processing: in the span of one ms, the server also processes and responds to four requests. Due to the 1 ms granularity, I can't break this down any further.

TylerADavis commented 6 years ago

As for the slow generation of random names for the semaphores, I've got a few ideas:

1) Find a faster random number generator, and see if that improves the runtime.
2) See if there is some way to get away with not having to re-seed the generator each time (though the seeding seems to take barely any time).
3) Do away with named semaphores entirely, and instead use POSIX unnamed semaphores on Linux and some other semaphore on Mac (SysV? GCD?).

I think the first of these options sounds like the best to start with (a combined sketch of 1) and 2) follows below), but let me know if you have any suggestions.
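
As a sketch of what 1) and 2) might look like together: seed a fast engine once per thread and reuse it for every name (the name format here is made up, not what we currently generate):

// Hedged sketch: a thread_local engine is seeded once per thread and then
// reused, so semaphore-name generation avoids per-call seeding entirely.
#include <cstdint>
#include <random>
#include <string>

std::string random_sem_name() {
    thread_local std::mt19937_64 gen{std::random_device{}()};
    std::uniform_int_distribution<uint64_t> dist;
    return "/cirrus_sem_" + std::to_string(dist(gen));  // illustrative format
}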

TylerADavis commented 6 years ago

@jcarreira I've opened #90, which increases throughput by ~4.5x with optimizations (2.75x without), but this is still shy of the order of magnitude you wanted.

TylerADavis commented 6 years ago

As an update on RDMA speeds, I reran the benchmark on master, and at the 128 byte level saw throughputs of only 9000 msg/s and 1.18 MB/s. Something seems off here, because this is half the msg/s of the latest TCP version but 10x the bytes/s. I'll look into it more.

TylerADavis commented 6 years ago

It turns out that in the most recent versions of the throughput benchmark, the size being used was that of the unique_ptr rather than the array it owns; I've rectified this in the latest push on #90.
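
For reference, a minimal illustration of the pitfall (the names are made up):

// Illustration of the bug: sizeof a std::unique_ptr is the size of the
// smart-pointer object itself (typically 8 bytes on 64-bit platforms),
// not the size of the array it owns.
#include <cstdio>
#include <memory>

int main() {
    constexpr size_t kMsgSize = 128;
    auto msg = std::make_unique<char[]>(kMsgSize);

    std::printf("%zu\n", sizeof(msg));  // 8: the pointer object
    std::printf("%zu\n", kMsgSize);     // 128: what the benchmark should use
    return 0;
}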

jcarreira commented 6 years ago

I've sent you an email about this.