shenango / caladan

Interference-aware CPU scheduling that enables performance isolation and high CPU utilization for datacenter servers

P999 may not be accurate if Never Send is high. #15


neolinsu commented 1 year ago

Hi all,

I find that the synthetic client in Caladan suffers a high Never-Send rate (above 1%) when clients issue requests at a rate close to the server's capacity. This is especially problematic under a Poisson distribution: when two adjacent requests are generated within a short time window (i.e., a bursty period), the latter one is more likely to be dropped by the Never-Send logic (see code). We have profiled Caladan's client logic and found that scheduling often delays a request (which already violates the Poisson distribution) until it is finally dropped.
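To make the mechanism concrete, the pattern is roughly the following (a minimal sketch with illustrative names, not Caladan's actual code): an open-loop generator precomputes Poisson send times, and a request whose send time has already passed by more than a cutoff when the worker gets to it is dropped and counted as never sent.

```rust
use std::time::{Duration, Instant};

// Sketch only: `deadlines` are the precomputed (Poisson) send times.
fn run_open_loop(deadlines: &[Instant], cutoff: Duration) -> (u64, u64) {
    let (mut sent, mut never_sent) = (0u64, 0u64);
    for &deadline in deadlines {
        if Instant::now() > deadline + cutoff {
            // The worker was scheduled too late (e.g., it yielded and was
            // not run again in time); sending now would distort the
            // intended arrival process, so the request is dropped.
            never_sent += 1;
            continue;
        }
        while Instant::now() < deadline {} // busy-wait until the send time
        // send_request(...) would go here
        sent += 1;
    }
    (sent, never_sent)
}
```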

We further designed an experiment to confirm this. We modified the Caladan client to disable the scheduling policy: specifically, workers are bound to separate cores and execute send, do_softirq (directpath), handle_timeout, and recv in a cycle without yielding (a sketch follows the table below). We equip the Caladan server with 4 kthreads and launch 16 client workers (each owning one TCP connection) that generate requests following a Poisson distribution; we vary the request rate, and each run lasts 32 seconds. The following table shows the results:

| Client Type | Throughput (pps) | P50 (us) | P90 (us) | P999 (us) | Never Send |
| --- | --- | --- | --- | --- | --- |
| synthetic | 0.75M | 13.4 | 23.1 | 40.0 | 1.39% |
| client w/o sched | 0.75M | 9.809091 | 18.500909 | 50.903636 | 0.000442% |
| synthetic | 0.8M | 13.6 | 22.6 | 37.9 | 1.4486% |
| client w/o sched | 0.8M | 9.49 | 16.81 | 584.52 | 0.000430% |
| synthetic | 1M | 13.6 | 21.9 | 38.6 | 1.6726% |
| client w/o sched | 1M | 9.64 | 17.59 | 2841.83 | 0.000694% |
| synthetic | 1.1M | 13.5 | 21.1 | 55.5 | 1.7345% |
| client w/o sched | 1.1M | 10.6 | 21.9 | 5177.75 | 0.000781% |
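As promised above, the modified "client w/o sched" worker has roughly this shape (illustrative stubs, not Caladan's actual API): each worker is pinned to its own core and drives all four stages itself, so a request is never delayed by a yield or a scheduling decision between stages.

```rust
struct Conn; // stand-in for a per-worker TCP connection

// Illustrative stubs; in our client these map onto Caladan's runtime
// internals rather than standalone functions.
fn send_due_requests(_c: &mut Conn) { /* transmit requests whose send time has arrived */ }
fn do_softirq(_c: &mut Conn) { /* poll the directpath queues directly */ }
fn handle_timeouts(_c: &mut Conn) { /* expire requests that waited too long */ }
fn recv_responses(_c: &mut Conn) { /* record latency samples for completions */ }

// One such loop runs pinned on each core, without ever yielding.
fn worker_loop(c: &mut Conn) {
    loop {
        send_due_requests(c);
        do_softirq(c);
        handle_timeouts(c);
        recv_responses(c);
    }
}
```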
joshuafried commented 1 year ago

Thanks! Indeed, with insufficient resources the client itself can become the bottleneck. We typically run the load generator with spinning kthreads (see here) and many cores. When one client machine is insufficient to generate the load, we use multiple machines. What are the details of your machine, and what configuration are you using for your client?
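Concretely, "spinning" here means setting all three kthread knobs in the client config to the core count, e.g. (values illustrative):

```
runtime_kthreads 20
runtime_guaranteed_kthreads 20
runtime_spinning_kthreads 20
```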

neolinsu commented 1 year ago

The CPU is an Intel Xeon at 2.20 GHz with 20 hyper-threads (10 physical cores), set to performance mode. The network is 100Gb RDMA. The configuration I use for both clients is the same (some identifying info replaced):

```
host_addr 10.100.100.103
host_netmask 255.255.255.0
host_gateway 10.100.100.1
runtime_kthreads 16
runtime_guaranteed_kthreads 16
runtime_spinning_kthreads 16
host_mac X
disable_watchdog true
runtime_qdelay_us 10
runtime_priority lc
static_arp 10.100.100.102 X
static_arp 10.100.100.103 X
enable_directpath fs
directpath_pci X
```

I also notice that even when the client runs at a low throughput (like 0.75M), where resources should be sufficient, the Never Send rate is still above 1%.

joshuafried commented 1 year ago

Can you post the output of a client here (and the parameters used to launch it)? Looking at some recent runs, I see that even at 1 Mpps my never-sent rate is < 0.1%.

neolinsu commented 1 year ago

Here is an example run at 0.8M throughput:

```
synthetic --config synthetic.config 10.100.100.102:5190 --output=buckets --protocol memcached --mode runtime-client --threads 16 --runtime 32 --barrier-peers 1 --barrier-leader node151 --distribution=exponential --mpps=0.8 --samples=1 --transport tcp --nvalues=3200000
```
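(Mapping this onto the setup described above: --threads 16 is the 16 worker connections, --runtime 32 is the 32-second run, --distribution=exponential gives the Poisson arrival process, and --mpps=0.8 targets 0.8M requests per second.)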

And synthetic's result is:

```
Distribution, Target, Actual, Dropped, Never Sent, Median, 90th, 99th, 99.9th, 99.99th, Start
exponential, 788411, 788411, 0, 326090, 13.6, 22.6, 33.2, 37.3, 37.9, 0, 8510673237596225
```
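(My reading of these columns, which may be off: if Actual (788411) is the achieved per-second rate, the 32-second run issued about 788411 × 32 ≈ 25.2M requests, so 326090 never-sent requests is roughly 1.3%, consistent with the table above.)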
joshuafried commented 1 year ago

Hm, that is quite high. Can you post a log with many samples at lower loads (change the above command to --samples 20)? Also, can you try reducing the number of kthreads to 8 and see if that has any impact?

neolinsu commented 1 year ago

> Can you post the output of a client here (and the parameters used to launch it)? Looking at some recent runs, I see that even at 1 Mpps my never-sent rate is < 0.1%.

I run server with

```
runtime_kthreads 4
runtime_guaranteed_kthreads 0
runtime_spinning_kthreads 0
```

This makes the idle cores mwait.

Would you please share your server configuration?

joshuafried commented 1 year ago

The server had 20 kthreads (20 guaranteed, 0 spinning). Does varying the server configuration impact the client behavior here?
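In config-file terms that is (illustrative, in the same format as yours):

```
runtime_kthreads 20
runtime_guaranteed_kthreads 20
runtime_spinning_kthreads 0
```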

neolinsu commented 1 year ago

> The server had 20 kthreads (20 guaranteed, 0 spinning). Does varying the server configuration impact the client behavior here?

Yes. I think 20 kthreads can handle 1M pps.

You can try my configuration.

neolinsu commented 1 year ago

The point here is not how many guaranteed kthreads the Caladan server uses. Rather, given a fixed number of guaranteed kthreads (say 4 cores), we send client requests at a rate close to (but lower than) the maximum capacity the Caladan server can handle (say 1 Mpps). In this setup, no matter how many physical cores the client machines use (even one core per connection), the Never-Send rate is always high. As a result, the generated requests exhibit a distribution that is less bursty than intended.

With our modified clients (scheduling disabled, and the softirq processing one packet at a time), the Never-Send rate is low. The generated requests then follow a distribution much closer to a true Poisson distribution, but Caladan's P999 latency becomes much higher.
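To illustrate the censoring effect, here is a toy simulation (all parameters illustrative; this is not our client, just the mechanism): each send occupies the worker for a while, so a request whose Poisson deadline lands right after the previous one starts late and, past the cutoff, is never sent. The drops therefore fall exactly on the short inter-arrival gaps, leaving a less bursty stream.

```rust
// Requires the `rand` and `rand_distr` crates.
use rand_distr::{Distribution, Exp};

fn main() {
    let mut rng = rand::thread_rng();
    let exp = Exp::new(0.05_f64).unwrap(); // mean gap = 20 us (~0.05 Mpps per connection)
    let (busy_us, cutoff_us) = (15.0_f64, 10.0_f64);

    let (mut deadline, mut free_at) = (0.0_f64, 0.0_f64);
    let (mut sent, mut dropped): (Vec<f64>, Vec<f64>) = (vec![], vec![]);
    for _ in 0..1_000_000 {
        let gap = exp.sample(&mut rng);
        deadline += gap; // Poisson deadlines: exponential inter-arrival gaps
        let start = deadline.max(free_at); // worker may still be busy
        if start - deadline > cutoff_us {
            dropped.push(gap); // never sent: too late to preserve the process
        } else {
            sent.push(gap);
            free_at = start + busy_us;
        }
    }
    let mean = |v: &[f64]| v.iter().sum::<f64>() / v.len() as f64;
    // Dropped requests have markedly shorter gaps than sent ones.
    println!("mean gap (us): sent {:.1}, dropped {:.1}", mean(&sent), mean(&dropped));
}
```

In our profiles the delay comes from client-side scheduling rather than a long send path, but the censoring effect on the gap distribution is the same.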

joshuafried commented 1 year ago

Does this behavior change if you use many connections to the server? Say 100?

joshuafried commented 1 year ago

I'm trying to understand the source of the delay that is causing so many never-sent packets. Please correct me if I'm wrong in my understanding of the scenario: the server machine is being tested at a load point close to its peak throughput, while the client process/machine is not at full utilization and is not a bottleneck. Does this seem correct?

neolinsu commented 1 year ago

> I'm trying to understand the source of the delay that is causing so many never-sent packets. Please correct me if I'm wrong in my understanding of the scenario: the server machine is being tested at a load point close to its peak throughput, while the client process/machine is not at full utilization and is not a bottleneck. Does this seem correct?

Yes, this is correct.

neolinsu commented 1 year ago

> Does this behavior change if you use many connections to the server? Say 100?

It seems the Never-Send rate becomes higher as the number of connections grows.

joshuafried commented 1 year ago

I'd be interested in trying to reproduce these results, since they generally don't match what I've seen in my setup so far. Can you provide the commit hashes you are running for caladan and memcached, the configuration files for both clients and server, and the launch parameters and output logs for the iokernel, memcached, and the loadgen instances?

neolinsu commented 1 year ago

Configs for Replay

caladan-all: 37a3822be053c37275f0aefea60da26246fd01cb

Client

```
synthetic --config synthetic.config 10.100.100.102:5190 --output=buckets --protocol memcached --mode runtime-client --threads 16 --runtime 32 --barrier-peers 1 --barrier-leader node151 --distribution=exponential --mpps=0.8 --samples=1 --transport tcp --nvalues=3200000
```

Other Setups

neolinsu commented 1 year ago

> I'd be interested in trying to reproduce these results, since they generally don't match what I've seen in my setup so far.

Would you be willing to share your configuration and results? Specifically, the Never-Send rate when the request rate is close to the maximum capacity the Caladan server can handle.

joshuafried commented 1 year ago

Can you also share the outputs/logs from the various programs that you've launched? Also, caladan-all @ 37a3822b points to caladan @ 4a254bf, though some of your configuration options imply a later version of caladan (i.e., the directpath_pci config, etc.). Can you please confirm the version you are running, and whether any modifications have been made to it?

neolinsu commented 1 year ago

> Can you also share the outputs/logs from the various programs that you've launched? Also, caladan-all @ 37a3822b points to caladan @ 4a254bf, though some of your configuration options imply a later version of caladan (i.e., the directpath_pci config, etc.). Can you please confirm the version you are running, and whether any modifications have been made to it?

We use Caladan @ 1ab79505 and memcached from caladan-all @ 37a3822b.