mdavidsaver / pvxs

PVA protocol client/server library and utilities.
https://mdavidsaver.github.io/pvxs/

Fix search performance when creating a large number of channels #39

Closed thomasives closed 1 year ago

thomasives commented 1 year ago

When creating a large number of channels (~1000) in a short amount of time, pvxs can end up sending several search requests for each channel, without giving the IOC a reasonable amount of time to respond. This results in search requests going missing, which delays how long it takes for all the channels to be found. The issue is that when we send an initial search for a channel we also send search requests for the channels we are still waiting on. This PR fixes that by separating the sending of initial search requests from the re-sending of failed search requests. More details in the commit messages.
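To make the shape of the change easier to picture, here is a rough conceptual sketch of the separation this PR introduces. This is not the actual pvxs code: the names (initialSearchBucket, initialSearcher(), tickSearch()) are borrowed from the discussion below, and locking, timer management and UDP encoding are all omitted.

#include <array>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-ins for pvxs internals.
struct Channel { std::string name; };
void sendSearchBatch(const std::vector<std::shared_ptr<Channel>>&) {
    /* encode and send one or more UDP search packets */
}

struct SearchScheduler {
    // Channels that have never been searched for yet.
    std::vector<std::shared_ptr<Channel>> initialSearchBucket;
    // Ring of buckets holding channels waiting to be (re)searched.
    std::array<std::vector<std::shared_ptr<Channel>>, 30> searchBuckets{};
    std::size_t currentBucket = 0;
    static constexpr std::size_t kRetryDelayTicks = 5; // give servers time to answer

    // Creating a channel only queues it for a one-time initial search; it no
    // longer triggers re-sends for every other channel still waiting.
    void onChannelCreated(std::shared_ptr<Channel> chan) {
        initialSearchBucket.push_back(std::move(chan));
        // (arm a one-shot "initialSearcher" timer here if not already armed)
    }

    // One-shot initial search: one request per new channel, after which the
    // channels are parked in the retry ring in case no server answers.
    void initialSearcher() {
        sendSearchBatch(initialSearchBucket);
        auto retry = (currentBucket + kRetryDelayTicks) % searchBuckets.size();
        for (auto& chan : initialSearchBucket)
            searchBuckets[retry].push_back(chan);
        initialSearchBucket.clear();
    }

    // Periodic retry tick, advancing one bucket per interval at the normal
    // (slow) search rate, independent of how fast new channels are created.
    void tickSearch() {
        sendSearchBatch(searchBuckets[currentBucket]);
        currentBucket = (currentBucket + 1) % searchBuckets.size();
    }
};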

To demonstrate the issue I am using this example program. This program creates 1000 monitors on PVs provided by the example_ioc from ajgdls/EPICSPyClientPerformance.git. When investigating the performance on my machine using cashark [1], I get the following when using the master branch:

> tshark -r master.pcap -Y "pva.command == SEARCH" -T fields -e frame.time_relative -e pva.count | 
    awk 'BEGIN {count=0;sum=0} NR==1 {start=$1} {count+=1;sum+=$2} END {end=$1; printf "npvs=%d nframes=%d time_delta=%f s\n", sum, count, end-start}'
npvs=11931 nframes=418 time_delta=8.033979 s

We are, on average, sending ~12 search requests for each PV and it is taking ~8 s to finish finding them all. With this PR I get the following in the same situation:

> tshark -r pr.pcap -Y "pva.command == SEARCH" -T fields -e frame.time_relative -e pva.count |
    awk 'BEGIN {count=0;sum=0} NR==1 {start=$1} {count+=1;sum+=$2} END {end=$1; printf "npvs=%d nframes=%d time_delta=%f s\n", sum, count, end-start}'
npvs=1000 nframes=282 time_delta=0.037329 s

With the PR, we only send a single search request for each PV and we have reduced the time it takes to finish finding all the PVs by over 200x compared to the master branch. On my system, these pcaps are representative of what happens for each version.
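For reference, the kind of test client being used here can be sketched as follows. This is only a rough sketch of such a program, not the one linked above: it assumes the documented pvxs client API (client::Context::fromEnv(), monitor(...).event(...).exec(), Subscription::pop()) and borrows the TEST:CALC00000-style PV names mentioned later in the thread.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <exception>
#include <memory>
#include <thread>
#include <vector>

#include <pvxs/client.h>

using namespace pvxs;

int main() {
    auto ctxt = client::Context::fromEnv();

    std::atomic<unsigned> events{0};
    std::vector<std::shared_ptr<client::Subscription>> subs;
    subs.reserve(1000);

    // Create 1000 monitors back to back; the initial search requests for
    // these channels are what the captures above are counting.
    for (unsigned i = 0; i < 1000; i++) {
        char name[32];
        std::snprintf(name, sizeof(name), "TEST:CALC%05u", i);
        subs.push_back(ctxt.monitor(name)
                           .maskConnected(true)
                           .maskDisconnected(true)
                           .event([&events](client::Subscription& sub) {
                               try {
                                   // Drain this subscription's update queue.
                                   while (sub.pop())
                                       events++;
                               } catch (std::exception&) {
                                   // errors/disconnects ignored in this sketch
                               }
                           })
                           .exec());
    }

    // Crude wait: loop until we have seen at least as many updates as subscriptions.
    while (events < subs.size())
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    return 0;
}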

@mdavidsaver Please let me know if you can think of a better way of fixing this. I'm happy to iterate on this if needs be.

Ping @ajgdls @coretl.

[1] with a small patch to add a pva.count field.

AppVeyorBot commented 1 year ago

:white_check_mark: Build pvxs 1.0.853 completed (commit https://github.com/mdavidsaver/pvxs/commit/48f43d1988 by @thomasives)

mdavidsaver commented 1 year ago

... sending several search requests for each channel, without giving the IOC a reasonable amount of time to respond.

Hmm. I guess I can see how this happens. Originally, a hurryUp() was needed after a batch of channel creations to reduce connect time. e52ae674ca70e6a3238623f0c94d0b71a4fb3067 bypassed this to always go faster. I kept forgetting about hurryUp(), and P4P has no way to express it. Consequently, prior to 89f9c54d6282e92e704c7bd2d3c23359ed9bdbc2 batching of search messages was poor (lots of small UDP packets). Now search requests and replies are batched better, but with so many PVs I think I can see how poke(true) spins the timer wheel much faster than I intended.

... I get the following ...

Have you looked to see how searches going through initialSearchBucket are being batched? e.g. plotting pva.count or frame size vs. frame number for pva.command == SEARCH for each pva.direction with the Wireshark GUI.

As I interpret npvs=1000 nframes=282, some batching must be happening. But an average of ~4 PVs per packet is much less than the ideal of 1400 byte packets (cf. maxSearchPayload).

Given that most overhead with handling UDP is per-packet, my goal for PVXS is to send fewer/larger search requests and responses. Of course, a trade-off needs to be made here between efficiency and latency. Also, implicitly, ease of use, since the current API doesn't have a way to express a batch of channel creations. ("good, fast, cheap" appears again...)

[1] with a small patch to add a pva.count field.

This would make a useful addition to https://github.com/mdavidsaver/cashark/pulls

mdavidsaver commented 1 year ago

@mdavidsaver Please let me know if you can think of a better way of fixing this. I'm happy to iterate on this if needs be.

I find myself wondering if there would be better batching by scheduling initialSearcher with a short delay (say 100ms) as a coalescence time.
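For concreteness, the kind of coalescing being suggested might look roughly like the sketch below, using libevent's one-shot timers (pvxs is built on libevent). This is only an illustration of the idea; the names are placeholders, not the eventual implementation.

#include <string>
#include <utility>
#include <vector>

#include <sys/time.h>       // struct timeval
#include <event2/event.h>

struct InitialSearcher {
    event_base* base;
    event* timer;
    std::vector<std::string> pending; // stand-in for initialSearchBucket

    explicit InitialSearcher(event_base* b)
        : base(b)
        , timer(evtimer_new(b, &InitialSearcher::expire, this))
    {}
    ~InitialSearcher() { event_free(timer); }

    // Called for each newly created channel (from the client worker loop).
    void queueInitialSearch(std::string name) {
        pending.push_back(std::move(name));
        if (!evtimer_pending(timer, nullptr)) {
            // Arm the coalescing delay once per burst of channel creations.
            timeval delay{0, 100000}; // 100 ms, the figure suggested above
            evtimer_add(timer, &delay);
        }
    }

    static void expire(evutil_socket_t, short, void* raw) {
        auto* self = static_cast<InitialSearcher*>(raw);
        // Everything queued during the delay goes out as a few large,
        // well-packed UDP search requests, then the bucket is cleared.
        self->pending.clear();
    }
};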

thomasives commented 1 year ago

I find myself wondering if there would be better batching by scheduling initialSearcher with a short delay (say 100ms) as a coalescence time.

I had wondered about this too, but I was worried about affecting the latency when creating a monitor, so I did not add a delay. As you say, if we try to improve batching by adding a delay here then there is a compromise to be made between how well we batch search requests and how much we increase the channel creation latency. I suppose I had decided (without much thought) that no increase to the latency was acceptable, but perhaps that isn't the best choice. Below is some data I have collected to help us choose a delay to add (if any).

Just as a number to keep in mind, I have measured the latency of creating a monitor and receiving the first event (using monitor_latency.cpp in [1]) and on my machine I get:

Monitor latency = 6.24228 +/- 0.0446699 (min=1.46271, max=16.9533) ms

Note, I am running the pvxs client on the same machine as the IOC so this is a best case scenario. In a more realistic situation there will be some network latency on top of this; I guess this will be something in the 10s of milliseconds range, but I don't really know for sure.

We can make a rough estimate of how long we have to delay for the initialSearchBucket to have at least as many channels as fit in a single UDP packet by measuring the rate at which channels get created on the main thread. I have measured this too (with create_throughput.cpp from [1], 500 samples), on my machine I get:

Channel create rate = 40297.5 +/- 1082.34 (min=9081.91, max=137333) channels per second

Just to be clear, this is how quickly the main thread adds items to the tcp_loop work queue, not adding them to the initialSearchBucket. The loop is not optimised, so we could probably go faster, but hopefully it is reasonably representative of a C++ application that is creating a lot of channels at once. A python application will almost certainly be much slower than this.

For our channel names (e.g. TEST:CALC00000) we can fit 71 channels in a search request. On my machine, it should take the application ~2 ms to add this many channels to the work queue on average, worst case it should take ~8 ms. There will also be some overhead from the tcp_loop having to actually process the work queue and add the channels to the initialSearchBucket, but this is a bit harder to measure so I haven't bothered.
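Spelling out the arithmetic behind those estimates (all numbers from the measurements above):

71 channels / 40297.5 channels/s ≈ 1.8 ms (average creation rate)
71 channels / 9081.91 channels/s ≈ 7.8 ms (slowest observed creation rate)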

This simple analysis seems to agree roughly with what I observe with wireshark:

[Figure "chperframe": channels per frame for each capture, plotted against the initial search delay]

Here I have taken 4 captures at each delay (500 us, 1 ms, 5 ms, 10 ms, 100 ms) and plotted the channels per frame for each capture against the delay. The labels are the minimum and maximum number of frames for a given delay.

For 100 ms we are optimal at only 15 frames (1000 = 14 * 71 + 6). At 10 ms (which is near our worst case limit of ~8 ms), we are only occasionally not optimal, with only one of the four captures taking 16 frames. At a 5 ms delay (closer to our average time of ~2 ms) we are pretty close to optimal with a spread of 15 - 19 frames. Even with a 500 us delay we are doing much better than with no delay, so perhaps even that is acceptable.

These are all small-sample measurements and we are using a pretty silly example program, so none of this should be taken too seriously. And we need to bear in mind that a python application will be creating channels at a slower rate, so we would need a longer delay to get the same effects as we see here. Despite these shortcomings, I think a delay of 5 ms is probably an acceptable overhead to add to the latency of creating a channel, and it looks like it would help a lot with batching searches when creating a lot of channels at once, even for much slower python applications.

@mdavidsaver What do you think? Would you compromise differently?

[1] https://gist.github.com/thomasives/1df7f1c668a465e8201819434e5b5112

thomasives commented 1 year ago

Given that most overhead with handling UDP is per-packet, my goal for PVXS is to send fewer/larger search requests and responses. Of course, a trade-off needs to be made here between efficiency and latency.

Just for completeness, I think the ideal would be some sort of adaptive delay: if the application is creating a lot of channels, we wait so we can batch the search requests, whereas if it is only creating one or two we do not delay and prioritise latency instead. I think we could probably get what is effectively an adaptive delay if we had some priority work queue to schedule things with. However, it would increase the complexity of pvxs by a lot and there would be lots of edge cases to work out. I'm not sure this is really a route we want to go down, as it would be a lot of work and it is unclear how well it would work in practice.

Also, implicitly, ease of use, since the current API doesn't have a way to express a batch of channel creations.

I think if we want different use cases to perform optimally (even if we have some clever adaptive system), then we ultimately need to give more control to the application so it can tell us what it is trying to do. However, we might not need to do something as hard to use well as e.g. a BatchedMonitorBuilder. Perhaps we could get away with just allowing the application to set this delay somehow. For example, we could add beginBatching() and endBatching() functions which respectively increase and decrease the initialSearcher delay to some fixed values. I would perhaps leave it for now, however, and see how this performs in the real world.
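A purely hypothetical shape for that kind of hint, just to make the suggestion concrete (none of this exists in pvxs; the names and delay values are made up):

#include <atomic>
#include <chrono>

class BatchingHint {
    std::atomic<int> depth{0};
public:
    // Application is about to create many channels: widen the coalescing delay.
    void beginBatching() { ++depth; }
    // Burst finished: fall back to the low-latency default.
    void endBatching()   { --depth; }

    // The value the initialSearcher timer would be armed with.
    std::chrono::milliseconds initialSearchDelay() const {
        return depth > 0 ? std::chrono::milliseconds(100)
                         : std::chrono::milliseconds(10);
    }
};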

This would make a useful addition to https://github.com/mdavidsaver/cashark/pulls

https://github.com/mdavidsaver/cashark/pull/13

mdavidsaver commented 1 year ago

Here I have taken 4 captures at each delay (500 us, 1 ms, 5 ms, 10 ms, 100 ms) ...

Based on this, I'm convinced that 100ms is excessive. I'd like to start with 10ms on the theory that your tests are done on a host which is more performant than average.

... I think the ideal would be some sort of adaptive delay ...

During the recent codeathon at DLS, @ralphlange and I spent some time troubleshooting an odd performance inversion he sees with libca where a large client takes hours to reconnect through cagateway. It turns out that libca searching attempts to adaptively throttle broadcast bandwidth by limiting the number of search packets sent based on the fraction of positive responses received recently.

I see this as an example of how tricky "adaptive" can be. Also the fact that the searching algorithms of libca still have issues decades on is an indication that there is no optimal solution to this problem.

... Perhaps we could get away with just allowing the application to set this delay somehow. ...

As future work, I'm thinking about adding a search policy selection when creating client context and/or channel. The options would be "low latency" (the default) to put the newly created channel in initialSearchBucket, vs. "constant load" which would put it into a random or least full bucket.
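Just to illustrate the distinction (this is hypothetical, not an existing or planned pvxs API signature):

#include <cstdlib>

// "low latency": queue the new channel for an immediate initial search.
// "constant load": start it somewhere in the retry ring to spread UDP load.
enum class SearchPolicy { LowLatency, ConstantLoad };

// Where a newly created channel starts searching from.
// Returns -1 for the initial search bucket, otherwise an index into the ring.
int initialPlacement(SearchPolicy policy, unsigned nBuckets) {
    if (policy == SearchPolicy::LowLatency)
        return -1;
    return static_cast<int>(std::rand() % nBuckets); // or pick the least-full bucket
}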

When talking about connecting 10000 PVs, I'm used to thinking about non-interactive clients like a data archiver system where (historically) it was considered impolite to slam the network with so many UDP broadcasts all at once.

thomasives commented 1 year ago

Sorry I accidentally closed the PR when I tried to rebase using the github interface. I did not realise the button I pressed would do that.

I have added a 10 ms delay to the initial search as you suggested.

I see this as an example of how tricky "adaptive" can be. Also the fact that the searching algorithms of libca still have issues decades on is an indication that there is no optimal solution to this problem.

Indeed, this is a hard problem to implement correctly.

ralphlange commented 1 year ago

See https://github.com/epics-base/epics-base/issues/372 for the search performance issue seen on CA that @mdavidsaver mentioned above.

AppVeyorBot commented 1 year ago

:white_check_mark: Build pvxs 1.0.866 completed (commit https://github.com/mdavidsaver/pvxs/commit/fa6b52c1f7 by @thomasives)

mdavidsaver commented 1 year ago

Sorry I accidentally closed the PR ...

It looks like you were able to straighten this out. Although for some reason I had to re-enable the GHA runs. Last I checked I only had to do this once per account, but then github.com is a moving target.

Anyway, this change looks ok to me.

AppVeyorBot commented 1 year ago

:white_check_mark: Build pvxs 1.0.867 completed (commit https://github.com/mdavidsaver/pvxs/commit/cf47f82b21 by @thomasives)

thomasives commented 1 year ago

I have been looking at the test failures on OS X and I am struggling to understand how my changes could have caused this. Unfortunately, I don't have access to an OS X machine, so it is hard for me to investigate.

The test that is failing is testsock.cpp:test_udp; we are getting an EAGAIN (errno = 35 on OS X) from a recvmsg syscall. I don't quite understand how this is possible because I do not see where the socket gets marked as SOCK_NONBLOCK/O_NONBLOCK or where SO_RCVTIMEO is set. Without these options the socket should block until data is received (and we receive nothing, as ret=-1). I am basing this on the Linux/FreeBSD man pages, however, so perhaps the OS X socket implementation has different behaviour...

Regardless of why this is happening, I don't see how my changes can cause it as this is testing the evhelper classes which are at a lower level than my changes.

@mdavidsaver any ideas?

mdavidsaver commented 1 year ago

The issue with testsock is known. cf. https://github.com/mdavidsaver/pvxs/issues/23 When writing it I apparently made incorrect assumptions about how buffering is done by an OS behind the sockets interface. I haven't yet come up with a more robust way to test the behavior I want that I also have time to implement.

mdavidsaver commented 1 year ago

... I do not see where the socket gets marked as SOCK_NONBLOCK/O_NONBLOCK ...

In evsocket::evsocket() via a libevent wrapper.

https://github.com/mdavidsaver/pvxs/blob/c8f28f373af6208a9b23c04a0201a61ba161a113/src/evhelper.cpp#L393
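For anyone following along: on POSIX platforms that libevent helper boils down to setting O_NONBLOCK on the descriptor, which is why recvmsg() on an empty receive queue returns EAGAIN/EWOULDBLOCK instead of blocking. Roughly (simplified; the real libevent function also covers Windows and error reporting):

#include <fcntl.h>

// Approximately what libevent's non-blocking helper does on POSIX.
int make_socket_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}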

mdavidsaver commented 1 year ago

Fixing testsock would be a separate PR.

I've been working with the changes in this PR, which so far work as expected. I found one mistake I made before, which this PR perpetuates, where a discovery tickSearch() incorrectly reschedules searchTimer. I'm also looking at changing poke(): removing the force argument so the hold-off time always applies, and, instead of moving up one bucket, running through the entire bucket ring once at a faster rate.

mdavidsaver commented 1 year ago

Merged as of fe5a35eb022712118dc268e30c63dd2f1927efff. Thanks!