Closed danielzgtg closed 1 year ago
Okay, wait, so this is essentially replacing the queue and some arrays with fixed-size ones and inlining a couple of things? I would not have expected that... any ideas where exactly this comes from? Removed access checks? I would have expected the STL queue to perform a bit better, especially given that it was pre-allocated (so this speed-up shouldn't really come from the fixed size).
Me assuming that STL implementations would be at least somewhat optimized is the reason I went with those in the first place... (that and me being lazy).
The strange thing is that inlining isn't always a good thing. In ab8831c3894ad61e5c5c8df66df48fafc655492c, I inlined a bunch of methods; inlining `evaluate` was a 300μs pessimization. I thought the `constexpr` would help, but it didn't make much of a difference.
The thing with the STL is that it can't make as many assumptions as you can when you are also the one consuming the class. The STL uses references everywhere, whereas we know the heap contains a cheaply copyable type, so I can omit the destructors and just copy the values around. I also implemented the pop-then-push optimization, except detected with a boolean instead of exposed as the usual single method. I could also decide to ignore IEEE 754 subtleties and just bit_cast the floats to integers for comparison.
Looking through what I did, there are so many new assumptions that it is likely no longer portable. I would need to write this properly if it ever needs to be generalized. Having the size constant at least allows the compiler to unroll loops. Even when it doesn't unroll, it can index the images faster by encoding the multiplicative constant inside the instruction instead of loading it from a register, and it's faster still when the size is a power of 2.
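A minimal illustration of the constant-stride point, assuming the 44x64 heatmap dimensions from the logs (the `at` helper is hypothetical, not iptsd code):

```cpp
#include <cstddef>

// With the column count known at compile time, y*COLS + x lets the compiler
// fold the multiply into the addressing mode, or into a shift when COLS is a
// power of two, instead of keeping a runtime stride in a register.
constexpr std::size_t COLS = 64; // power of two
constexpr std::size_t ROWS = 44;

inline float &at(float (&img)[ROWS * COLS], std::size_t y, std::size_t x) {
    return img[y * COLS + x]; // y*64 compiles down to y<<6
}
```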
Not all the benefit from arrays comes from removed access checks, though they were a major part: I saw heap and image functions in perf before, and now they are gone. Access checks are indeed something that will block autovectorization. But I was thinking of something else: memory layout matters a lot. Part of the improvement in this PR comes from changing the array-of-structures `alg::wdt::QItem<f32> data[QUEUE_MAX]` into the struct-of-arrays `index_t ind[QUEUE_MAX]; i32 cost[QUEUE_MAX];`. Yes, I tested without the i32 change, and the memory layout was a more noticeable improvement than the data type. Why this happens probably comes down to the processor's prefetching and caching.
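A sketch of that layout change, with illustrative names (`QItemAoS`, `QueueSoA`) rather than the real iptsd declarations:

```cpp
#include <cstddef>
#include <cstdint>

using index_t = std::uint16_t;
constexpr std::size_t QUEUE_MAX = 64 * 44 * 4; // illustrative capacity

// Array-of-structures: index and cost are interleaved, so a scan over
// just the costs drags the unused indices through the cache too.
struct QItemAoS {
    index_t ind;
    float cost;
};
QItemAoS queue_aos[QUEUE_MAX];

// Structure-of-arrays: the hot field (cost) is densely packed, which is
// friendlier to prefetching and to autovectorization.
struct QueueSoA {
    index_t ind[QUEUE_MAX];
    std::int32_t cost[QUEUE_MAX];
};
QueueSoA queue_soa;

// Example hot loop: position of the smallest cost among the first n entries.
std::size_t argmin(const QueueSoA &q, std::size_t n) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < n; ++i)
        if (q.cost[i] < q.cost[best]) best = i;
    return best;
}
```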
When I thought of replacing `std::vector` with arrays, it wasn't because of the `malloc` time. It was because the data should be stored in one contiguous location, not 10+ places scattered across memory. Elsewhere this kind of fragmentation is noticeable with filesystems backed by rotational hard drives. In memory it doesn't usually matter, until we get to high-performance applications like this one. I once failed a question in a programming contest when I used a linked list instead of an array; others' array solutions were an order of magnitude faster even for sequential access. However, most of the improvement explored here comes from the queue, and I could not measure any difference from the arrays, though the idea may be useful for future vectorization.
Another thing I experimented with was bucket-sorting the queued pixels. This was intended to address later pixels needing to be managed in the heap when we might already know they will be rejected. It seemed like I gained back the 300μs I lost with another change, but further tests were inconclusive.
50% of the time is spent inside `evaluate`. It branches heavily and I can't vectorize it. I think I've now reached the limit of this type of optimization, as I don't see how to improve that function.
> The strange thing is that inlining isn't always a good thing.
From what I know, that could be a cache thing? Inlining usually makes code larger, so there's a balance between function-call overhead and fitting the most frequently used code into the cache. Even if a function is not inlined, it's likely in the cache, so it's reasonably quick to call.
> The thing with STL is that they can't make as many assumptions as you can when you're the same person consuming the class. STL uses references everywhere, where we know the heap contains a copyable type.
Right, that's still a big performance improvement. Way bigger than I'd have thought.
> Part of the improvement in this PR is from changing array-of-structures `alg::wdt::QItem<f32> data[QUEUE_MAX]` into struct-of-arrays `index_t ind[QUEUE_MAX]; i32 cost[QUEUE_MAX];`
Ah yes, I suppose that makes sense.
> When I thought of replacing `std::vector` with arrays, it wasn't because of the `malloc` time. It was because the data should be stored in one contiguous location, not 10+ places scattered across memory.
Right, I'd have thought that this would have a pretty minimal effect, since the `std::vector` data itself is contiguous. Vector vs. linked list, definitely. But I guess it might make some sense if we have a bunch of small vectors.
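A small sketch of the "bunch of small vectors" concern: one vector per row means one heap allocation per row, while a single flat buffer with a stride keeps everything in one contiguous block. The `Rows` type is hypothetical, just for illustration.

```cpp
#include <cstddef>
#include <vector>

// One allocation for all rows, indexed by a stride, instead of
// std::vector<std::vector<float>> with one allocation per row.
struct Rows {
    std::vector<float> data;
    std::size_t cols;
    float *row(std::size_t r) { return data.data() + r * cols; }
};

Rows make_rows(std::size_t rows, std::size_t cols) {
    return Rows{std::vector<float>(rows * cols, 0.0f), cols};
}
```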
> 50% of the time is spent inside `evaluate`. It is heavily branching and I can't vectorize it. Now I think I've reached the limit of this type of optimization as I don't see how to improve that function.
Yeah, that's a bit of a problem with the wdt... which I think is also more or less the core problem of the whole "advanced" processor.
Also: Nice work.
Branch misprediction is making a lot of difference here. The advanced detector is printed in red under "Bad Speculation":
$ ./perf stat ./iptsd-perf ../iptsdump.dat
[18:49:09.246] [info] Vendor: 045E
[18:49:09.246] [info] Product: 099F
[18:49:09.246] [info] Buffer Size: 7487
[18:49:09.246] [info] Metadata:
[18:49:09.246] [info] rows=44, columns=64
[18:49:09.246] [info] width=25978, height=17319
[18:49:09.246] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[18:49:09.246] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[18:49:12.125] [info] Ran 2010 times
[18:49:12.125] [info] Total: 2866982μs
[18:49:12.125] [info] Mean: 1426.36μs
[18:49:12.125] [info] Standard Deviation: 469.11μs
[18:49:12.125] [info] Minimum: 107.627μs
[18:49:12.125] [info] Maximum: 2778.798μs
Performance counter stats for './iptsd-perf ../iptsdump.dat':
2881.99 msec task-clock # 0.999 CPUs utilized
70 context-switches # 24.289 /sec
1 cpu-migrations # 0.347 /sec
1287 page-faults # 446.566 /sec
10736721941 cycles # 3.725 GHz
29968450814 instructions # 2.79 insn per cycle
5295049467 branches # 1.837 G/sec
63390337 branch-misses # 1.20% of all branches
53556029135 slots # 18.583 G/sec
29403310113 topdown-retiring # 54.9% Retiring
12181371332 topdown-bad-spec # 22.7% Bad Speculation
7770874815 topdown-fe-bound # 14.5% Frontend Bound
4233573334 topdown-be-bound # 7.9% Backend Bound
2.883487328 seconds time elapsed
2.878614000 seconds user
0.003998000 seconds sys
Compare the basic detector, which has fewer mispredictions and is red at "Backend Bound":
$ ./perf stat ./iptsd-perf ../iptsdump.dat
[18:56:15.989] [info] Vendor: 045E
[18:56:15.989] [info] Product: 099F
[18:56:15.989] [info] Buffer Size: 7487
[18:56:15.989] [info] Metadata:
[18:56:15.989] [info] rows=44, columns=64
[18:56:15.989] [info] width=25978, height=17319
[18:56:15.989] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[18:56:15.989] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[18:56:16.043] [info] Ran 2010 times
[18:56:16.043] [info] Total: 40777μs
[18:56:16.043] [info] Mean: 20.29μs
[18:56:16.043] [info] Standard Deviation: 5.88μs
[18:56:16.043] [info] Minimum: 19.088μs
[18:56:16.043] [info] Maximum: 168.057μs
Performance counter stats for './iptsd-perf ../iptsdump.dat':
56.94 msec task-clock # 0.992 CPUs utilized
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
1088 page-faults # 19.107 K/sec
198528743 cycles # 3.487 GHz
539366736 instructions # 2.72 insn per cycle
119000517 branches # 2.090 G/sec
479160 branch-misses # 0.40% of all branches
988875930 slots # 17.366 G/sec
500254882 topdown-retiring # 50.6% Retiring
62047117 topdown-bad-spec # 6.3% Bad Speculation
120216289 topdown-fe-bound # 12.2% Frontend Bound
306357641 topdown-be-bound # 31.0% Backend Bound
0.057411960 seconds time elapsed
0.053407000 seconds user
0.004108000 seconds sys
Or mine which is also "Backend Bound" at 11μs (I think I need to implement the better algorithm I read about to get it to 7μs):
$ ./perf stat ./engine-avx512
[(123, 642), (2736, 1824), (2736, 1824), (2736, 1824), (2736, 1824), (2736, 1824), (2736, 1824), (2736, 1824), (2736, 1824), (2736, 1824)] 1000000 11326ms
Performance counter stats for './engine-avx512':
11317.15 msec task-clock # 0.999 CPUs utilized
259 context-switches # 22.886 /sec
5 cpu-migrations # 0.442 /sec
92 page-faults # 8.129 /sec
42191550072 cycles # 3.728 GHz
44069120800 instructions # 1.04 insn per cycle
2872964956 branches # 253.859 M/sec
4419970 branch-misses # 0.15% of all branches
210629936170 slots # 18.612 G/sec
96881314177 topdown-retiring # 45.2% Retiring
12389996245 topdown-bad-spec # 5.8% Bad Speculation
10884461496 topdown-fe-bound # 5.1% Frontend Bound
94138491835 topdown-be-bound # 43.9% Backend Bound
11.327838952 seconds time elapsed
11.314512000 seconds user
0.003996000 seconds sys
EDIT: Actually, profiling points to 17% iptsd::contact::neutral, 13% range checks, 18% iptsd::ipts::Reader::read, and 24% iptsd::contacts::ContactFinder::track (mostly 17% std::min_element). Since everything is well below 1ms, I don't feel a need to optimize the basic detector here.
Out of curiosity, which compiler are you using? I noticed that clang seems to produce faster code on my desktop than gcc.
With the sample data I am using, the mean times for the advanced processor look like this:
1442.50μs
1334.05μs
1253.75μs
1025.66μs
On different sample data, the difference has been up to a millisecond.
Currently iptsd enables the libstdc++ access checks (`-D_GLIBCXX_ASSERTIONS`) by default, because they have almost no impact on the basic detector and the additional checks are nice to have. A very simple optimization could be to build our packages with clang and to always disable the access checks for the advanced touch processor:
```diff
diff --git a/src/contacts/advanced/detector.cpp b/src/contacts/advanced/detector.cpp
index 550827a..e939ebe 100644
--- a/src/contacts/advanced/detector.cpp
+++ b/src/contacts/advanced/detector.cpp
@@ -1,3 +1,7 @@
+#ifdef _GLIBCXX_ASSERTIONS
+#undef _GLIBCXX_ASSERTIONS
+#endif
+
 #include "detector.hpp"
 #include "../neutral.hpp"
```
I am on gcc. The `constexpr` m_kern_pp in detector.cpp doesn't even compile on clang; I had to s/constexpr/const there. As you suggested, I ran `CC=clang CXX=clang++ meson --wipe build` and reran it on 9e50f9e70a3ebcccbd1c4dbc3514b8279f548b96:
[07:20:47.375] [info] Vendor: 045E
[07:20:47.375] [info] Product: 099F
[07:20:47.375] [info] Buffer Size: 7487
[07:20:47.375] [info] Metadata:
[07:20:47.375] [info] rows=44, columns=64
[07:20:47.375] [info] width=25978, height=17319
[07:20:47.375] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[07:20:47.375] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[07:20:49.224] [info] Ran 2010 times
[07:20:49.224] [info] Total: 1839020μs
[07:20:49.224] [info] Mean: 914.94μs
[07:20:49.224] [info] Standard Deviation: 280.80μs
[07:20:49.224] [info] Minimum: 125.850μs
[07:20:49.224] [info] Maximum: 1468.382μs
The compiler does indeed produce faster binaries. I applied your bounds check patch and here is the output:
[07:25:13.548] [info] Vendor: 045E
[07:25:13.548] [info] Product: 099F
[07:25:13.548] [info] Buffer Size: 7487
[07:25:13.548] [info] Metadata:
[07:25:13.548] [info] rows=44, columns=64
[07:25:13.548] [info] width=25978, height=17319
[07:25:13.548] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[07:25:13.548] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[07:25:15.390] [info] Ran 2010 times
[07:25:15.390] [info] Total: 1831444μs
[07:25:15.390] [info] Mean: 911.17μs
[07:25:15.390] [info] Standard Deviation: 278.14μs
[07:25:15.390] [info] Minimum: 124.460μs
[07:25:15.390] [info] Maximum: 1730.592μs
One problem with clang is that, unlike gcc, it doesn't show function names for inlined functions. This is the perf after that patch.
I checked out master at 8624107747a91d9f43e84f62f27a953c71d1a78f. It did not build on clang until I replaced dump.cpp with `int main() {}`. Here is the unpatched output:
[07:33:06.562] [info] Vendor: 045E
[07:33:06.562] [info] Product: 099F
[07:33:06.562] [info] Buffer Size: 7487
[07:33:06.562] [info] Metadata:
[07:33:06.562] [info] rows=44, columns=64
[07:33:06.562] [info] width=25978, height=17319
[07:33:06.562] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[07:33:06.562] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[07:33:09.674] [info] Ran 2010 times
[07:33:09.674] [info] Total: 3102096μs
[07:33:09.674] [info] Mean: 1543.33μs
[07:33:09.674] [info] Standard Deviation: 475.15μs
[07:33:09.674] [info] Minimum: 189.010μs
[07:33:09.674] [info] Maximum: 2446.472μs
Here is the patched output:
[07:34:25.648] [info] Vendor: 045E
[07:34:25.648] [info] Product: 099F
[07:34:25.648] [info] Buffer Size: 7487
[07:34:25.648] [info] Metadata:
[07:34:25.648] [info] rows=44, columns=64
[07:34:25.648] [info] width=25978, height=17319
[07:34:25.648] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[07:34:25.648] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[07:34:28.322] [info] Ran 2010 times
[07:34:28.322] [info] Total: 2663423μs
[07:34:28.322] [info] Mean: 1325.09μs
[07:34:28.322] [info] Standard Deviation: 420.22μs
[07:34:28.322] [info] Minimum: 141.820μs
[07:34:28.322] [info] Maximum: 2730.163μs
It still takes too long in `std::priority_queue::push`. The STL can't break apart the object or apply the pop-then-push optimization. Here is the perf graph for that:
Environment:
home@daniel-desktop3:~$ gcc --version
gcc (Ubuntu 12.2.0-3ubuntu1) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
home@daniel-desktop3:~$ g++ --version
g++ (Ubuntu 12.2.0-3ubuntu1) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
home@daniel-desktop3:~$ clang --version
Ubuntu clang version 15.0.7
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
home@daniel-desktop3:~$ clang++ --version
Ubuntu clang version 15.0.7
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
home@daniel-desktop3:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.10
Release: 22.10
Codename: kinetic
home@daniel-desktop3:~$ uname -a
Linux daniel-desktop3 5.19.0-31-generic #32-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 20 15:20:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
My testing data is iptsdump.dat.zip and my testing config is iptsd.conf.txt. Why is there a priority_queue in the first place? It means this is a branch-heavy algorithm and that we have been targeting the average case instead of the worst case. This shows up as the CPU skyrocketing to 100% when I use all 10 fingers. I might someday consider implementing https://github.com/qzed/digitizer-prototype branchlessly to see if that works.
> Why is there a priority_queue in the first place?
Mostly because I wasn't sure how to implement it without one, other than iterating some N times over the whole image for a potentially large N (an easy approximation would be limiting that N).
The general idea here is that we first extract "core" components, where you know that those are separate touch points. Extracting those reliably unfortunately drops some of the surrounding area, and ideally we'd like to include that because we want to properly filter out any non-finger touches. The real problem here is that the whole Gaussian fitting doesn't work well if a) there are non-Gaussian-like shapes or b) some residual values from improperly filtering out non-finger touches remain (which then skew the results).
So we essentially start off with those "core" components, then do a weighted distance transform from them to include the surrounding areas, until we have fully broken down the whole sensor image into its individual components. Note that some components can be overlapping, which is where the costs of the WDT help. So what the WDT does is essentially compute the smallest "weighted distance" from each surrounding pixel to its nearest component (and also the label of that component). The queue is a fairly naive implementation of this, I guess, where you just get the next candidate pixel with the smallest cost, write it down, and add its non-evaluated neighbors to the queue.
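That description can be sketched as a Dijkstra-style loop; everything below (the names, the 4-neighborhood, the per-pixel cost function) is illustrative, not the actual iptsd implementation.

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <queue>
#include <vector>

struct Cand {
    float cost;
    int pixel;
    int label;
    bool operator>(const Cand &o) const { return cost > o.cost; }
};

// Weighted distance transform: each pixel receives the smallest weighted
// distance to a seed component, plus that component's label.
void wdt(int w, int h,
         const std::vector<float> &step_cost, // hypothetical per-step weight
         const std::vector<int> &seeds,       // seed pixel -> label, -1 = none
         std::vector<float> &dist,
         std::vector<int> &label) {
    dist.assign(w * h, std::numeric_limits<float>::infinity());
    label.assign(w * h, -1);

    std::priority_queue<Cand, std::vector<Cand>, std::greater<Cand>> q;
    for (std::size_t p = 0; p < seeds.size(); ++p)
        if (seeds[p] >= 0) {
            dist[p] = 0.0f;
            label[p] = seeds[p];
            q.push({0.0f, static_cast<int>(p), seeds[p]});
        }

    const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
    while (!q.empty()) {
        Cand c = q.top(); q.pop();
        if (c.cost > dist[c.pixel]) continue; // stale queue entry
        int x = c.pixel % w, y = c.pixel / w;
        for (int d = 0; d < 4; ++d) {
            int nx = x + dx[d], ny = y + dy[d];
            if (nx < 0 || nx >= w || ny < 0 || ny >= h) continue;
            int np = ny * w + nx;
            float nc = c.cost + step_cost[np];
            if (nc < dist[np]) { // relax neighbor
                dist[np] = nc;
                label[np] = c.label;
                q.push({nc, np, c.label});
            }
        }
    }
}
```

The data-dependent branching the discussion keeps running into lives in the `if (nc < dist[np])` relaxation and in the priority queue itself.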
I'm not really sure how to implement that without too much branching. As I've said, the only other idea I had is an iterative approach that runs across the full image a couple of times. Theoretically, that N should be quite large (essentially the longest possible path between two pixels), but in practice you can probably reduce it quite a bit. Still, there's lots of "empty" space in each iteration, and the N is another hyperparameter to tune. This is probably how I'd have implemented it if I'd written it for a GPU.
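The iterative alternative might look roughly like this; `wdt_iterative` and its conventions (seeds pre-written into `dist`/`label`, a 4-neighborhood) are assumptions for illustration. It is branch-light per pixel and maps well to GPUs, at the price of redundant full-image sweeps.

```cpp
#include <limits>
#include <vector>

// Jacobi-style sweeps: relax every pixel from its neighbors, up to n_iters
// times. Caller initializes dist to 0 at seeds (with their labels) and to
// infinity elsewhere.
void wdt_iterative(int w, int h, int n_iters,
                   const std::vector<float> &step_cost,
                   std::vector<float> &dist,
                   std::vector<int> &label) {
    const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
    for (int it = 0; it < n_iters; ++it) {
        bool changed = false;
        for (int y = 0; y < h; ++y) {
            for (int x = 0; x < w; ++x) {
                int p = y * w + x;
                for (int d = 0; d < 4; ++d) {
                    int nx = x + dx[d], ny = y + dy[d];
                    if (nx < 0 || nx >= w || ny < 0 || ny >= h) continue;
                    int np = ny * w + nx;
                    float c = dist[np] + step_cost[p];
                    if (c < dist[p]) {
                        dist[p] = c;
                        label[p] = label[np];
                        changed = true;
                    }
                }
            }
        }
        if (!changed) break; // converged before the worst-case N
    }
}
```

The early-out on `changed` is one way to avoid tuning N exactly, though it reintroduces a (cheap, once-per-sweep) branch.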
One thing that I thought about ages ago was that it should be possible to: a) get the core components, b) compute a bounding box around them (just take 5 px on each side or something like that), and then process those smaller images. The "halo" around the components should be large enough to contain any other intersecting component labels, otherwise it won't be able to properly filter things out. It might also be possible to do some labeling before that with a lower threshold to extract the larger connected components and then just work on those (to ensure that intersecting components end up in the same extracted frame).
I ran the same test on my Surface, this time using your data and config file, @danielzgtg:
env CC=gcc CXX=g++ meson setup build
[14:41:29.630] [info] Ran 2010 times
[14:41:29.630] [info] Total: 6666318μs
[14:41:29.630] [info] Mean: 3316.58μs
[14:41:29.630] [info] Standard Deviation: 1113.44μs
[14:41:29.630] [info] Minimum: 234.274μs
[14:41:29.630] [info] Maximum: 6310.144μs
env CC=gcc CXX=g++ meson setup build -Daccess_checks=false
[14:42:18.627] [info] Ran 2010 times
[14:42:18.627] [info] Total: 5806799μs
[14:42:18.627] [info] Mean: 2888.95μs
[14:42:18.627] [info] Standard Deviation: 994.28μs
[14:42:18.627] [info] Minimum: 134.039μs
[14:42:18.627] [info] Maximum: 6440.459μs
env CC=clang CXX=clang++ meson setup build
[14:41:54.287] [info] Ran 2010 times
[14:41:54.287] [info] Total: 5546825μs
[14:41:54.287] [info] Mean: 2759.61μs
[14:41:54.287] [info] Standard Deviation: 855.63μs
[14:41:54.287] [info] Minimum: 357.218μs
[14:41:54.287] [info] Maximum: 8571.065μs
env CC=clang CXX=clang++ meson setup build -Daccess_checks=false
[14:42:43.315] [info] Ran 2010 times
[14:42:43.316] [info] Total: 4434157μs
[14:42:43.316] [info] Mean: 2206.05μs
[14:42:43.316] [info] Standard Deviation: 686.58μs
[14:42:43.316] [info] Minimum: 244.878μs
[14:42:43.316] [info] Maximum: 4347.347μs
Since the advanced algorithm is gone now, I am going to close this. Nonetheless, thank you for your work on optimizing it!
This is surprising. I thought I could only gain 5%; nobody, not even I, thought this would happen. Instead we gain another 1000μs, giving a 50% improvement.
This PR isn't planned to be merged any time soon, unless someone really insists; the fact that I didn't bother cleaning up the code shows this. The purpose of this PR is just to show that it's possible to get improvements of this size without changing the algorithm.
Looking at the perf graph, `MyQueue::{push,pop}` is 20%, so this seems to be the end of the road for this type of optimization. I had been doing architecture-preserving, function-only optimizations that didn't touch the globally-passed-around data structures much. As I mentioned in the other PR, we might need to consider fixed-length data structures that trade flexibility for performance.

CPU usage dropped from 45% to 36% in iptsd, but still spikes to 100% when using 10 fingers. The next step I'm thinking of is trying to remove recursion/branching, but that might greatly change the essence of the algorithms.
Before
See #113
After