Closed danielzgtg closed 1 year ago
Okay, wait, so this is essentially replacing the queue and some arrays with fixed-size ones and inlining a couple of things? I would not have expected that... any ideas where exactly this comes from? Removed access checks? I would have expected the STL queue to perform a bit better, especially given that it was pre-allocated (so this speed-up shouldn't really come from the fixed size).
Me assuming that STL implementations would be at least somewhat optimized is the reason I went with those in the first place... (that and me being lazy).
The strange thing is that inlining isn't always a good thing. In ab8831c3894ad61e5c5c8df66df48fafc655492c, I inlined a bunch of methods; inlining `evaluate` was a 300μs pessimization. I thought the `constexpr` would help, but it didn't make much of a difference.
The thing with the STL is that it can't make as many assumptions as you can when you are also the one consuming the class. The STL uses references everywhere, whereas we know the heap contains a cheaply copyable type, so I can omit the destructors and just copy the values around. I also implemented the pop-then-push optimization, except detected with a boolean instead of exposed as the usual single method. I could also decide to ignore IEEE 754 subtleties and just bit_cast the floats to integers for comparison.
Looking through what I did, there are so many new assumptions that it is likely no longer portable. I would need to write this properly if it ever needs to be generalized. Having the size constant at least allows the compiler to unroll loops. Even when it doesn't unroll, it can index the images faster by encoding the multiplicative constant inside the instruction instead of loading it from a register, and it's faster still when the size is a power of 2.
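A minimal illustration of the constant-stride point, assuming the 44x64 heatmap dimensions from the logs (the `at` helper is hypothetical, not iptsd code):

```cpp
#include <cstddef>

// With the column count known at compile time, y*COLS + x lets the compiler
// fold the multiply into the addressing mode, or into a shift when COLS is a
// power of two, instead of keeping a runtime stride in a register.
constexpr std::size_t COLS = 64; // power of two
constexpr std::size_t ROWS = 44;

inline float &at(float (&img)[ROWS * COLS], std::size_t y, std::size_t x) {
    return img[y * COLS + x]; // y*64 compiles down to y<<6
}
```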
Not all the benefit from arrays comes from removed access checks, though they were a major part: I saw heap and image functions in perf before, and now they are gone. Access checks are indeed something that will block autovectorization. But I was thinking of something else: memory layout matters a lot. Part of the improvement in this PR comes from changing the array-of-structures `alg::wdt::QItem<f32> data[QUEUE_MAX]` into the struct-of-arrays `index_t ind[QUEUE_MAX]; i32 cost[QUEUE_MAX];`. Yes, I tested without the i32 change, and the memory layout was a more noticeable improvement than the data type. Why this happens probably comes down to the processor's prefetching and caching.
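A sketch of that layout change, with illustrative names (`QItemAoS`, `QueueSoA`) rather than the real iptsd declarations:

```cpp
#include <cstddef>
#include <cstdint>

using index_t = std::uint16_t;
constexpr std::size_t QUEUE_MAX = 64 * 44 * 4; // illustrative capacity

// Array-of-structures: index and cost are interleaved, so a scan over
// just the costs drags the unused indices through the cache too.
struct QItemAoS {
    index_t ind;
    float cost;
};
QItemAoS queue_aos[QUEUE_MAX];

// Structure-of-arrays: the hot field (cost) is densely packed, which is
// friendlier to prefetching and to autovectorization.
struct QueueSoA {
    index_t ind[QUEUE_MAX];
    std::int32_t cost[QUEUE_MAX];
};
QueueSoA queue_soa;

// Example hot loop: position of the smallest cost among the first n entries.
std::size_t argmin(const QueueSoA &q, std::size_t n) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < n; ++i)
        if (q.cost[i] < q.cost[best]) best = i;
    return best;
}
```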
When I thought of replacing `std::vector` with arrays, it wasn't because of the `malloc` time. It was because the data should be stored in one contiguous location, not 10+ places scattered across memory. Elsewhere this kind of fragmentation is noticeable with filesystems backed by rotational hard drives. In memory it doesn't usually matter, until we get to high-performance applications like this one. I once failed a question in a programming contest when I used a linked list instead of an array; others' array solutions were an order of magnitude faster even for sequential access. However, most of the improvement explored here comes from the queue, and I could not measure any difference from the arrays, though the idea may be useful for future vectorization.
Another thing I experimented with was bucket-sorting the queued pixels. This was intended to address later pixels needing to be managed in the heap when we might already know they will be rejected. It seemed like I gained back the 300μs I lost with another change, but further tests were inconclusive.
50% of the time is spent inside `evaluate`. It branches heavily and I can't vectorize it. I think I've now reached the limit of this type of optimization, as I don't see how to improve that function.
> The strange thing is that inlining isn't always a good thing.
From what I know, that could be a cache thing? Inlining usually makes code larger, so there's a balance between function-call overhead and fitting the most frequently used code into the cache. Even if a function is not inlined, it's likely in the cache, so it's reasonably quick to call.
> The thing with STL is that they can't make as many assumptions as you can when you're the same person consuming the class. STL uses references everywhere, where we know the heap contains a copyable type.
Right, that's still a big performance improvement. Way bigger than I'd have thought.
> Part of the improvement in this PR is from changing array-of-structures `alg::wdt::QItem<f32> data[QUEUE_MAX]` into struct-of-arrays `index_t ind[QUEUE_MAX]; i32 cost[QUEUE_MAX];`
Ah yes, I suppose that makes sense.
> When I thought of replacing `std::vector` with arrays, it wasn't because of the `malloc` time. It was because the data should be stored in one contiguous location, not 10+ places scattered across memory.
Right, I'd have thought that this would have a pretty minimal effect, since the `std::vector` data itself is contiguous. Vector vs. linked list, definitely. But I guess it might make some sense if we have a bunch of small vectors.
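A small sketch of the "bunch of small vectors" concern: one vector per row means one heap allocation per row, while a single flat buffer with a stride keeps everything in one contiguous block. The `Rows` type is hypothetical, just for illustration.

```cpp
#include <cstddef>
#include <vector>

// One allocation for all rows, indexed by a stride, instead of
// std::vector<std::vector<float>> with one allocation per row.
struct Rows {
    std::vector<float> data;
    std::size_t cols;
    float *row(std::size_t r) { return data.data() + r * cols; }
};

Rows make_rows(std::size_t rows, std::size_t cols) {
    return Rows{std::vector<float>(rows * cols, 0.0f), cols};
}
```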
> 50% of the time is spent inside `evaluate`. It is heavily branching and I can't vectorize it. Now I think I've reached the limit of this type of optimization as I don't see how to improve that function.
Yeah, that's a bit of a problem with the wdt... which I think is also more or less the core problem of the whole "advanced" processor.
Also: Nice work.
Branch misprediction is making a lot of difference here. The advanced detector is printed in red under "Bad Speculation":
$ ./perf stat ./iptsd-perf ../iptsdump.dat
[18:49:09.246] [info] Vendor: 045E
[18:49:09.246] [info] Product: 099F
[18:49:09.246] [info] Buffer Size: 7487
[18:49:09.246] [info] Metadata:
[18:49:09.246] [info] rows=44, columns=64
[18:49:09.246] [info] width=25978, height=17319
[18:49:09.246] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[18:49:09.246] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[18:49:12.125] [info] Ran 2010 times
[18:49:12.125] [info] Total: 2866982μs
[18:49:12.125] [info] Mean: 1426.36μs
[18:49:12.125] [info] Standard Deviation: 469.11μs
[18:49:12.125] [info] Minimum: 107.627μs
[18:49:12.125] [info] Maximum: 2778.798μs
Performance counter stats for './iptsd-perf ../iptsdump.dat':
2881.99 msec task-clock # 0.999 CPUs utilized
70 context-switches # 24.289 /sec
1 cpu-migrations # 0.347 /sec
1287 page-faults # 446.566 /sec
10736721941 cycles # 3.725 GHz
29968450814 instructions # 2.79 insn per cycle
5295049467 branches # 1.837 G/sec
63390337 branch-misses # 1.20% of all branches
53556029135 slots # 18.583 G/sec
29403310113 topdown-retiring # 54.9% Retiring
12181371332 topdown-bad-spec # 22.7% Bad Speculation
7770874815 topdown-fe-bound # 14.5% Frontend Bound
4233573334 topdown-be-bound # 7.9% Backend Bound
2.883487328 seconds time elapsed
2.878614000 seconds user
0.003998000 seconds sys
Compare the basic detector, which has fewer mispredictions and is red at "Backend Bound":
$ ./perf stat ./iptsd-perf ../iptsdump.dat
[18:56:15.989] [info] Vendor: 045E
[18:56:15.989] [info] Product: 099F
[18:56:15.989] [info] Buffer Size: 7487
[18:56:15.989] [info] Metadata:
[18:56:15.989] [info] rows=44, columns=64
[18:56:15.989] [info] width=25978, height=17319
[18:56:15.989] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[18:56:15.989] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[18:56:16.043] [info] Ran 2010 times
[18:56:16.043] [info] Total: 40777μs
[18:56:16.043] [info] Mean: 20.29μs
[18:56:16.043] [info] Standard Deviation: 5.88μs
[18:56:16.043] [info] Minimum: 19.088μs
[18:56:16.043] [info] Maximum: 168.057μs
Performance counter stats for './iptsd-perf ../iptsdump.dat':
56.94 msec task-clock # 0.992 CPUs utilized
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
1088 page-faults # 19.107 K/sec
198528743 cycles # 3.487 GHz
539366736 instructions # 2.72 insn per cycle
119000517 branches # 2.090 G/sec
479160 branch-misses # 0.40% of all branches
988875930 slots # 17.366 G/sec
500254882 topdown-retiring # 50.6% Retiring
62047117 topdown-bad-spec # 6.3% Bad Speculation
120216289 topdown-fe-bound # 12.2% Frontend Bound
306357641 topdown-be-bound # 31.0% Backend Bound
0.057411960 seconds time elapsed
0.053407000 seconds user
0.004108000 seconds sys
Or mine which is also "Backend Bound" at 11μs (I think I need to implement the better algorithm I read about to get it to 7μs):
$ ./perf stat ./engine-avx512
[(123, 642), (2736, 1824), (2736, 1824), (2736, 1824), (2736, 1824), (2736, 1824), (2736, 1824), (2736, 1824), (2736, 1824), (2736, 1824)] 1000000 11326ms
Performance counter stats for './engine-avx512':
11317.15 msec task-clock # 0.999 CPUs utilized
259 context-switches # 22.886 /sec
5 cpu-migrations # 0.442 /sec
92 page-faults # 8.129 /sec
42191550072 cycles # 3.728 GHz
44069120800 instructions # 1.04 insn per cycle
2872964956 branches # 253.859 M/sec
4419970 branch-misses # 0.15% of all branches
210629936170 slots # 18.612 G/sec
96881314177 topdown-retiring # 45.2% Retiring
12389996245 topdown-bad-spec # 5.8% Bad Speculation
10884461496 topdown-fe-bound # 5.1% Frontend Bound
94138491835 topdown-be-bound # 43.9% Backend Bound
11.327838952 seconds time elapsed
11.314512000 seconds user
0.003996000 seconds sys
EDIT: Actually, profiling points to 17% iptsd::contact::neutral, 13% range checks, 18% iptsd::ipts::Reader::read, and 24% iptsd::contacts::ContactFinder::track (mostly 17% std::min_element). Since everything is well below 1ms, I don't feel a need to optimize the basic detector here.
Out of curiosity, which compiler are you using? I noticed that clang seems to produce faster code on my desktop than gcc.
With the sample data I am using, the mean times for the advanced processor look like this:
1442.50μs
1334.05μs
1253.75μs
1025.66μs
On different sample data, the difference has been up to a millisecond.
Currently iptsd enables the libstdc++ access checks (`-D_GLIBCXX_ASSERTIONS`) by default, because they have almost no impact on the basic detector and the additional checks are nice to have. A very simple optimization could be to build our packages with clang and to always disable the access checks for the advanced touch processor:
```diff
diff --git a/src/contacts/advanced/detector.cpp b/src/contacts/advanced/detector.cpp
index 550827a..e939ebe 100644
--- a/src/contacts/advanced/detector.cpp
+++ b/src/contacts/advanced/detector.cpp
@@ -1,3 +1,7 @@
+#ifdef _GLIBCXX_ASSERTIONS
+#undef _GLIBCXX_ASSERTIONS
+#endif
+
 #include "detector.hpp"
 #include "../neutral.hpp"
```
I am on gcc. The `constexpr` m_kern_pp in detector.cpp doesn't even compile on clang; I had to s/constexpr/const there. As you suggested, I ran `CC=clang CXX=clang++ meson --wipe build` and reran it on 9e50f9e70a3ebcccbd1c4dbc3514b8279f548b96:
[07:20:47.375] [info] Vendor: 045E
[07:20:47.375] [info] Product: 099F
[07:20:47.375] [info] Buffer Size: 7487
[07:20:47.375] [info] Metadata:
[07:20:47.375] [info] rows=44, columns=64
[07:20:47.375] [info] width=25978, height=17319
[07:20:47.375] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[07:20:47.375] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[07:20:49.224] [info] Ran 2010 times
[07:20:49.224] [info] Total: 1839020μs
[07:20:49.224] [info] Mean: 914.94μs
[07:20:49.224] [info] Standard Deviation: 280.80μs
[07:20:49.224] [info] Minimum: 125.850μs
[07:20:49.224] [info] Maximum: 1468.382μs
The compiler does indeed produce faster binaries. I applied your bounds check patch and here is the output:
[07:25:13.548] [info] Vendor: 045E
[07:25:13.548] [info] Product: 099F
[07:25:13.548] [info] Buffer Size: 7487
[07:25:13.548] [info] Metadata:
[07:25:13.548] [info] rows=44, columns=64
[07:25:13.548] [info] width=25978, height=17319
[07:25:13.548] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[07:25:13.548] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[07:25:15.390] [info] Ran 2010 times
[07:25:15.390] [info] Total: 1831444μs
[07:25:15.390] [info] Mean: 911.17μs
[07:25:15.390] [info] Standard Deviation: 278.14μs
[07:25:15.390] [info] Minimum: 124.460μs
[07:25:15.390] [info] Maximum: 1730.592μs
One problem with clang is that, unlike gcc, it doesn't show function names for inlined functions. This is the perf after that patch.
I checked out master at 8624107747a91d9f43e84f62f27a953c71d1a78f. It did not build on clang until I replaced dump.cpp with `int main() {}`. Here is the unpatched output:
[07:33:06.562] [info] Vendor: 045E
[07:33:06.562] [info] Product: 099F
[07:33:06.562] [info] Buffer Size: 7487
[07:33:06.562] [info] Metadata:
[07:33:06.562] [info] rows=44, columns=64
[07:33:06.562] [info] width=25978, height=17319
[07:33:06.562] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[07:33:06.562] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[07:33:09.674] [info] Ran 2010 times
[07:33:09.674] [info] Total: 3102096μs
[07:33:09.674] [info] Mean: 1543.33μs
[07:33:09.674] [info] Standard Deviation: 475.15μs
[07:33:09.674] [info] Minimum: 189.010μs
[07:33:09.674] [info] Maximum: 2446.472μs
Here is the patched output:
[07:34:25.648] [info] Vendor: 045E
[07:34:25.648] [info] Product: 099F
[07:34:25.648] [info] Buffer Size: 7487
[07:34:25.648] [info] Metadata:
[07:34:25.648] [info] rows=44, columns=64
[07:34:25.648] [info] width=25978, height=17319
[07:34:25.648] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[07:34:25.648] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[07:34:28.322] [info] Ran 2010 times
[07:34:28.322] [info] Total: 2663423μs
[07:34:28.322] [info] Mean: 1325.09μs
[07:34:28.322] [info] Standard Deviation: 420.22μs
[07:34:28.322] [info] Minimum: 141.820μs
[07:34:28.322] [info] Maximum: 2730.163μs
It still takes too long in `std::priority_queue::push`. The STL can't break apart the object or apply the pop-then-push optimization. Here is the perf graph for that:
Environment:
home@daniel-desktop3:~$ gcc --version
gcc (Ubuntu 12.2.0-3ubuntu1) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
home@daniel-desktop3:~$ g++ --version
g++ (Ubuntu 12.2.0-3ubuntu1) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
home@daniel-desktop3:~$ clang --version
Ubuntu clang version 15.0.7
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
home@daniel-desktop3:~$ clang++ --version
Ubuntu clang version 15.0.7
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
home@daniel-desktop3:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.10
Release: 22.10
Codename: kinetic
home@daniel-desktop3:~$ uname -a
Linux daniel-desktop3 5.19.0-31-generic #32-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 20 15:20:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
My testing data is iptsdump.dat.zip and my testing config is iptsd.conf.txt. Why is there a priority_queue in the first place? It means this is a branch-heavy algorithm and that we have been targeting the average case instead of the worst case. This shows up as the CPU skyrocketing to 100% when I use all 10 fingers. I might someday consider implementing https://github.com/qzed/digitizer-prototype branchlessly to see if that works.
> Why is there a priority_queue in the first place?
Mostly because I wasn't sure how to implement it without one, other than iterating some N times over the whole image for a potentially large N (an easy approximation would be limiting that N).
The general idea here is that we first extract "core" components, where you know that those are separate touch points. Extracting those reliably unfortunately drops some of the surrounding area, and ideally we'd like to include that because we want to properly filter out any non-finger touches. The real problem here is that the whole Gaussian fitting doesn't work well if a) there are non-Gaussian-like shapes or b) some residual values from improperly filtering out non-finger touches remain (which then skew the results).
So we essentially start off with those "core" components, then do a weighted distance transform from them to include the surrounding areas, until we have fully broken down the whole sensor image into its individual components. Note that some components can be overlapping, which is where the costs of the WDT help. So what the WDT does is essentially compute the smallest "weighted distance" from each surrounding pixel to its nearest component (and also the label of that component). The queue is a fairly naive implementation of this, I guess, where you just get the next candidate pixel with the smallest cost, write it down, and add its non-evaluated neighbors to the queue.
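That description can be sketched as a Dijkstra-style loop; everything below (the names, the 4-neighborhood, the per-pixel cost function) is illustrative, not the actual iptsd implementation.

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <queue>
#include <vector>

struct Cand {
    float cost;
    int pixel;
    int label;
    bool operator>(const Cand &o) const { return cost > o.cost; }
};

// Weighted distance transform: each pixel receives the smallest weighted
// distance to a seed component, plus that component's label.
void wdt(int w, int h,
         const std::vector<float> &step_cost, // hypothetical per-step weight
         const std::vector<int> &seeds,       // seed pixel -> label, -1 = none
         std::vector<float> &dist,
         std::vector<int> &label) {
    dist.assign(w * h, std::numeric_limits<float>::infinity());
    label.assign(w * h, -1);

    std::priority_queue<Cand, std::vector<Cand>, std::greater<Cand>> q;
    for (std::size_t p = 0; p < seeds.size(); ++p)
        if (seeds[p] >= 0) {
            dist[p] = 0.0f;
            label[p] = seeds[p];
            q.push({0.0f, static_cast<int>(p), seeds[p]});
        }

    const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
    while (!q.empty()) {
        Cand c = q.top(); q.pop();
        if (c.cost > dist[c.pixel]) continue; // stale queue entry
        int x = c.pixel % w, y = c.pixel / w;
        for (int d = 0; d < 4; ++d) {
            int nx = x + dx[d], ny = y + dy[d];
            if (nx < 0 || nx >= w || ny < 0 || ny >= h) continue;
            int np = ny * w + nx;
            float nc = c.cost + step_cost[np];
            if (nc < dist[np]) { // relax neighbor
                dist[np] = nc;
                label[np] = c.label;
                q.push({nc, np, c.label});
            }
        }
    }
}
```

The data-dependent branching the discussion keeps running into lives in the `if (nc < dist[np])` relaxation and in the priority queue itself.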
I'm not really sure how to implement that without too much branching. As I've said, the only other idea I had is an iterative approach that runs across the full image a couple of times. Theoretically, that N should be quite large (essentially the longest possible path between two pixels), but in practice you can probably reduce it quite a bit. Still, there's lots of "empty" space in each iteration, and the N is another hyperparameter to tune. This is probably how I'd have implemented it if I'd written it for a GPU.
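The iterative alternative might look roughly like this; `wdt_iterative` and its conventions (seeds pre-written into `dist`/`label`, a 4-neighborhood) are assumptions for illustration. It is branch-light per pixel and maps well to GPUs, at the price of redundant full-image sweeps.

```cpp
#include <limits>
#include <vector>

// Jacobi-style sweeps: relax every pixel from its neighbors, up to n_iters
// times. Caller initializes dist to 0 at seeds (with their labels) and to
// infinity elsewhere.
void wdt_iterative(int w, int h, int n_iters,
                   const std::vector<float> &step_cost,
                   std::vector<float> &dist,
                   std::vector<int> &label) {
    const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
    for (int it = 0; it < n_iters; ++it) {
        bool changed = false;
        for (int y = 0; y < h; ++y) {
            for (int x = 0; x < w; ++x) {
                int p = y * w + x;
                for (int d = 0; d < 4; ++d) {
                    int nx = x + dx[d], ny = y + dy[d];
                    if (nx < 0 || nx >= w || ny < 0 || ny >= h) continue;
                    int np = ny * w + nx;
                    float c = dist[np] + step_cost[p];
                    if (c < dist[p]) {
                        dist[p] = c;
                        label[p] = label[np];
                        changed = true;
                    }
                }
            }
        }
        if (!changed) break; // converged before the worst-case N
    }
}
```

The early-out on `changed` is one way to avoid tuning N exactly, though it reintroduces a (cheap, once-per-sweep) branch.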
One thing that I thought about ages ago was that it should be possible to: a) get the core components, b) compute a bounding box around them (just take 5 px on each side or something like that), and then process those smaller images. The "halo" around the components should be large enough to contain any other intersecting component labels, otherwise it won't be able to properly filter things out. It might also be possible to do some labeling before that with a lower threshold to extract the larger connected components and then just work on those (to ensure that intersecting components end up in the same extracted frame).
I ran the same test on my Surface, this time using your data and config file, @danielzgtg:
env CC=gcc CXX=g++ meson setup build
[14:41:29.630] [info] Ran 2010 times
[14:41:29.630] [info] Total: 6666318μs
[14:41:29.630] [info] Mean: 3316.58μs
[14:41:29.630] [info] Standard Deviation: 1113.44μs
[14:41:29.630] [info] Minimum: 234.274μs
[14:41:29.630] [info] Maximum: 6310.144μs
env CC=gcc CXX=g++ meson setup build -Daccess_checks=false
[14:42:18.627] [info] Ran 2010 times
[14:42:18.627] [info] Total: 5806799μs
[14:42:18.627] [info] Mean: 2888.95μs
[14:42:18.627] [info] Standard Deviation: 994.28μs
[14:42:18.627] [info] Minimum: 134.039μs
[14:42:18.627] [info] Maximum: 6440.459μs
env CC=clang CXX=clang++ meson setup build
[14:41:54.287] [info] Ran 2010 times
[14:41:54.287] [info] Total: 5546825μs
[14:41:54.287] [info] Mean: 2759.61μs
[14:41:54.287] [info] Standard Deviation: 855.63μs
[14:41:54.287] [info] Minimum: 357.218μs
[14:41:54.287] [info] Maximum: 8571.065μs
env CC=clang CXX=clang++ meson setup build -Daccess_checks=false
[14:42:43.315] [info] Ran 2010 times
[14:42:43.316] [info] Total: 4434157μs
[14:42:43.316] [info] Mean: 2206.05μs
[14:42:43.316] [info] Standard Deviation: 686.58μs
[14:42:43.316] [info] Minimum: 244.878μs
[14:42:43.316] [info] Maximum: 4347.347μs
Since the advanced algorithm is gone now, I am going to close this. Nonetheless, thank you for your work on optimizing it!
This is surprising. I thought I could only gain 5%; nobody, not even I, thought this would happen. Instead we gain another 1000μs, giving a 50% improvement.
This PR isn't planned to be merged any time soon, unless someone really insists; the fact that I didn't bother cleaning up the code shows this. The purpose of this PR is just to show that it's possible to get improvements of this size without changing the algorithm.
Looking at the perf graph, `MyQueue::{push,pop}` is 20%, so this seems to be the end of the road for this type of optimization. I had been doing architecture-preserving, function-only optimizations that didn't touch the globally-passed-around data structures much. As I mentioned in the other PR, we might need to consider fixed-length data structures that trade flexibility for performance.

CPU usage dropped from 45% to 36% in iptsd, but still spikes to 100% when using 10 fingers. The next step I'm thinking of is trying to remove recursion/branching, but that might greatly change the essence of the algorithms.
Before
See #113
After