linux-surface / iptsd

Userspace daemon for Intel Precise Touch & Stylus
GNU General Public License v2.0

Optimize advanced detector by 34% in perf #113

Closed danielzgtg closed 1 year ago

danielzgtg commented 1 year ago

This PR improves the running time of the advanced detector algorithm by 34%, from roughly 3000μs to 2000μs per frame. This was measured by running iptsd-perf under Linux perf.

I ran the profiler on iptsd-perf and found it strange that 25% of the time was spent in std::hypotf. Looking at the code, I found that the function was only ever called with constant arguments. I tried adding inline to get_cost, but the compiler still wouldn't inline it, so I forced the compiler to inline and pre-compute everything using template parameters and constexpr. This improved performance by 20%.

std::reference_wrapper is used in the class desugared from the lambda, as requested. It initially caused a small regression, until I did the following: in the disassembly I saw a lot of calls related to std::vector, so I concluded that bounds checks were taking a lot of time. I applied the common::unchecked pattern to this lambda and the nearby ones, which improved performance by a further 17%.

These measurements were done with iptsd-perf on a separate machine. On the actual Surface Pro 7 device, running the actual iptsd daemon, CPU usage went from around 50%+ before to around 45% after. I would have investigated that discrepancy if I had a working perf setup there: it's the same 34% improvement in perf, but only about 10% in iptsd. The actual numbers follow:

On-device numbers

```
# 0 commits
[03:19:06.555] [info] Vendor: 045E
[03:19:06.555] [info] Product: 099F
[03:19:06.555] [info] Buffer Size: 7487
[03:19:06.555] [info] Metadata:
[03:19:06.555] [info] rows=44, columns=64
[03:19:06.555] [info] width=25978, height=17319
[03:19:06.555] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[03:19:06.555] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[03:19:14.614] [info] Ran 2010 times
[03:19:14.614] [info] Total: 8042632μs
[03:19:14.614] [info] Mean: 4001.31μs
[03:19:14.614] [info] Standard Deviation: 1399.72μs
[03:19:14.614] [info] Minimum: 181.138μs
[03:19:14.614] [info] Maximum: 7771.524μs

# 1 commit
[03:17:18.905] [info] Vendor: 045E
[03:17:18.905] [info] Product: 099F
[03:17:18.905] [info] Buffer Size: 7487
[03:17:18.905] [info] Metadata:
[03:17:18.905] [info] rows=44, columns=64
[03:17:18.905] [info] width=25978, height=17319
[03:17:18.905] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[03:17:18.905] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[03:17:25.786] [info] Ran 2010 times
[03:17:25.786] [info] Total: 6860737μs
[03:17:25.786] [info] Mean: 3413.30μs
[03:17:25.786] [info] Standard Deviation: 1174.28μs
[03:17:25.786] [info] Minimum: 180.452μs
[03:17:25.786] [info] Maximum: 6039.696μs

# commits
[03:16:14.245] [info] Vendor: 045E
[03:16:14.245] [info] Product: 099F
[03:16:14.245] [info] Buffer Size: 7487
[03:16:14.245] [info] Metadata:
[03:16:14.245] [info] rows=44, columns=64
[03:16:14.245] [info] width=25978, height=17319
[03:16:14.245] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[03:16:14.245] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[03:16:19.668] [info] Ran 2010 times
[03:16:19.668] [info] Total: 5399899μs
[03:16:19.668] [info] Mean: 2686.52μs
[03:16:19.668] [info] Standard Deviation: 906.13μs
[03:16:19.668] [info] Minimum: 177.923μs
[03:16:19.668] [info] Maximum: 4784.426μs
```

Before

/home/home/CLionProjects/iptsd/build/src/debug/iptsd-perf ./iptsdump.dat
[08:03:23.027] [info] Vendor:       045E
[08:03:23.027] [info] Product:      099F
[08:03:23.027] [info] Buffer Size:  7487
[08:03:23.027] [info] Metadata:
[08:03:23.027] [info] rows=44, columns=64
[08:03:23.027] [info] width=25978, height=17319
[08:03:23.027] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[08:03:23.027] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[08:03:29.048] [info] Ran 2010 times
[08:03:29.048] [info] Total: 6011079μs
[08:03:29.048] [info] Mean: 2990.59μs
[08:03:29.048] [info] Standard Deviation: 1045.43μs
[08:03:29.048] [info] Minimum: 147.510μs
[08:03:29.048] [info] Maximum: 5385.695μs

image

After

[22:20:12.187] [info] Vendor:       045E
[22:20:12.187] [info] Product:      099F
[22:20:12.187] [info] Buffer Size:  7487
[22:20:12.187] [info] Metadata:
[22:20:12.187] [info] rows=44, columns=64
[22:20:12.187] [info] width=25978, height=17319
[22:20:12.187] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[22:20:12.187] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[22:20:16.228] [info] Ran 2010 times
[22:20:16.228] [info] Total: 4031356μs
[22:20:16.228] [info] Mean: 2005.65μs
[22:20:16.228] [info] Standard Deviation: 674.57μs
[22:20:16.228] [info] Minimum: 144.730μs
[22:20:16.228] [info] Maximum: 3357.086μs

image

qzed commented 1 year ago

Ah, that's a good catch. Somewhat disappointing that it didn't inline/pre-compute it, especially since it was a lambda...

One note though: I'm not sure how much time we should spend on optimizing the "advanced" processor. I think it's unlikely that we can get it down to a reasonable performance level without changing up the processing. It's probably a better idea to look into how we can make the basic processing more robust/improve it.

StollD commented 1 year ago

> Somewhat disappointing that it didn't inline/pre-compute it, especially since it was a lambda...

Maybe the reason for that is that the weighted_distance_transform function is not marked as inline? It separates the lambda from the (inline) function that calls it.

qzed commented 1 year ago

I don't think that should be an issue. Although I'm not entirely sure what the standard prescribes, the lambda has a unique type, so weighted_distance_transform() should get a dedicated instantiation for that lambda (and the compiler should be able to deduce that this instantiation won't be reused for any other function, due to the type uniqueness). The problem is that the compiler might treat the lambda as a non-trivial function and decline to inline it, which might be why it misses out on the optimizations. So maybe one could try to force-inline the lambda, but I'm not sure whether that would actually be beneficial in the end...

danielzgtg commented 1 year ago

I just switched to a new branch and made a few test edits. It is still 3000μs after putting [[gnu::always_inline]] inline in front of every single function in distance_transform.hpp; putting it only in front of weighted_distance_transform() gives the same result.

I was thinking it's because hypotf isn't constexpr, but turns out the compiler [can still inline that](https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename:'1',fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:2,endLineNumber:10,positionColumn:2,positionLineNumber:10,selectionStartColumn:2,selectionStartLineNumber:10,startColumn:2,startLineNumber:10),source:'%23include+%3Ccmath%3E%0A%23include+%3Ccstdio%3E%0A%0Afloat+test(int+x,+int+y)+%7B%0A++++return+hypotf(x,+y)%3B%0A%7D%0A%0Aint+main()+%7B%0A++++printf(%22%25f%22,+test(1,+2))%3B%0A%7D'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:g122,deviceViewOpen:'1',filters:(b:'0',binary:'1',binaryObject:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O2',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'+x86-64+gcc+12.2+(Editor+%231)',t:'0')),k:50,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4). I then replaced -> f32 in the lambda with __attribute__((always_inline)) -> f32 and got a lesser improvement:

/home/home/CLionProjects/iptsd/build/src/debug/iptsd-perf ./iptsdump.dat
[20:17:45.599] [info] Vendor:       045E
[20:17:45.599] [info] Product:      099F
[20:17:45.599] [info] Buffer Size:  7487
[20:17:45.599] [info] Metadata:
[20:17:45.599] [info] rows=44, columns=64
[20:17:45.599] [info] width=25978, height=17319
[20:17:45.599] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[20:17:45.599] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[20:17:51.037] [info] Ran 2010 times
[20:17:51.037] [info] Total: 5427634μs
[20:17:51.037] [info] Mean: 2700.32μs
[20:17:51.037] [info] Standard Deviation: 933.69μs
[20:17:51.037] [info] Minimum: 146.740μs
[20:17:51.037] [info] Maximum: 4833.335μs

Disassembly of that shows two references to sub_6870, a stub that jumps to hypotf@GOT. The irrelevant one is in _ZN5iptsd5debug4perfL4mainEN3gsl4spanIPcLm18446744073709551615EEE, which seems to be CLI::Option's internals. The other is in _ZN5iptsd8contacts8advanced3alg4conv7kernels8gaussianIfLi5ELi5EEENS_9container6KernelIT_XT0_EXT1_EEES8_, which is in the advanced detection, but it is far from a bottleneck: it barely shows as a vertical line in the perf graph.

This is further confirmed by seeing the values after vaddss vmovss vfmadd231ss vmovss vaddss vminss. That is the dist == 0 case, while the dist == sqrt(2) assembly looks more complicated (loading a float into a register, immediately saving it to the stack, calling a function, then loading the same value back from the stack for no good reason), so it might not have been optimized properly. Inlining therefore seems to help by removing that function call, but it fails to explain the other 10% improvement that the desugared template function gave us.

Disassembly also reveals that the other lambdas below it, such as _ZZN5iptsd8contacts8advanced12BlobDetector7processERKNS_9container5ImageIfEEENKUliE5_clEi, aren't getting inlined either, possibly because of std::vector bounds checks.

I agree that the compiler had trouble with the complexity. There are 43 calls to get_cost, with 8 unique values for the index2_t d argument. The compiler may have run out of budget during its optimization passes to realize that those 8 values simplify to just 2 constants in the end. On top of that, evaluate adds another layer of indirection the compiler needs to see through. The template parameters work here by forcing the compiler to confront exactly what it needs to know at compile time.

danielzgtg commented 1 year ago

I changed the raw pointer (*) to std::reference_wrapper. For some reason that brought it to 2500μs, a 100μs regression, though still faster than the original. It's strange, because it's still the same vector and the reference_wrapper class doesn't do much.

I found unchecked() in access.hpp; it is already used in other parts of this advanced detector. In the assembly I did see calls related to std::vector. Using unchecked() for the array accesses gains another 3% and reaches 2320μs:

/home/home/CLionProjects/iptsd/build/src/debug/iptsd-perf ./iptsdump.dat
[21:47:32.988] [info] Vendor:       045E
[21:47:32.988] [info] Product:      099F
[21:47:32.988] [info] Buffer Size:  7487
[21:47:32.988] [info] Metadata:
[21:47:32.988] [info] rows=44, columns=64
[21:47:32.988] [info] width=25978, height=17319
[21:47:32.988] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[21:47:32.988] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[21:47:37.656] [info] Ran 2010 times
[21:47:37.656] [info] Total: 4657879μs
[21:47:37.656] [info] Mean: 2317.35μs
[21:47:37.656] [info] Standard Deviation: 789.74μs
[21:47:37.656] [info] Minimum: 147.491μs
[21:47:37.656] [info] Maximum: 3913.753μs
danielzgtg commented 1 year ago

So I added [[gnu::always_inline]] to the new class method, which improved it by 40μs:

/home/home/CLionProjects/iptsd/build/src/debug/iptsd-perf ./iptsdump.dat
[22:16:51.273] [info] Vendor:       045E
[22:16:51.273] [info] Product:      099F
[22:16:51.273] [info] Buffer Size:  7487
[22:16:51.273] [info] Metadata:
[22:16:51.273] [info] rows=44, columns=64
[22:16:51.273] [info] width=25978, height=17319
[22:16:51.273] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[22:16:51.273] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[22:16:55.835] [info] Ran 2010 times
[22:16:55.835] [info] Total: 4551187μs
[22:16:55.835] [info] Mean: 2264.27μs
[22:16:55.835] [info] Standard Deviation: 773.17μs
[22:16:55.835] [info] Minimum: 150.530μs
[22:16:55.835] [info] Maximum: 4426.569μs

I looked at the other nearby lambdas. We know from the disassembly that there are a lot of function calls related to std::vector, so I applied the same common::unchecked pattern. __attribute__((always_inline)) there is a 50μs pessimization, so I did not include it. The final code gains a further 250μs:

/home/home/CLionProjects/iptsd/build/src/debug/iptsd-perf ./iptsdump.dat
[22:20:12.187] [info] Vendor:       045E
[22:20:12.187] [info] Product:      099F
[22:20:12.187] [info] Buffer Size:  7487
[22:20:12.187] [info] Metadata:
[22:20:12.187] [info] rows=44, columns=64
[22:20:12.187] [info] width=25978, height=17319
[22:20:12.187] [info] transform=[-412.3492,0,25978,0,-402.76746,17319]
[22:20:12.187] [info] unknown=1, [178,182,180,1,178,182,180,1,90,171,100,20,172,177,175,2]
[22:20:16.228] [info] Ran 2010 times
[22:20:16.228] [info] Total: 4031356μs
[22:20:16.228] [info] Mean: 2005.65μs
[22:20:16.228] [info] Standard Deviation: 674.57μs
[22:20:16.228] [info] Minimum: 144.730μs
[22:20:16.228] [info] Maximum: 3357.086μs
danielzgtg commented 1 year ago

> One note though: I'm not sure how much time we should spend on optimizing the "advanced" processor. I think it's unlikely that we can get it down to a reasonable performance level without changing up the processing. It's probably a better idea to look into how we can make the basic processing more robust/improve it.

I agree. I think I'm done with the optimizations for this PR; further ones will probably target other modules. One thing I really do need to do is look at the math and algorithms used, to see what I can do for the basic processing.

@qzed There is one other thing I can see without changing up the processing: 5%+ of the time is spent in priority_queue and vector, so maybe 5% could be gained by switching those to something like boost::static_vector. They could be fixed-size instead of dynamically allocated, but it would have taken me some time to debug the heapify-up/heapify-down logic if I had to implement it myself. Besides, I'm considering something else instead.

The algorithms I wrote a few years ago were written in a certain way. They are more or less the same as the basic detector, and like it, their output has quality problems. They were very fast, though (and I have read about an even faster algorithm for one of their parts). Part of the reason they were so fast is that they were originally written for the GPU, which makes the code branchless, preallocated, cache-friendly, aligned, vectorizable, and constant-time. I might investigate whether giving other modules these characteristics would improve them.

StollD commented 1 year ago

Thank you!

qzed commented 1 year ago

> There is one other thing I see without changing up the processing. 5%+ is spent on priority_queue and vector, so maybe 5% could have been gained by changing those to be based on something like boost::static_vector. They could have been made fixed instead of dynamically allocated, but it would have taken me some time to debug heapifying up and down if I had to implement that myself.

I think allocations should only happen sparingly, and less so as time goes on: all dynamic data structures are preallocated in the constructor. After that, reallocation is only required if they grow beyond those initial bounds. Nothing is deallocated or reduced in size, so at some point there shouldn't really be any new allocations. It's possible that increasing the initially reserved memory helps; I chose those values mostly by what seemed right and didn't do any testing.