Porting my AVX512 SP7 algorithm

danielzgtg commented 1 year ago

I have ported my GPU heatmap algorithm to CPU AVX512. It has gone from 11ms to 0.011534ms which is as high as 86700.19 FPS. This is benchmarked by 11534ms per 1000000 iterations. The AVX512 provides the focus on power efficiency and AVX512 does not incur a multithreaded penalty on SP7 IceLake. It also runs in O(1) time which may be important for security.

Unlike the old GPU code which was strongly coupled to vulkano and Rust, the new code can easily be ported into C/C++ iptsd. The relevant part is https://github.com/danielzgtg/ipts-rust/blob/d8f281c97ef083b32c0f49c05274683e7f9e9eaf/engine-avx512/src/lib.rs which can be pretty much 1-to-1 copied-and-pasted into a .c file.

I can perform the port if you like. The thing is that although the iptsd version on apt works, the new version either gives me any of the following errors:

home@daniel-tablet1:~/CLionProjects/iptsd$ build/src/daemon/iptsd /dev/ipts/0
[02:35:00.733] [error] Failed to open hidraw device: Permission denied
home@daniel-tablet1:~/CLionProjects/iptsd$ build/src/daemon/iptsd /dev/ipts/0
[02:35:16.774] [error] Failed to read HID device info: Inappropriate ioctl for device
home@daniel-tablet1:~/CLionProjects/iptsd$ build/src/daemon/iptsd /dev/ipts/hidraw0
[02:36:18.873] [error] Failed to open hidraw device: No such file or directory

If you would like me to port this, can you please tell me where to put the code and how to get iptsd to run?

StollD commented 1 year ago

You need to build the ipts kernel driver from source and load it. That will create a device node called /dev/hidrawN. The N is not fixed, so you need to check dmesg for it: dmesg | grep IPTS. You can then run iptsd like this: sudo build/src/daemon/iptsd /dev/hidrawN

The code related to contact finding is all in src/contacts: https://github.com/linux-surface/iptsd/tree/master/src/contacts

The touch detection is split between a blob detector and the contact finder. The blob detector just returns all contacts from the heatmap together with their size, and the contact finder then handles palm detection and finger tracking. The blob detector is changable, currently there are two different ones, the original basic algorithm from iptsd and the more advanced one from @qzed.

It should be possible to add your algorithm as an additional blob detector. If it also does palm detection, there would be some overhead, because everything will be filtered twice, but this could be optimized later.

One thing to keep in mind though is that iptsd also needs to run on the Surface Pro X, which is ARM, and has no AVX512 (maybe there is an equivalent?). So there would need to be some sort of check that disables your algorithm when IPTSD is compiled for ARM. But thats something we can worry about when we get there.

danielzgtg commented 1 year ago

algorithm as an additional blob detector

It was not really the objective to add my algorithm as a separate thing. It could be used to save power, but it wasn't the main objective. After all, my algorithm is basically just the first step of your algorithm, with the rest of it missing.

My main objective is to optimize the convolutions in the more accurate version. I remember you saying that the current code is mostly convolutions and they are a bit expensive. I don't think it's automatically optimized because gcc only generates up to AVX2 the last time I checked, and it would shock me if any compiler could generate the AVX512 conflict detection instructions I used.

which is ARM

That would be just CPU feature detection. I would also be important for automated testing if that is present. Furthermore, the feature detection along with an emergency disable override is important in case it causes regressions for other models.

maybe there is an equivalent

It's ARM NEON. gcc should autovectorize and I don't know which manually written instructions would be of any benefit at all. I already skipped SSE and AVX2 straight to AVX512 because it's more consistent so easier to learn, and is ~50% less code I need to write. It should be months before anyone investigates using NEON. Anyway, this week is full of exams for me, and I expect to start the original plan no earlier than next week. As an update, I am busy setting up secure boot and disk encryption as that is taking a week instead of the day I expected, so probably another week or two before this.

danielzgtg commented 1 year ago

I just looked at the code. There is a serious incompatibility. I am using u8/u16 while you are using f64. The most I can do is try to make sure the loops are arranged so they will autovectorize.

StollD commented 1 year ago

I was also thinking about this recently, and I noticed that we would need some sort of dynamic dispatch, because in the future we would want to support the AMD Surface Laptops and the Intel devices with the same binary, and the early AMD variants to my knowledge won't have AVX-512 (and newer Intel CPUs are removing it again, aren't they?).

danielzgtg commented 1 year ago

We can just forget about AVX512.

We have some AVX2 in the assembly after you added x86-64-v3 with 82f8b6e158aad0dd605b84649aecfe18f67f9228. Profiling does not point to AVX2 as the bottleneck. The bottleneck is the remaining code like evaluate in the advanced detector. It does not use any kind of autovectorized SIMD because it is heavily branching. SIMD processes 4+x more elements per cycle, while branch mispredictions cost 4+ cycles. If we just write regular C++ loops that don't branch or recurse, we will be 90% of the way there.

My objectives when writing my AVX512 code were the following:

Ensure the CPU logic matched the as GPU code closely as possible
Learn manual SIMD intrinsics
Learn AVX512
Establish a practical (non .S file) performance upper bound
Have a driver that ran fast and saved battery
Use something easier than AVX2

Half of these objectives are non-existent in iptsd. It also not very readable at the moment and was relying on the GLSL for documentation, but that link is weakening from them drifting apart. It would benefit from me grounding it in C/C++ instead.

I find the development experience easier on AVX512 than AVX2. It's more than 2x less code to write than just having to write code for the other half of the columns. It also involves shifting left and right across all the columns, and that's complicated in the middle of the screen if the other half of the screen is in another register. If I had to write both AVX2 and AVX512, I would have just written AVX2 as writing 2x code is better than writing (3=2+1)x code. I might have also invented the off-label use of AVX512CD as hashmaps.

Regarding the dynamic dispatch we already have some. There is the config loading and the unique_ptr. We would want to add separate backends for experimentation without affecting existing users. If profiling ever does point to AVX2 as a bottleneck, we can add the separate compiles as a separate backends. Or not and use GCC's automatic dynamic dispatch attribute based on CPU features.

linux-surface / iptsd

Porting my AVX512 SP7 algorithm #88