vidpid->state lookup container improvement; changing to `std::array`

diablodale commented 1 year ago

While refactoring XLink using libusb wrappers, I saw that the vidpid -> xlink state lookup is implemented as std::unordered_map vidPidToDeviceState. For XLink's use case, this is a poor choice as it uses more memory, more binary size, and more cpu than a simple solution with std::array.

I wanted to learn/play with some new ai and benchmarking tools. They all agree to switch to std::array. I am changing it in my refactor.

Setup

XLink at all current major branches
all platforms and compilers

Repro

First found with code review. godbolts and benchmarks below.

with only 4 elements, std::array is more efficient in code and memory than std::unordered_map
std::array has zero memory overhead and no allocations; can store the values at compile-time; search by comparing (at most) 8 ints
std::vector has memory overhead of ~24 bytes (on 64bit); runtime has to insert values into heap; search by jumping to heap location of vector data and comparing (at most) 8 ints
std::unordered_map has memory overhead of 72+ bytes (on 64bit) + ~20 bytes per element; runtime has to insert values; search by hash calc, hash calc, binary op, jump to calculated hash bucket pointer, jump to buckets linked list, walk linked list and compare 2 ints for each entity until find the match (walking list up to 4 times...likely less)

For this use-case of fixed never-changing elements and very small size, std::array is the best choice.

At https://godbolt.org/z/KM3E9Gdfd is code for both today's std::ordered_map and a new impl with std::array. You can chose which you want with -DUSE_UNMAP or -DUSE_ARRAY. After it compiles, notice several things

Result

The std::array implementation has compile-time created the array and stored it in the binary for immediate use by the app. This results in faster load times, no overhead, no memory allocations. gcc is also optimizing the compiled code given it knows what is in the array at compile-time. In contrast, the map implementation will have the raw initialization data stored somewhere in the binary and then it has to make heap allocations for each created entry into the map and make copies of the initialization data into them.
The binary size difference is substantial. array = 95,505 and unordered_map = 497,156. The map is 4x larger. (additional 400kb) in filesize 😥. All that extra code for nothing since this map is never mutated and t-i-n-y.

When benchmarked https://quick-bench.com/q/CNuAH-ebWBdson0Epl-F--C3ZXI the array implementation is 3.8x faster than map.

Slightly altered code to generate random vidpids show array still faster (1.3x faster) but not as dramatic. https://quick-bench.com/q/orjPkib5xLpID0nitgWJc-VowTY. I think this is due to the addional load of generating random numbers and that since a fixed search vidpid isn't known at compile time, then gcc can't optimize as aggressively.

The speed increase is not really significant. If that were the only thing I would say this is microoptimization ;-) However, the file size is significant. Saving 400kb in filesize justifies the trivial change to std::array.

themarpe commented 1 year ago

The speed increase is not really significant. If that were the only thing I would say this is microoptimization ;-) However, the file size is significant. Saving 400kb in filesize justifies the trivial change to std::array.

Agree on the conclusion - though WRT file size, what is that measuring, final compiled binary?

diablodale commented 1 year ago

Yes final -O2 compiled binary size on disk. You can see it in the godbolt links. It is in tiny text in the compiler output pane. I do also see unordered_map being used for the createPlatformDeviceFdKey() map store. The file savings might not be as large since some of it is (maybe?) shared.

The createPlatformDeviceFdKey map itself is a area that I'll investigate within just a few days as it involves my c++ work in #63. This is one of the C <-> C++ boundaries that I want to handle well. During that investigation, I'll explore how large that map becomes and how often it is mutated....to understand what container is best used for it. While the USB scenario isn't so many (only so many ports usually on a computer), the PoE scenario could be up to MAX_LINKS which is currently 64.

themarpe commented 1 year ago

Yes final -O2 compiled binary size on disk. You can see it in the godbolt links. It is in tiny text in the compiler output pane. I do also see unordered_map being used for the createPlatformDeviceFdKey() map store. The file savings might not be as large since some of it is (maybe?) shared.

Got it - yeah, in that case would be intermixed with other uses

The createPlatformDeviceFdKey map itself is a area that I'll investigate within just a few days as it involves my c++ work in https://github.com/luxonis/XLink/pull/63. This is one of the C <-> C++ boundaries that I want to handle well. During that investigation, I'll explore how large that map becomes and how often it is mutated....to understand what container is best used for it. While the USB scenario isn't so many (only so many ports usually on a computer), the PoE scenario could be up to MAX_LINKS which is currently 64.

This is a "one time" modded structure per up/down event. And the lookups thereafter to retrieve the actual FD (per R/W event)

diablodale commented 1 year ago

This is fixed in my fork of XLink. It is faster (tests above) and smaller filesize (also above). The latter is most beneficial when xlink is a DLL instead of a static library. While this is function is not a high-volume use, these fixes are code simplification; removing unneeded complexity and filesize.

Closing as the fixes can be seen around https://github.com/diablodale/XLink/blob/636dc4142774e11f9a1d68125575f2782d65744a/src/pc/protocols/usb_host.cpp#L165

luxonis / XLink