jatinchowdhury18 / RTNeural

Real-time neural network inferencing
BSD 3-Clause "New" or "Revised" License

SIMD vs memory management performance gain #39

Open zanellia opened 2 years ago

zanellia commented 2 years ago

Really cool project! Concerning your publication https://arxiv.org/pdf/2106.03037.pdf, what do you think contributes the most to the performance gain over PyTorch: SIMD or memory management? After a quick look, it seems that ATen in PyTorch relies on SIMD too (e.g. here https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qnnpack/src/sgemm/5x8-neon.c), but its usage might still be suboptimal? Or does most of the performance gain in RTNeural come from better memory management?

jatinchowdhury18 commented 2 years ago

Thank you! I would imagine the usage of SIMD in RTNeural probably doesn't give much of a performance improvement relative to PyTorch since their library uses SIMD as well. I think the biggest factor in the performance gain is the function inlining and loop unrolling that can be achieved with the compile-time API, and then memory management is probably the second biggest factor. For example, let's say we have a network made up of six dense layers that is run once every sample.

Without the compile-time optimizations, the runtime needs to do six v-table lookups every sample to call the inferencing method for each layer, since the compiler doesn't know which type of layer it's working with. Within each inferencing method, the runtime will then need to loop through each "neuron" and do the required math operations.

With the compile-time optimizations, not only can we skip the v-table lookups and unroll the loops, but the compiler can also optimize the resulting assembly code much more fully, since it can now condense and parallelize operations that it couldn't before.
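To make that concrete, here's a rough sketch (hand-written for this discussion, not RTNeural's actual classes) of the two flavours: a run-time model that calls each layer through a virtual interface, and a compile-time model where the layer types are template parameters, so every call is a direct, inlinable function call:

```cpp
#include <memory>
#include <tuple>
#include <vector>

// A trivial stand-in "layer": y = w * x + b. A real dense layer would do a
// full matrix-vector product here.
struct TinyDense
{
    float w = 1.0f, b = 0.0f;
    float forward (float x) const { return w * x + b; }
};

// Run-time flavour: layers live behind a common interface, so the compiler
// only ever sees a base-class pointer.
struct LayerBase
{
    virtual ~LayerBase() = default;
    virtual float forward (float x) const = 0;
};

struct TinyDenseRT : LayerBase
{
    TinyDense impl;
    float forward (float x) const override { return impl.forward (x); }
};

float processRunTime (const std::vector<std::unique_ptr<LayerBase>>& layers, float x)
{
    for (const auto& layer : layers) // one v-table lookup per layer, per sample
        x = layer->forward (x);
    return x;
}

// Compile-time flavour: the exact layer types are template parameters, so each
// forward() call is a direct call the compiler can inline and unroll.
template <typename... Layers>
float processCompileTime (const std::tuple<Layers...>& layers, float x)
{
    std::apply ([&x] (const auto&... layer) { ((x = layer.forward (x)), ...); }, layers);
    return x;
}
```

With something like `processCompileTime (std::tuple<TinyDense, TinyDense, TinyDense> {}, x)` the compiler can typically collapse the whole chain into a handful of fused multiply-adds, which it can't do across the virtual calls in `processRunTime`.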

With memory management, it really depends on how PyTorch is managing that on their end... generally, each dense layer might need to allocate/de-allocate a vector each time its inferencing method gets called. The cost of that allocation/de-allocation is probably negligible for 99% of the networks people are running with PyTorch, but for audio-rate applications it will usually cause a significant performance hit.
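As a toy illustration (this is not PyTorch's or RTNeural's actual code), the difference comes down to whether the per-call path touches the heap at all:

```cpp
#include <vector>

// Toy dense layer that allocates its output buffer on every call. Fine for
// offline inference, but a heap allocation inside a real-time audio callback
// can take an unpredictable amount of time (or block on a lock).
struct DenseAllocating
{
    std::vector<float> weights; // out_size x in_size, row-major
    int in_size = 0, out_size = 0;

    std::vector<float> forward (const std::vector<float>& in) const
    {
        std::vector<float> out (out_size, 0.0f); // heap allocation per call!
        for (int i = 0; i < out_size; ++i)
            for (int j = 0; j < in_size; ++j)
                out[i] += weights[i * in_size + j] * in[j];
        return out;
    }
};

// Real-time-friendly variant: the output buffer is allocated once up front,
// so the per-sample forward() call never touches the heap.
struct DensePreallocated
{
    std::vector<float> weights;
    std::vector<float> out; // resized to out_size at construction/reset time
    int in_size = 0, out_size = 0;

    const std::vector<float>& forward (const std::vector<float>& in)
    {
        for (int i = 0; i < out_size; ++i)
        {
            out[i] = 0.0f;
            for (int j = 0; j < in_size; ++j)
                out[i] += weights[i * in_size + j] * in[j];
        }
        return out;
    }
};
```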

Hope this information is helpful!

zanellia commented 2 years ago

Thanks a lot for the quick answer @jatinchowdhury18!

I would not be able to quantify the latency associated with v-table lookups, but I guess they very quickly become negligible as the size of the layer increases?

Since the heaviest linear algebra operations are delegated to high-performance libraries (e.g. Eigen) anyway, when it comes to unrolling, does the main advantage boil down to implicit prefetching and/or better branch prediction?

Finally (and sorry for asking so many questions, I just find the project and the topic very interesting :p), would the performance gain obtained with the run-time API of RTNeural mostly boil down to memory management?

Not sure if this is a completely unreasonable thing to try, but could these things be better quantified with a profiler like gprof? Or are the absolute computation times too short for gprof's resolution?

jatinchowdhury18 commented 2 years ago

No problem!

> I would not be able to quantify the latency associated with v-table lookups, but I guess they very quickly become negligible as the size of the layer increases?

Yes, v-table lookups themselves aren't particularly slow, so they can definitely be amortized against heavier operations, but for RTNeural I think the performance improvements have more to do with the fact that the compiler can't optimize "through" a v-table lookup, since it doesn't know which function will be called on the other side.

> Since the heaviest linear algebra operations are delegated to high-performance libraries (e.g. Eigen) anyway, when it comes to unrolling, does the main advantage boil down to implicit prefetching and/or better branch prediction?

I'm not really an expert on low-level CPU architecture, but as I understand it, if a loop is fully unrolled then there should be no branching at all! That way the CPU can prefetch as many instructions as it likes and will never stall on a mispredicted branch or anything like that. Here's a simple example in Compiler Explorer.
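Something along these lines (a simplified sketch, not the exact example from the link):

```cpp
// Trip count only known at run time: the compiled code keeps a loop counter
// and a conditional branch on every iteration.
float dotRuntime (const float* a, const float* b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

// Trip count fixed at compile time (e.g. N = 8): the compiler can fully unroll
// the loop into a straight-line sequence of multiply-adds, with no loop branch
// left to predict.
template <int N>
float dotCompileTime (const float* a, const float* b)
{
    float sum = 0.0f;
    for (int i = 0; i < N; ++i)
        sum += a[i] * b[i];
    return sum;
}
```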

> Finally (and sorry for asking so many questions, I just find the project and the topic very interesting :p), would the performance gain obtained with the run-time API of RTNeural mostly boil down to memory management?

> Not sure if this is a completely unreasonable thing to try, but could these things be better quantified with a profiler like gprof? Or are the absolute computation times too short for gprof's resolution?

I believe the performance gain from the run-time API is mostly due to memory management, though I'd have to study the PyTorch implementation in more detail to be sure. Using a profiler would definitely be helpful for understanding where the performance gains are coming from. I spent some time with the MSVC profiler when working on the compile-time API, but I was mostly comparing against older versions of the library rather than against other libraries.

zanellia commented 2 years ago

Thanks a lot @jatinchowdhury18! I might run gprof on some examples implemented with PyTorch in order to get a better idea.