nfrechette / acl

Animation Compression Library
MIT License

Optimize vector unpacking #114

Open nfrechette opened 6 years ago

nfrechette commented 6 years ago

Animated data is currently packed with big-endian ordering, but little-endian might be better for decompression speed. See here: https://fgiesen.wordpress.com/2018/02/19/reading-bits-in-far-too-many-ways-part-1/

This is not a trivial issue: SSE and NEON versions must also be written and compared. AVX-512 has a rotate intrinsic, but it is not available on PS4/XB1 nor on most processors in the wild that we currently care about. BMI 1/2 can be used since it is supported everywhere we care about, so with fast BMI-based scalar code it may turn out that SIMD is actually slower than scalar.

nfrechette commented 6 years ago

See https://github.com/nfrechette/acl/tree/research/unpack_vector3_n for WIP

nfrechette commented 4 years ago

Perhaps we can do a switch statement over the possible alignment values. Since every case body should be about the same size, the compiler might be able to lower it to an efficient jump table. Does it lead to less data being fetched? At least the code is laid out linearly and may be prefetched during execution. Perhaps _mm_alignr_epi8 can be used as well (SSSE3).

See also:
- http://web.archive.org/web/20120408131243/http://x264dev.multimedia.cx/archives/8
- http://web.archive.org/web/20120417184641/http://x264dev.multimedia.cx/archives/96
- https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables

nfrechette commented 2 years ago

Because we prefetch a lot of data, we have the opportunity to do some work for free, hidden behind the memory latency. In particular, before we touch the segment data, we prefetch it to prime the TLB: the segment is highly likely to live in a different memory page, and with the prefetch issued early, the page walk happens ahead of time instead of stalling the demand access.

Instead of relying on hard-coded constant tables, which can be quite large (16-32 bytes per bit rate), we could generate the constants on the fly every time we decompress. They can be written to the stack, which will be warm in L1, and they can be generated efficiently with SIMD as well. This work is entirely independent of everything else, and we should be able to pack the necessary data into the instruction stream as immediate values. This will avoid cache misses.

For pose decompression this should be a net win, but it isn't clear whether the same holds for single track decompression. There we might fall back to the current code.