Idea:
Each SIMD lane follows its own interleaved path through the input, maybe 8 characters at a time. The compiler should be able to generate masked operations automatically for dictionary updates that depend on whether a symbol was predicted or not. This also means a lot of dictionaries in encoding/decoding, presumably one per lane. 8 dictionaries = 64 kB (L2 bandwidth) or 64 dictionaries = 512 kB (maybe L3 bandwidth), i.e. 8 kB per dictionary. Anything wider could make the operations as slow as main-memory (RAM) bandwidth.
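A minimal sketch of the idea in scalar C++, written so that the inner loop over lanes is a candidate for vectorization. Everything specific here is an assumption for illustration and not part of the note: the names (`kLanes`, `kEntries`, `Dict`, `footprint_bytes`, `count_hits`), the byte-wise interleaving (lane k reads bytes k, k + 8, k + 16, ...), the 4096 x 16-bit table layout chosen to reach 8 kB per dictionary, and the toy context hash. Whether a given compiler actually turns the select into masked gather/scatter operations is not guaranteed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sizes: 4096 entries x 2 bytes = 8 kB per dictionary.
constexpr std::size_t kLanes = 8;       // one dictionary per SIMD lane (assumed)
constexpr std::size_t kEntries = 4096;  // must be a power of two for the mask below
using Dict = std::array<std::uint16_t, kEntries>;

static_assert(sizeof(Dict) == 8 * 1024, "8 kB per dictionary");

// Total dictionary footprint for a given number of lanes.
constexpr std::size_t footprint_bytes(std::size_t lanes) {
    return lanes * sizeof(Dict);
}

// Each lane predicts its next byte from its own dictionary and counts
// correct predictions. Trailing bytes that do not fill a full step are ignored.
std::array<std::uint32_t, kLanes> count_hits(const std::string& input) {
    std::vector<Dict> dicts(kLanes);          // zero-initialized tables
    std::array<std::uint16_t, kLanes> ctx{};  // per-lane context (table index)
    std::array<std::uint32_t, kLanes> hits{};
    const std::size_t steps = input.size() / kLanes;

    for (std::size_t i = 0; i < steps; ++i) {
        // One iteration of this loop corresponds to one SIMD lane.
        for (std::size_t k = 0; k < kLanes; ++k) {
            assert(ctx[k] < kEntries);
            const std::uint16_t sym =
                static_cast<unsigned char>(input[i * kLanes + k]);
            std::uint16_t& slot = dicts[k][ctx[k]];
            const bool hit = (slot == sym);
            hits[k] += hit ? 1u : 0u;
            // Conditional update written as a select, so it can map to a
            // masked store instead of a branch.
            slot = hit ? slot : sym;
            // Toy context hash (assumption): mix the previous context with sym.
            ctx[k] = static_cast<std::uint16_t>(
                ((ctx[k] << 4) ^ sym) & (kEntries - 1));
        }
    }
    return hits;
}
```

With these assumed sizes, `footprint_bytes(8)` gives the 64 kB figure and `footprint_bytes(64)` the 512 kB figure from the note. `count_hits` returns one hit counter per lane, which makes it easy to check that the lanes stay independent of each other.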