lemire / streamvbyte

Fast integer compression in C using the StreamVByte codec
Apache License 2.0

Reduce the size of the lookup tables #14

Open lemire opened 6 years ago

lemire commented 6 years ago

The current lookup tables are quite large. Finding a way to substantially reduce their memory usage without adversely affecting performance would be a worthy goal.

aqrit commented 6 years ago

The table for the encoding shuffle can be reduced by 75%. This would cost 1 cycle per loop, I believe.

Same idea as the small tables in simd_prune/despacer projects. Bit-6 and bit-7 of the key byte can be ignored if the corresponding input bytes (the last four bytes) are always written to the output. Unwanted output byte(s) will be overwritten by the next chunk, anyways.

Thus the table only needs 64x16 bytes instead of 256x16 bytes.
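
In code, the trick looks roughly like this (a minimal sketch, not the repo's actual code; `encode_shuf`, `len_lut`, and `encode_quad` are hypothetical names, and the output buffer is assumed to have 16 bytes of slack for the final full-width store):

    #include <tmmintrin.h> // SSSE3
    #include <stdint.h>

    // 64-entry table: every entry shuffles the last value's full 4 bytes,
    // so keys that differ only in bits 6-7 share one entry.
    extern const uint8_t encode_shuf[64][16];
    extern const uint8_t len_lut[256]; // total encoded length per key byte

    static uint8_t *encode_quad(__m128i in, uint8_t key, uint8_t *dataPtr)
    {
        __m128i shuf = _mm_loadu_si128((const __m128i *)encode_shuf[key & 0x3F]);
        // Always store all 16 shuffled bytes; only len_lut[key] of them
        // matter, and the garbage tail is overwritten by the next store.
        _mm_storeu_si128((__m128i *)dataPtr, _mm_shuffle_epi8(in, shuf));
        return dataPtr + len_lut[key];
    }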

aqrit commented 6 years ago

Just mask off the top 2 bits of the key byte:

https://gist.github.com/aqrit/9272c47b3f1ce23c565a7210b6935102#file-svb_alt-c-L398

lemire commented 6 years ago

Damn!!! You are right.

lemire commented 6 years ago

Update: I was thinking about the encoder in my reply.

Coming back to it, I don't see how it can work. The last two bits tell us how many bytes are used by the last of the 4 integers.

Suppose you don't have this information... then you will fill the last four bytes with data. It is not true that "unwanted output byte(s) will be overwritten by the next chunk, anyways." You need some way to set to zero the unwanted bytes.

If you have some way to make it work, please share.

Your general idea is directly applicable to other problems, but I don't see how it applies here.

aqrit commented 6 years ago

> You need some way to set to zero the unwanted bytes.

The encoder never needs to zero bytes. Ever.

> Suppose you don't have this information...

The info is still saved for use by the decoder.

Here is a working drop-in replacement: https://gist.github.com/aqrit/746d2f5e4ad1909230e2283272333dc1

Please test.

lemire commented 6 years ago

Oh! OH! For the encoder. I see. Yes.

lemire commented 6 years ago

Good point. I was thinking about the decoder.

lemire commented 6 years ago

@KWillets @vkazanov : It seems that @aqrit has a fast vectorized encoder that uses smaller tables (and thus less cache).

aqrit commented 6 years ago

The decoding shuffle table could also be reduced by 75% (256x4 bytes instead of 256x16), at a cost of 1 cycle per 128 bits of output (AFAIK).

The shuffle control masks get compressed to 32-bit integers. Masks are decompressed (32->128) by two instructions. However, the table index now scales for "free", saving 1 instruction.

gist
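
The 32->128 decompression itself is in the gist; as a sketch of why the table index now scales for "free" (table names hypothetical, illustrating only the indexing cost):

    #include <emmintrin.h> // SSE2
    #include <stdint.h>

    extern const uint8_t  shuf_lut16[256][16]; // 4 KB table
    extern const uint32_t shuf_lut32[256];     // 1 KB table

    static __m128i load_control(uint8_t key)
    {
        // 16-byte entries: key*16 needs an extra shift, because x86
        // addressing modes only scale an index by 1, 2, 4, or 8.
        __m128i wide = _mm_loadu_si128((const __m128i *)shuf_lut16[key]);
        (void)wide;
        // 4-byte entries: key*4 folds into the addressing mode, so the
        // load is a single movd; two instructions then expand the
        // 32-bit value into the full 16-byte pshufb control.
        return _mm_cvtsi32_si128((int)shuf_lut32[key]);
    }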

lemire commented 6 years ago

Looks clever. I'll investigate soon.

lemire commented 6 years ago

Completed with https://github.com/lemire/streamvbyte/commit/efb310d51e82a76bf9f1353d052e441985916a97

Your name has been added as author.

aqrit commented 6 years ago

Neat. Though, I've still got things to try on my todo list:

  1. Transpose of elements during delta encoding to boost delta decode speed? Back-of-the-envelope calculations -- for a block of 14 xmmwords -- indicate that it'd shave 30 instructions off decoding (per block)... but one would have to wait until all 14 xmmwords were unpacked to begin the delta decode step, so...?

  2. Using AVX2, check the speed of generating the 'compact' (32-bit) decode shuffle control on-the-fly? Obviously this requires a format change, the 2-bit keys would have to be interleaved (1 per byte) instead of packed (all 4 in one byte).

  3. New tail loop on SSSE3 decoder if original count was 16 or greater.

    dataPtr -= 4;
    for (uint32_t i = 0; i < count; i++) { // count, keys, out: decoder state
        size_t k = keys & 3;          // byte length of this element, minus 1
        keys >>= 2;
        dataPtr += k + 1;
        // The 4-byte load *ends* at the element's last byte, so it never
        // reads past the end of the stream; it reads up to 3 bytes behind
        // the element instead, which is safe because earlier elements were
        // already consumed (hence "if original count was 16 or greater").
        uint32_t dw = *((uint32_t*)dataPtr);
        dw >>= ((k ^ 3) * 8);         // (k ^ 3) == 3 - k: drop the stale low bytes
        *out++ = dw;                  // store the decoded value
    }
  4. Add zigzag delta transform
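
For item 4, zigzag coding is the standard transform that keeps small-magnitude signed deltas small after mapping to unsigned; a minimal sketch:

    #include <stdint.h>

    // Maps 0,-1,1,-2,2,... to 0,1,2,3,4,... so a small negative delta
    // still encodes in a single byte.
    static inline uint32_t zigzag_encode(int32_t x) {
        return ((uint32_t)x << 1) ^ (uint32_t)(x >> 31); // x >> 31: sign-fill
    }

    static inline int32_t zigzag_decode(uint32_t x) {
        return (int32_t)((x >> 1) ^ (0u - (x & 1)));
    }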

lemire commented 6 years ago

+1

KWillets commented 5 years ago

I had a thought related to this for the decoder which might be interesting. It adds a few steps vs. a raw lookup, but it may allow scaling to larger register sizes with only logarithmic time growth, and sqrt-sized shuffle tables vs. the original LUT.

The idea is to split the blob of compressed bytes into 8-byte "lanes" based only on the combined length of the first two elements, and then into 4-byte words based on the lengths of the first and third. We use zero-filling to allow the second halves to be processed as fixed 8 or 4-byte fields.

Basic Algorithm: Given a stream of bytes and a control byte {L1,L2,L3,L4}, deposit L1 bytes in the first word, L2 in the 2nd, etc., zero-filling the remaining bytes in each word.

  1. Load L1+L2+L3+L4 bytes into a 16-byte register and zero-fill the remaining bytes.

  2. Look up and apply a shuffle from L1+L2, which keeps the first L1+L2 bytes in place and moves the following 8 bytes to [8..15] (zero-filling the first 8 as needed).

  3. Look up and apply a shuffle from (L1,L3) which is similar in each 8-byte lane: keep the first L1 and L3 bytes in place and move the following 4 bytes to [4..7] and [12..15] respectively, zero-filling the first and third words.

The lookup tables in steps 2 and 3 have 7 and 16 entries, respectively, although the lookups based on L1+L2 and (L1, L3) may need an intermediate table to extract those values quickly.

Step 1 may also be skipped by making the shuffle in step 2 include the length of the second half (49 rather than 7 shuffles), so that the shuffle can work directly on the byte stream without pre-masking.

For SSE this is slower, but for AVX etc. it's O(log(register width)) steps, and e.g. the shuffle tables for 8-at-a-time max out at 256 entries rather than 64K (the square root of the original method's table size).
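
A scalar model may make the two-level split concrete (a sketch only; the vectorized version would replace each group of byte moves with one pshufb whose control comes from the small tables described above):

    #include <stdint.h>
    #include <string.h>

    // Deposit four elements of lengths l1..l4 (1..4 bytes each) into four
    // zero-filled little-endian words: first the 8-byte-lane split on
    // L1+L2, then the 4-byte-word split on (L1, L3).
    static void split_decode(const uint8_t *in, int l1, int l2, int l3, int l4,
                             uint8_t out[16])
    {
        uint8_t t[16] = {0};                               // step 1: load + zero-fill
        memcpy(t, in, (size_t)(l1 + l2 + l3 + l4));

        uint8_t lanes[16] = {0};                           // step 2: split on L1+L2
        memcpy(lanes, t, (size_t)(l1 + l2));               // elements 1,2 stay in lane 0
        memcpy(lanes + 8, t + l1 + l2, (size_t)(l3 + l4)); // elements 3,4 move to lane 1

        memset(out, 0, 16);                                // step 3: split on (L1, L3)
        memcpy(out,      lanes,          (size_t)l1);      // element 1 -> word 0
        memcpy(out + 4,  lanes + l1,     (size_t)l2);      // element 2 -> word 1
        memcpy(out + 8,  lanes + 8,      (size_t)l3);      // element 3 -> word 2
        memcpy(out + 12, lanes + 8 + l3, (size_t)l4);      // element 4 -> word 3
    }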

lemire commented 5 years ago

Interesting.

KWillets commented 5 years ago

I was looking at this for UTF-8 conversion as well, but it seems easier when the control byte is already available. UTF-8 needs a lot of pmovmskb/pdep/tzcnt work to get the lengths.

aqrit commented 5 years ago

Step #1 is expensive with AVX2. AVX2 can't shuffle bytes across 128-bit lanes, only dwords/qwords/owords. It makes me doubt that this approach would be better than the "compressed shuffle table" I posted earlier in this thread.
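
For context, a sketch of the constraint (placeholder control vectors; the intrinsics are as documented by Intel):

    #include <immintrin.h> // AVX2

    static __m256i lane_demo(__m256i v, __m256i ctrl_bytes, __m256i ctrl_dwords)
    {
        // vpshufb: bytes move only within their own 128-bit half.
        __m256i within = _mm256_shuffle_epi8(v, ctrl_bytes);
        // Crossing the 128-bit boundary takes dword granularity (vpermd).
        return _mm256_permutevar8x32_epi32(within, ctrl_dwords);
    }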