google / leveldb

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
BSD 3-Clause "New" or "Revised" License

Trouble understanding how crc is computed #719

Open orrorcol opened 5 years ago

orrorcol commented 5 years ago

I did a lot of googling, but I still cannot understand the following lines: https://github.com/google/leveldb/blob/53e280b56866ac4c90a9f5fcfe02ebdfd4a19832/util/crc32c.cc#L296-L299

I can understand the normal way CRC is computed, like this: https://github.com/google/leveldb/blob/53e280b56866ac4c90a9f5fcfe02ebdfd4a19832/util/crc32c.cc#L289-L290

verdy-p commented 2 years ago

That's just a classic optimization (for speed) frequently used in software, where the CRC is computed by reading groups of 4 bytes instead of reading the input byte by byte (as long as the input is in a safe buffer whose readable size is a multiple of 4 bytes, correctly padded with zeroes if needed).

This works by using 4 distinct "stride tables" (one per byte position), which together are equivalent to the single lookup table used in naive code where each byte in the group would be scanned individually with 4 successive lookups. Because the combining operation is a plain XOR, each byte position of the current CRC value can be looked up independently and the four results XORed together; the "stride tables" perform the adjustment needed for each position.
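
To make the 4-byte step concrete, here is a minimal sketch of a generic "slicing-by-4" inner loop for a reflected CRC such as CRC-32C. Everything here (the ExtendBy4 name, the table parameter and its layout, the little-endian assumption) is illustrative rather than LevelDB's exact code: LevelDB organizes its loop around its own kStrideExtensionTable0..3 arrays, but the principle is the same. One way to build such tables is sketched further down in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// table[k][b] holds the CRC remainder of byte b followed by k zero bytes
// (table[0] is the ordinary byte-at-a-time table); see the table-building
// sketch further down in this thread.
uint32_t ExtendBy4(uint32_t crc, const uint8_t* data, size_t n,
                   const uint32_t table[4][256]) {
  while (n >= 4) {
    // Fold the next 4 input bytes into the CRC (little-endian host assumed).
    uint32_t word;
    std::memcpy(&word, data, sizeof(word));
    crc ^= word;
    // One independent lookup per byte position, combined with XOR.
    crc = table[3][crc & 0xff] ^ table[2][(crc >> 8) & 0xff] ^
          table[1][(crc >> 16) & 0xff] ^ table[0][crc >> 24];
    data += 4;
    n -= 4;
  }
  // Real code would finish any remaining 1-3 bytes with the byte-at-a-time loop.
  return crc;
}
```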

This is not a question specific to the LevelDB implementation or its usage; the idea is general to the CRC32 algorithm itself (or any of its variants, which may differ only in the byte order of the input and output bytes, in how padding of the input is performed, in the initialization constant, or in whether a complemented value is returned). The same would apply to CRC16, or to longer CRCs.

Note also that even the 8-bit (byte-at-a-time) CRC is itself an optimization: a lookup table is needed in that case, but would not be needed at all if the input were processed bit by bit (in that form it is implemented in hardware by very few binary gates, using a single bit shift and a few XOR ops). In software, however, handling data bit by bit is too slow. Using an 8-bit lookup table is still fast because the lookup data (256 entries of 4 bytes, i.e. 1 KB) easily fits in the L1 data cache of all modern CPUs, and the optimization above just uses 4 lookup tables of the same size, for a total of 4 KB, which also fits easily in an L1 cache (typically 32 KB or more).

Each table fills only 16 cache lines ("tagged rows") of about 64 bytes each, and those lines are loaded from the L2 or L3 caches or from the external memory bus in very few cycles. All L1, L2, and L3 caches have far more than 16 lines, so performing a table lookup in a tight loop will not flush and reload parts of the same table: the table is read only once, and it is evicted only if concurrent threads share the cache. Even then, with only a few concurrent threads, the lines frequently accessed by one thread are unlikely to be evicted by another thread during the tight loop; they will only be evicted some time after the loop has exited, when other work touches unrelated data stored elsewhere.
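
For comparison, here is a sketch of the bit-by-bit form (one shift and one conditional XOR per input bit, as in hardware) next to the byte-at-a-time form that collapses those 8 steps into a single table lookup. The function names are illustrative; 0x82F63B78 is the reflected form of the CRC-32C polynomial that LevelDB uses, and the pre/post-conditioning (XOR with all ones, as in standard CRC-32C) is left out here.

```cpp
#include <cstddef>
#include <cstdint>

constexpr uint32_t kCrc32cPoly = 0x82F63B78u;  // reflected CRC-32C polynomial

// Bit-by-bit: one shift and one conditional XOR per input bit.
uint32_t ExtendBitwise(uint32_t crc, const uint8_t* data, size_t n) {
  while (n--) {
    crc ^= *data++;
    for (int k = 0; k < 8; ++k)
      crc = (crc >> 1) ^ (kCrc32cPoly & (0u - (crc & 1)));
  }
  return crc;
}

// Byte-at-a-time: the 256-entry table collapses the 8 inner iterations
// above into a single lookup per input byte.
uint32_t ExtendByByte(uint32_t crc, const uint8_t* data, size_t n,
                      const uint32_t table[256]) {
  while (n--)
    crc = table[(crc ^ *data++) & 0xff] ^ (crc >> 8);
  return crc;
}
```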

So in general, small lookup tables used in a tight loop are very fast, about 1 cycle per lookup, whereas not using such lookups would require many more cycles with the unoptimized code processing the data byte by byte (or, worse, bit by bit). This is possible with CRC simply because CRC uses only XOR ops acting on bits at the same position, not more complex arithmetic, and shifting the result for the next step is exactly as if the bits did not move. As well, an XOR only depends on the parity of its operands and not on their relative order (XOR is commutative and associative), so the work can be parallelized, which makes the optimization simple: just precompute the different "stride tables" of constants.

If you have a better processor with large enough caches, and your CPU supports 64-bit arithmetic, you could even use 8 stride tables (one for each byte), process the input in groups of 8 bytes, and get even faster execution, though the additional gain over the 4-table version is smaller. If your CPU were very small (e.g. a Z80 or 6502, which use 8-bit arithmetic and have no large L1 caches), such an optimization would of course not pay off: lookups into randomly accessed tables stored in external memory would dominate the execution time. And in hardware such as an FPLA, using lookups is very costly, while using simple binary XOR gates and shifts is extremely fast. It all comes down to the architecture of the processing unit and how it can access data in the minimum number of clock cycles.
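
For the 8-byte variant mentioned above ("slicing-by-8"), the step looks the same, just with eight tables and a 64-bit read. Again this is an illustrative sketch (hypothetical ExtendBy8, same table layout as the earlier sketch, little-endian host assumed, n a multiple of 8), not code from LevelDB.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// table[k][b] holds the CRC remainder of byte b followed by k zero bytes,
// for k = 0..7.
uint32_t ExtendBy8(uint32_t crc, const uint8_t* data, size_t n,
                   const uint32_t table[8][256]) {
  while (n >= 8) {
    uint64_t word;
    std::memcpy(&word, data, sizeof(word));
    word ^= crc;  // fold the current CRC into the first 4 input bytes
    crc = table[7][word & 0xff]         ^ table[6][(word >> 8) & 0xff] ^
          table[5][(word >> 16) & 0xff] ^ table[4][(word >> 24) & 0xff] ^
          table[3][(word >> 32) & 0xff] ^ table[2][(word >> 40) & 0xff] ^
          table[1][(word >> 48) & 0xff] ^ table[0][word >> 56];
    data += 8;
    n -= 8;
  }
  return crc;
}
```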

orrorcol commented 2 years ago

@verdy-p Thanks very much! Your explanation is wonderful.

dxw1997 commented 1 year ago

(Quotes @verdy-p's explanation above in full.)

But how are the values in kStrideExtensionTable0, kStrideExtensionTable1, kStrideExtensionTable2, and kStrideExtensionTable3 computed?

verdy-p commented 1 year ago

You compute these tables from the exact definition in terms of binary gates (which just combine certain bits backwards, at distances defined directly by the CRC polynomial used). Even the 8-bit stride alone is an optimization made for CPUs that prefer handling byte units in a single instruction with a small set of (8-bit) registers rather than using bit-per-bit instructions and eight (1-bit) registers. You don't need any stride lookup table on an FPLA, for example, or when using 1-bit gates hardwired in silicon.
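
In code form, one common way to derive such tables is shown below: build the ordinary byte table from the polynomial, then derive each further table by appending one more zero byte. This is a generic sketch (BuildStrideTables and kCrc32cPoly are illustrative names); the exact contents and indexing of LevelDB's kStrideExtensionTable0..3 depend on how its main loop is organized, so treat this as the idea rather than a byte-for-byte reproduction of those arrays.

```cpp
#include <cstdint>

constexpr uint32_t kCrc32cPoly = 0x82F63B78u;  // reflected CRC-32C polynomial

// Fill table[k][b] with the CRC remainder of byte b followed by k zero bytes.
// table[0] is the ordinary byte-at-a-time table derived from the polynomial.
void BuildStrideTables(uint32_t table[4][256]) {
  for (uint32_t b = 0; b < 256; ++b) {
    // Ordinary byte table: process the byte value bit by bit.
    uint32_t crc = b;
    for (int k = 0; k < 8; ++k)
      crc = (crc >> 1) ^ (kCrc32cPoly & (0u - (crc & 1)));
    table[0][b] = crc;
  }
  for (int k = 1; k < 4; ++k) {
    for (uint32_t b = 0; b < 256; ++b) {
      // Appending one more zero byte is just one more byte-table step
      // applied to the previous table's entry.
      uint32_t prev = table[k - 1][b];
      table[k][b] = (prev >> 8) ^ table[0][prev & 0xff];
    }
  }
}
```

With tables built this way, the 4-byte step sketched earlier in the thread gives exactly the same result as four successive byte-at-a-time steps.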

So you can tune the same 1-bit-based CRC algorithm for 4-bit, 8-bit, 16-bit, or 32-bit processors (but when you use larger strides indexed as a whole, you also exponentially increase the size of the lookup table, and large lookup tables become very slow if they don't fit entirely in the processor cache and need slow random accesses to memory).

The code here with 4 tables is only tuned for modern 32-bit processors that have large enough memory caches, so that refreshing the cache from memory reads occurs rarely. Some older 32-bit processors have small L1 data caches, and such an optimization would not pay off in multitasking environments with many concurrent threads.

That's also the reason why larger strides are not used: a single lookup table indexed by a full 64-bit stride would require far too many gigabytes (and certainly could not fit in any L1 cache), and prefilling it from memory would use too much bandwidth on the memory buses. It would also greatly increase the implementation size with giant lookup tables, meaning the compiled code would be a far too large file, with a very slow startup, taking up too many resources on the host system.

Ideally, an implementation of CRC should use the best stride size for the host platform and CPU type it targets, but the tuning actually done depends on the most common architectures in the existing market. However, implementing CRC polynomials on an FPLA or in hardwired silicon requires just a handful of binary XOR gates. Once you understand that there are only XOR gates, you can predict how each input bit turns into the set of stride bits after shifting and transforming it a well-defined number of times.