Closed nigeltao closed 5 months ago
Disabling sanitizers is indeed scary. But also, I don't think the disabling is required to use x86 SIMD.
This requires that the CRC tables are always available. Currently those can be omitted to reduce library size if appropriate compiler flags are used which indicate unconditional support for the CLMUL instructions. For CRC32 this isn't fully implemented yet though.
Instead, you can add crc_simd_body preconditions that its buf and size arguments must be 16-byte aligned. Make the callers (crc32_arch_optimized and crc64_arch_optimized) responsible for the leading/lagging bytes that aren't 16-byte aligned. Out-of-bounds reads are no longer needed.
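The suggested split could be sketched roughly as below. This is illustrative only, not the actual xz code: the real crc_simd_body uses CLMUL intrinsics, so a plain bitwise CRC-32 stands in for it here, and the function names are placeholders. The point is the head/aligned-body/tail structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain byte-at-a-time CRC-32 (reflected polynomial, as in lzma_crc32),
   used here for the unaligned head and tail bytes. */
static uint32_t crc32_generic(const uint8_t *buf, size_t size, uint32_t crc)
{
	crc = ~crc;
	while (size-- > 0) {
		crc ^= *buf++;
		for (int i = 0; i < 8; ++i)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
	}
	return ~crc;
}

/* Stand-in for crc_simd_body: the real version would use CLMUL and
   could assume buf is 16-byte aligned and size is a multiple of 16,
   so no out-of-bounds reads are ever needed. */
static uint32_t crc32_aligned_body(const uint8_t *buf, size_t size,
                                   uint32_t crc)
{
	return crc32_generic(buf, size, crc);
}

/* Hypothetical caller in the style of crc32_arch_optimized. */
static uint32_t crc32_arch_optimized_sketch(const uint8_t *buf, size_t size,
                                            uint32_t crc)
{
	/* Leading bytes up to the next 16-byte boundary. */
	size_t head = (size_t)(-(uintptr_t)buf & 15);
	if (head > size)
		head = size;
	crc = crc32_generic(buf, head, crc);
	buf += head;
	size -= head;

	/* Aligned middle: a multiple of 16 bytes. */
	size_t body = size & ~(size_t)15;
	crc = crc32_aligned_body(buf, body, crc);

	/* Lagging tail. */
	return crc32_generic(buf + body, size - body, crc);
}
```

Chaining works because the generic routine inverts the CRC state on entry and exit, so splitting the input into head/body/tail gives the same result as one pass.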
This is obviously possible. I kept CRC_USE_GENERIC_FOR_SMALL_INPUTS as a comment because with very tiny buffers the CLMUL version could be worse, so your suggestion would be a step in that direction. Performance comparisons with tiny buffers and different alignments should be done to confirm that the alternative version is good too.
I'll add this to a list of things to look at.
> This requires that the CRC tables are always available.

You probably already know this, but... just noting that the patch only needs the 1 x 256 x uint32_t flavor (1 kilobyte of data) of the lzma_crc32_table, not the full 8 x 256 x uint32_t flavor (8 kilobytes). Plus the CRC-64 equivalent, obviously.
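For context, the 1 x 256 flavor is the single table the byte-at-a-time method needs; the 8 x 256 flavor is seven additional derived tables used by the slice-by-8 code. A sketch of generating and using the small table (reflected CRC-32 polynomial, as in lzma_crc32; names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* The 1 x 256 x uint32_t table: 1 KiB of data. */
static uint32_t crc32_table[256];

static void crc32_table_init(void)
{
	for (uint32_t b = 0; b < 256; ++b) {
		uint32_t r = b;
		for (int i = 0; i < 8; ++i)
			r = (r >> 1) ^ (0xEDB88320u & -(r & 1u));
		crc32_table[b] = r;
	}
}

/* Byte-at-a-time table method; needs only the table above. */
static uint32_t crc32_table_method(const uint8_t *buf, size_t size,
                                   uint32_t crc)
{
	crc = ~crc;
	while (size-- > 0)
		crc = crc32_table[(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
	return ~crc;
}
```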
Ah, a copy/paste typo in the diff. The two ~ characters here should be omitted:

+ crc_simd_body(buf, size, &v0, &v1, vfold16,
+               _mm_cvtsi32_si128((int32_t)~crc));

etc.

+ crc = ~(uint32_t)_mm_extract_epi32(v0, 2);
https://github.com/JoernEngel/joernblog/blob/778f0007b9a580e477608691f1aa86369f0efdd2/crc64.c might be interesting (i.e. worth copying).
Using a 16-byte temporary buffer on stack for the first and last input bytes is certainly a very simple change. But I'll check the other option too, that is, making the SIMD code much shorter and using the table-method for the unaligned beginning and the end.
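The temporary-buffer idea could be sketched like this (a minimal illustration, not the actual xz code): copy the partial block into a zero-padded 16-byte buffer so the SIMD code only ever reads that buffer, never outside the caller's input.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load up to 16 input bytes into a zero-padded
   16-byte block. The SIMD code would read `block` with one full-width
   load; the input buffer itself is only touched in-bounds. */
static size_t load_head_block(const uint8_t *buf, size_t size,
                              uint8_t block[16])
{
	size_t n = size < 16 ? size : 16;
	memset(block, 0, 16);
	memcpy(block, buf, n);  /* reads exactly n in-bounds bytes */
	return n;               /* how many input bytes the block covers */
}
```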
I created the branch crc_edits two weeks ago but I'm not happy with the performance especially with small buffers. I have better code coming which is faster with both big and small buffers and won't trigger sanitizers. :-) It won't be in the bug fix releases as it wouldn't make sense to risk regressions when the current code works correctly.
Note that doing 16-byte-aligned reads is likely a common way in assembly code. The idea is described, for example, in Agner Fog's optimizing_assembly.pdf (2023-Jul-01), page 126, "Reading from the nearest preceding 16-bytes boundary". Sanitizers don't sanitize assembly code, and Valgrind is smart enough to see when bytes in the SIMD registers get ignored. Valgrind doesn't complain about the C code either. (Basically every __asm__ statement in the C code is implicitly also no_sanitize_address, but implicit doesn't look as scary.)
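The pointer arithmetic behind that technique can be shown without any SIMD (only the plan is computed here; the actual wide load is what sanitizers object to in C, even though a 16-byte load from a 16-byte boundary cannot cross a page boundary). The struct and function names are illustrative:

```c
#include <stddef.h>
#include <stdint.h>

/* Plan for "reading from the nearest preceding 16-bytes boundary":
   the SIMD loop would load 16 bytes from `aligned` and discard the
   first `skip` bytes, which lie before the caller's buffer. */
typedef struct {
	uintptr_t aligned;  /* address rounded down to a 16-byte boundary */
	size_t skip;        /* 0..15 leading bytes to discard */
} head_load;

static head_load head_load_plan(uintptr_t addr)
{
	head_load h;
	h.aligned = addr & ~(uintptr_t)15;
	h.skip = (size_t)(addr & 15);
	return h;
}
```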
Also, the current CLMUL code isn't used merely on x86; it is compatible with E2K too. It's quite likely that separate versions are needed, as unaligned access (MOVDQU) on x86 seems to be fine in terms of performance on processors that support CLMUL. I have no clue about E2K myself, but I hope to get feedback from the contributor who tested the current code on E2K.
See #127.
https://tukaani.org/xz-backdoor/review.html discusses the crc_attr_no_sanitize_address (i.e. __attribute__((__no_sanitize_address__))) in crc_x86_clmul.h:

> Disabling sanitizers is indeed scary. But also, I don't think the disabling is required to use x86 SIMD.
>
> Instead, you can add crc_simd_body preconditions that its buf and size arguments must be 16-byte aligned. Make the callers (crc32_arch_optimized and crc64_arch_optimized) responsible for the leading/lagging bytes that aren't 16-byte aligned. Out-of-bounds reads are no longer needed.