tukaani-project / xz

XZ Utils
https://tukaani.org/xz/

no_sanitize_address isn't required #112

Closed. nigeltao closed this issue 4 weeks ago.

nigeltao commented 2 months ago

https://tukaani.org/xz-backdoor/review.html discusses the crc_attr_no_sanitize_address (i.e. __attribute__((__no_sanitize_address__))) in crc_x86_clmul.h:

This attribute is obviously scary but it is unfortunately required with the x86 SIMD code. The code makes aligned 16-byte reads which may read up to 15 bytes before the beginning or past the end of the buffer if the buffer is misaligned. The unneeded bytes are then ignored. It cannot cross page boundaries and thus cannot cause access violations.
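
For concreteness, the trick being described looks roughly like this; an illustrative sketch, not the actual liblzma code:

#include <stdint.h>
#include <emmintrin.h>

// Round the address down to the nearest preceding 16-byte boundary and
// do an aligned load. The load may cover up to 15 bytes before buf, but
// it stays inside the same page, so it cannot fault; the unwanted bytes
// are masked off or ignored afterwards.
static inline __m128i
load_preceding_aligned(const uint8_t *buf)
{
	const uintptr_t addr = (uintptr_t)buf & ~(uintptr_t)15;
	return _mm_load_si128((const __m128i *)addr);
}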

Disabling sanitizers is indeed scary. But also, I don't think disabling them is required to use x86 SIMD.

Instead, you can add crc_simd_body preconditions that its buf argument must be 16-byte aligned and its size argument a multiple of 16. Make the callers (crc32_arch_optimized and crc64_arch_optimized) responsible for the leading/lagging bytes outside those aligned 16-byte chunks. Out-of-bounds reads are then no longer needed.

diff --git a/src/liblzma/check/crc_x86_clmul.h b/src/liblzma/check/crc_x86_clmul.h
index f1254ece..7ecd16ee 100644
--- a/src/liblzma/check/crc_x86_clmul.h
+++ b/src/liblzma/check/crc_x86_clmul.h
@@ -91,6 +91,9 @@ crc_simd_body(const uint8_t *buf, const size_t size, __m128i *v0, __m128i *v1,
        //     [skip_start][size][skip_end]
        //     [     size2      ]
        //
+       // This code can be simplified. skip_start and skip_end are now always 0,
+       // size is a positive multiple of 16 and size2 = size.
+       //
        // A and D are 16-byte aligned. B and C are 1-byte aligned.
        // skip_start and skip_end are 0-15 bytes. size is at least 1 byte.
        //
@@ -253,6 +256,15 @@ crc32_arch_optimized(const uint8_t *buf, size_t size, uint32_t crc)
                return crc;
 #endif

+       // Process any leading bytes (before the 16-byte chunks).
+       crc = ~crc;
+       while (15 & (uintptr_t)buf) {
+               if (size == 0)
+                       return ~crc;
+               crc = lzma_crc32_table[0][*buf++ ^ (crc & 0xFF)] ^ (crc >> 8);
+               --size;
+       }
+
        // uint32_t poly = 0xedb88320;
        const int64_t p = 0x1db710640; // p << 1
        const int64_t mu = 0x1f7011641; // calc_lo(p, p, 32) << 1 | 1
@@ -264,21 +276,34 @@ crc32_arch_optimized(const uint8_t *buf, size_t size, uint32_t crc)
        const __m128i vfold8 = _mm_set_epi64x(0, k5);
        const __m128i vfold16 = _mm_set_epi64x(k4, k3);

-       __m128i v0, v1, v2;
+       // Process 16-byte chunks that are 16-byte aligned.
+       size_t size1 = size & 15;
+       size_t size0 = size - size1;
+       if (size0) {
+               __m128i v0, v1, v2;
+
+               crc_simd_body(buf, size0, &v0, &v1, vfold16,
+                               _mm_cvtsi32_si128((int32_t)~crc));
+               buf += size0;
+
+               v1 = _mm_xor_si128(
+                               _mm_clmulepi64_si128(v0, vfold16, 0x10), v1); // xxx0
+               v2 = _mm_shuffle_epi32(v1, 0xe7); // 0xx0
+               v0 = _mm_slli_epi64(v1, 32);  // [0]
+               v0 = _mm_clmulepi64_si128(v0, vfold8, 0x00);
+               v0 = _mm_xor_si128(v0, v2);   // [1] [2]
+               v2 = _mm_clmulepi64_si128(v0, vfold4, 0x10);
+               v2 = _mm_clmulepi64_si128(v2, vfold4, 0x00);
+               v0 = _mm_xor_si128(v0, v2);   // [2]
+               crc = ~(uint32_t)_mm_extract_epi32(v0, 2);
+       }

-       crc_simd_body(buf,  size, &v0, &v1, vfold16,
-                       _mm_cvtsi32_si128((int32_t)~crc));
-
-       v1 = _mm_xor_si128(
-                       _mm_clmulepi64_si128(v0, vfold16, 0x10), v1); // xxx0
-       v2 = _mm_shuffle_epi32(v1, 0xe7); // 0xx0
-       v0 = _mm_slli_epi64(v1, 32);  // [0]
-       v0 = _mm_clmulepi64_si128(v0, vfold8, 0x00);
-       v0 = _mm_xor_si128(v0, v2);   // [1] [2]
-       v2 = _mm_clmulepi64_si128(v0, vfold4, 0x10);
-       v2 = _mm_clmulepi64_si128(v2, vfold4, 0x00);
-       v0 = _mm_xor_si128(v0, v2);   // [2]
-       return ~(uint32_t)_mm_extract_epi32(v0, 2);
+       // Process any lagging bytes (after the 16-byte chunks).
+       while (size1--) {
+               crc = lzma_crc32_table[0][*buf++ ^ (crc & 0xFF)] ^ (crc >> 8);
+       }
+
+       return ~crc;
 }
 #endif // BUILDING_CRC32_CLMUL

@@ -343,6 +368,8 @@ crc64_arch_optimized(const uint8_t *buf, size_t size, uint64_t crc)
                return crc;
 #endif

+       // Ditto.
+
        // const uint64_t poly = 0xc96c5795d7870f42; // CRC polynomial
        const uint64_t p  = 0x92d8af2baf0e1e85; // (poly << 1) | 1
        const uint64_t mu = 0x9c3e466c172963d5; // (calc_lo(poly) << 1) | 1
Larhzu commented 2 months ago

Disabling sanitizers is indeed scary. But also, I don't think disabling them is required to use x86 SIMD.

This requires that the CRC tables are always available. Currently those tables can be omitted to reduce library size when compiler flags indicate unconditional support for the CLMUL instructions. For CRC32 this isn't fully implemented yet, though.
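
Roughly this kind of build-time gating; the macro name below is illustrative, not the one liblzma actually uses:

#include <stddef.h>
#include <stdint.h>

uint32_t crc32_generic(const uint8_t *buf, size_t size, uint32_t crc);
uint32_t crc32_arch_optimized(const uint8_t *buf, size_t size, uint32_t crc);

#if defined(CRC_CLMUL_UNCONDITIONAL)	// illustrative macro name
// Compiler flags guarantee CLMUL everywhere: the table-based fallback
// is never referenced, so lzma_crc32_table can be left out entirely.
uint32_t
lzma_crc32(const uint8_t *buf, size_t size, uint32_t crc)
{
	return crc32_arch_optimized(buf, size, crc);
}
#else
// Otherwise the tables must stay in the library for the fallback.
uint32_t
lzma_crc32(const uint8_t *buf, size_t size, uint32_t crc)
{
	return crc32_generic(buf, size, crc);
}
#endif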

Instead, you can add crc_simd_body preconditions that its buf argument must be 16-byte aligned and its size argument a multiple of 16. Make the callers (crc32_arch_optimized and crc64_arch_optimized) responsible for the leading/lagging bytes outside those aligned 16-byte chunks. Out-of-bounds reads are then no longer needed.

This is obviously possible. I kept CRC_USE_GENERIC_FOR_SMALL_INPUTS as a comment because with very tiny buffers the CLMUL version could be worse, so your suggestion would be a step in that direction. Performance comparisons with tiny buffers and different alignments should be done to confirm that the alternative version is good too.
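
Something along these lines, as a hypothetical harness built on liblzma's public lzma_crc32 (link with -llzma):

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <lzma.h>

// Hypothetical micro-benchmark: a tiny, odd-sized buffer at each
// alignment offset 0..15.
int
main(void)
{
	static uint8_t raw[64 + 16];
	memset(raw, 0xA5, sizeof(raw));

	for (size_t align = 0; align < 16; ++align) {
		const uint8_t *buf = raw + align;
		uint32_t crc = 0;
		clock_t t0 = clock();
		for (long i = 0; i < 10000000L; ++i)
			crc = lzma_crc32(buf, 31, crc);
		double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
		printf("align %2zu: %.3f s (crc %08x)\n",
				align, secs, (unsigned)crc);
	}
	return 0;
}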

I'll add this to a list of things to look at.

nigeltao commented 2 months ago

This requires that the CRC tables are always available.

You probably already know this, but... just noting that the patch only needs the 1 x 256 x uint32_t flavor (1 kilobyte of data) of the lzma_crc32_table, not the full 8 x 256 x uint32_t flavor (8 kilobytes). Plus the CRC-64 equivalent, obviously.
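
For reference, that 1 x 256 flavor is the classic byte-at-a-time table; a standard sketch of how such a table is built and used:

#include <stddef.h>
#include <stdint.h>

static uint32_t crc32_table[256];	// 1 x 256 x uint32_t: 1 KiB

static void
crc32_table_init(void)
{
	for (uint32_t b = 0; b < 256; ++b) {
		uint32_t r = b;
		for (int i = 0; i < 8; ++i)
			r = (r >> 1) ^ ((r & 1) ? UINT32_C(0xEDB88320) : 0);
		crc32_table[b] = r;
	}
}

// Byte-at-a-time update, the same form the patch uses for the
// unaligned head and tail (crc is already inverted by the caller).
static uint32_t
crc32_update(uint32_t crc, const uint8_t *buf, size_t size)
{
	while (size--)
		crc = crc32_table[(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
	return crc;
}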

nigeltao commented 2 months ago

Ah, copy/paste typo in the diff. The two ~s here should be omitted:

+               crc_simd_body(buf, size0, &v0, &v1, vfold16,
+                               _mm_cvtsi32_si128((int32_t)~crc));
etc.
+               crc = ~(uint32_t)_mm_extract_epi32(v0, 2);
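
With crc already inverted by the new leading-bytes loop, the corrected lines read:

                crc_simd_body(buf, size0, &v0, &v1, vfold16,
                                _mm_cvtsi32_si128((int32_t)crc));
etc.
                crc = (uint32_t)_mm_extract_epi32(v0, 2);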
nigeltao commented 2 months ago

https://github.com/JoernEngel/joernblog/blob/778f0007b9a580e477608691f1aa86369f0efdd2/crc64.c might be interesting (i.e. worth copying).

Larhzu commented 2 months ago

Using a 16-byte temporary buffer on the stack for the first and last input bytes is certainly a very simple change. But I'll check the other option too, that is, making the SIMD code much shorter and using the table method for the unaligned beginning and end.
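
The first option would look roughly like this (an illustrative sketch with made-up names, not the actual branch code):

#include <stdalign.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>

// Load the first n bytes (0 < n < 16) through an aligned, zero-padded
// stack buffer, so no out-of-bounds read ever happens.
static __m128i
load_partial16(const uint8_t *buf, size_t n)
{
	alignas(16) uint8_t tmp[16] = { 0 };
	memcpy(tmp, buf, n);	// touches only [buf, buf + n)
	return _mm_load_si128((const __m128i *)tmp);
}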

Larhzu commented 1 month ago

I created the branch crc_edits two weeks ago, but I'm not happy with the performance, especially with small buffers. I have better code coming which is faster with both big and small buffers and won't trigger sanitizers. :-) It won't be in the bug fix releases, as it wouldn't make sense to risk regressions when the current code works correctly.

Note that doing 16-byte-aligned reads is likely a common technique in assembly code. The idea is described, for example, in Agner Fog's optimizing_assembly.pdf (2023-Jul-01), page 126, "Reading from the nearest preceding 16-bytes boundary". Sanitizers don't sanitize assembly code, and Valgrind is smart enough to see when bytes in the SIMD registers get ignored. Valgrind doesn't complain about the C code either. (Basically every __asm__ statement in the C code is implicitly no_sanitize_address too, but implicit doesn't look as scary.)

Also, the current CLMUL code isn't used merely on x86; it is compatible with E2K too. It's quite likely that separate versions are needed, as unaligned access (MOVDQU) on x86 seems to be fine performance-wise on processors that support CLMUL. I have no clue about E2K myself, but I hope to get feedback from the contributor who tested the current code on E2K.
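
For reference, at the intrinsics level the x86 choice is just between unaligned and aligned loads; a trivial sketch:

#include <emmintrin.h>

// MOVDQU accepts any address and, on CLMUL-capable processors, tends
// to cost the same as MOVDQA when the address happens to be aligned.
__m128i
load16(const void *p, int is_aligned)
{
	return is_aligned
			? _mm_load_si128((const __m128i *)p)	// MOVDQA
			: _mm_loadu_si128((const __m128i *)p);	// MOVDQU
}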

Larhzu commented 1 month ago

See #127.