localcc / LightningScanner

A lightning-fast memory pattern scanner, capable of scanning gigabytes of data per second.
MIT License

Non scalar scanners overrun buffer #1

Open BlueAmulet opened 3 months ago

BlueAmulet commented 3 months ago

The AVX2 scanner reads 32 bytes at a time, so as chunk approaches size, it reads past the end of the buffer: https://github.com/localcc/LightningScanner/blob/76e59b68c495f31b46438841553c5ae0bcdbfab3/src/backends/Avx2.cpp#L15-L17

The SSE4.2 scanner has the same issue: https://github.com/localcc/LightningScanner/blob/76e59b68c495f31b46438841553c5ae0bcdbfab3/src/backends/Sse42.cpp#L15-L17

This can cause crashes if there is no readable memory past the end of the buffer.

CycloneRing commented 2 months ago

Do you have a fix for it?

BlueAmulet commented 2 months ago

I just subtracted 31 from the size, which is more of a workaround than a fix.
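To spell out the arithmetic behind that workaround: a 32-byte load starting at offset chunk touches bytes chunk through chunk + 31, so it stays in bounds only while chunk + 32 <= size. A minimal sketch of that bound follows. The helper name safeLoadOffsets is hypothetical and not part of LightningScanner.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of LightningScanner.
// Returns how many start offsets (0, 1, 2, ...) allow a full `width`-byte
// load to stay inside a buffer of `size` bytes.
std::size_t safeLoadOffsets(std::size_t size, std::size_t width) {
    assert(width > 0);
    // Guard the subtraction: with an unsigned size_t, `size - (width - 1)`
    // wraps around to a huge value when the buffer is shorter than one load.
    return size >= width ? size - width + 1 : 0;
}
```

For width 32 this is size - 31, which matches the workaround. Two caveats: a plain size - 31 wraps around for buffers shorter than 31 bytes, and a pattern that starts in the last 31 bytes of the buffer is never tested, so the workaround trades the crash for possible missed matches near the end.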

CycloneRing commented 2 months ago

@BlueAmulet How about this? It works for me; let me know if it works for you too.

ScanResult FindAvx2(const Pattern& patternData, void* startAddr, size_t size) {
    constexpr size_t UNIT_SIZE = 32;

    size_t processedSize = 0;

    __m256i pattern = _mm256_load_si256((__m256i*)patternData.data.data());
    __m256i mask = _mm256_load_si256((__m256i*)patternData.mask.data());
    __m256i allZeros = _mm256_set1_epi8(0x00);

    size_t chunk = 0;
    for (; chunk + UNIT_SIZE <= size; chunk += UNIT_SIZE) {
        __m256i chunkData = _mm256_loadu_si256((__m256i*)((char*)startAddr + chunk));

        __m256i blend = _mm256_blendv_epi8(allZeros, chunkData, mask);
        __m256i eq = _mm256_cmpeq_epi8(pattern, blend);

        if (_mm256_movemask_epi8(eq) == 0xffffffff) {
            processedSize += UNIT_SIZE;

            if (processedSize < patternData.unpaddedSize) {
                pattern = _mm256_load_si256((__m256i*)(patternData.data.data() + processedSize));
                mask = _mm256_load_si256((__m256i*)(patternData.mask.data() + processedSize));
            } else {
                char* matchAddr = (char*)startAddr + chunk - processedSize + UNIT_SIZE;
                return ScanResult((void*)matchAddr);
            }
        } else {
            pattern = _mm256_load_si256((__m256i*)patternData.data.data());
            mask = _mm256_load_si256((__m256i*)patternData.mask.data());
            processedSize = 0;
        }
    }

    if (chunk < size) {
        size_t remainingBytes = size - chunk;
        __m256i chunkData = _mm256_loadu_si256((__m256i*)((char*)startAddr + chunk));

        __m256i remainingMask = _mm256_set1_epi8(0x00);
        for (size_t i = 0; i < remainingBytes; ++i) {
            ((char*)&remainingMask)[i] = 0xFF;
        }

        __m256i blend = _mm256_blendv_epi8(allZeros, chunkData, remainingMask);
        __m256i eq = _mm256_cmpeq_epi8(pattern, blend);

        if (_mm256_movemask_epi8(eq) == 0xffffffff) {
            char* matchAddr = (char*)startAddr + chunk;
            return ScanResult((void*)matchAddr);
        }
    }

    return ScanResult(nullptr);
}
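One thing to note about the tail block in the code above: _mm256_loadu_si256 still reads a full 32 bytes from startAddr + chunk, so the overrun can happen before remainingMask is applied. A possible way around that, sketched here in portable C++ without intrinsics, is to copy the remaining bytes into a zero-filled 32-byte local block and load from that block instead. The helper name padTail and the constant kUnitSize are hypothetical and not part of LightningScanner.

```cpp
#include <array>
#include <cstddef>
#include <cstring>

// Width of one AVX2 load, mirroring UNIT_SIZE in the code above.
constexpr std::size_t kUnitSize = 32;

// Hypothetical helper, not part of LightningScanner.
// Copies the last `remaining` bytes of the scan range into a zero-filled
// 32-byte block, so a later 32-byte load reads this local block rather
// than memory past the end of the buffer.
std::array<unsigned char, kUnitSize> padTail(const void* tail, std::size_t remaining) {
    std::array<unsigned char, kUnitSize> block{};  // value-initialized: all zeros
    // Never copy more than one unit, even if the caller passes a larger count.
    const std::size_t count = remaining < kUnitSize ? remaining : kUnitSize;
    std::memcpy(block.data(), tail, count);
    return block;
}
```

The returned block could then be passed to _mm256_loadu_si256 via block.data(). This only addresses the out-of-bounds read: the comparison would still need the pattern's own mask, and the zero padding must not be mistaken for matching data bytes.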

localcc commented 2 months ago

A fix for this, along with performance improvements, is planned, but I have been a bit busy lately. I will try to get it out soon.

CycloneRing commented 2 months ago

@localcc That would be great. My solution did not work.