Open BlueAmulet opened 3 months ago
Do you have a fix for it?
I just subtracted 31 from the size, which isn't really a fix, more of a workaround.
@BlueAmulet How about this? It works for me; let me know if it works for you too.
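For anyone wanting to try the same workaround, here is a minimal sketch of what it amounts to. `SafeScanSize` is a hypothetical helper, not part of LightningScanner, and it assumes the scanner loads 32 bytes at each offset it visits. It also guards against the unsigned underflow that a plain `size - 31` would hit on buffers shorter than 32 bytes.

```cpp
#include <cstddef>

// Hypothetical helper (not part of LightningScanner).
// Returns how many start offsets allow a full `unit`-byte load that stays
// inside a buffer of `size` bytes. For unit = 32 this is size - 31, or 0
// when the buffer is too small to hold even one full load.
constexpr std::size_t SafeScanSize(std::size_t size, std::size_t unit = 32) {
    return size >= unit ? size - (unit - 1) : 0;
}
```

Passing the reduced size to the scanner keeps every load in bounds, but a pattern that sits in the last 31 bytes of the buffer is never examined, which is why this is only a workaround.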
ScanResult FindAvx2(const Pattern& patternData, void* startAddr, size_t size) {
    constexpr size_t UNIT_SIZE = 32;
    size_t processedSize = 0;
    __m256i pattern = _mm256_load_si256((__m256i*)patternData.data.data());
    __m256i mask = _mm256_load_si256((__m256i*)patternData.mask.data());
    __m256i allZeros = _mm256_set1_epi8(0x00);
    size_t chunk = 0;
    for (; chunk + UNIT_SIZE <= size; chunk += UNIT_SIZE) {
        __m256i chunkData = _mm256_loadu_si256((__m256i*)((char*)startAddr + chunk));
        __m256i blend = _mm256_blendv_epi8(allZeros, chunkData, mask);
        __m256i eq = _mm256_cmpeq_epi8(pattern, blend);
        if (_mm256_movemask_epi8(eq) == 0xffffffff) {
            processedSize += UNIT_SIZE;
            if (processedSize < patternData.unpaddedSize) {
                pattern = _mm256_load_si256((__m256i*)(patternData.data.data() + processedSize));
                mask = _mm256_load_si256((__m256i*)(patternData.mask.data() + processedSize));
            } else {
                char* matchAddr = (char*)startAddr + chunk - processedSize + UNIT_SIZE;
                return ScanResult((void*)matchAddr);
            }
        } else {
            pattern = _mm256_load_si256((__m256i*)patternData.data.data());
            mask = _mm256_load_si256((__m256i*)patternData.mask.data());
            processedSize = 0;
        }
    }
    if (chunk < size) {
        size_t remainingBytes = size - chunk;
        __m256i chunkData = _mm256_loadu_si256((__m256i*)((char*)startAddr + chunk));
        __m256i remainingMask = _mm256_set1_epi8(0x00);
        for (size_t i = 0; i < remainingBytes; ++i) {
            ((char*)&remainingMask)[i] = 0xFF;
        }
        __m256i blend = _mm256_blendv_epi8(allZeros, chunkData, remainingMask);
        __m256i eq = _mm256_cmpeq_epi8(pattern, blend);
        if (_mm256_movemask_epi8(eq) == 0xffffffff) {
            char* matchAddr = (char*)startAddr + chunk;
            return ScanResult((void*)matchAddr);
        }
    }
    return ScanResult(nullptr);
}
A fix for this, along with performance improvements, is planned, but I have been a bit busy lately. I will try to get it out soon.
@localcc That would be great. My solution did not work.
The AVX2 scanner reads 32 bytes at once, so as `chunk` approaches the end of `size`, it ends up reading past the end of the buffer: https://github.com/localcc/LightningScanner/blob/76e59b68c495f31b46438841553c5ae0bcdbfab3/src/backends/Avx2.cpp#L15-L17
The SSE4.2 scanner also has the same issue: https://github.com/localcc/LightningScanner/blob/76e59b68c495f31b46438841553c5ae0bcdbfab3/src/backends/Sse42.cpp#L15-L17
This can cause crashes if there is no readable memory past the end of the buffer.
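One common way to avoid the over-read is to copy the final partial chunk into a zero-filled 32-byte stack buffer and run the wide load on that copy instead of on the original memory. The sketch below only illustrates that idea. `LoadTailPadded` and `kUnitSize` are hypothetical names, not LightningScanner code, and this is not necessarily how the library will fix it.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kUnitSize = 32;  // bytes in one AVX2 register

// Hypothetical helper (not part of LightningScanner).
// Copies the last `remaining` bytes of the scanned buffer into a
// zero-filled 32-byte block, so a later 32-byte load reads only memory
// owned by the caller and never touches bytes past the end of the buffer.
std::array<std::uint8_t, kUnitSize> LoadTailPadded(const void* tail,
                                                   std::size_t remaining) {
    std::array<std::uint8_t, kUnitSize> block{};  // value-initialized to zero
    if (remaining > kUnitSize) {
        remaining = kUnitSize;  // never copy more than one unit
    }
    if (remaining != 0) {
        std::memcpy(block.data(), tail, remaining);
    }
    return block;
}
```

The returned block can then be handed to `_mm256_loadu_si256` in place of the pointer into the original buffer. Because the padding bytes are zero, the caller would still need to reject any match that would extend past `size`.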