Open easyaspi314 opened 5 months ago
I drafted a quick NEON version of __hwasan_test_shadow for benchmarking. Alignment handling and thorough testing haven't been done yet.
It's a simple double-wide NEON byte-search loop, and it is easily portable to SSE2 (a rough port is sketched below the function).
// Requires <arm_neon.h>. Double-wide (32 shadow bytes per iteration) variant
// of __hwasan_test_shadow.
sptr hwasan_test_neon2(const void *p, uptr sz) {
  if (sz == 0)
    return -1;
  uptr ptr = reinterpret_cast<uptr>(p);
  tag_t ptr_tag = GetTagFromPointer(ptr);
  uptr ptr_raw = UntagAddr(ptr);
  uptr shadow_first = MemToShadow(ptr_raw);
  uptr shadow_last = MemToShadow(ptr_raw + sz);
  sptr shadow_len = shadow_last - shadow_first;
  uptr s = shadow_first;
  if (shadow_len >= 32) {
    uint8x16_t mask = vdupq_n_u8(ptr_tag);
    while (s + 32 <= shadow_last) {
      uint8x16_t tags_low = vld1q_u8((const uint8_t *)s);
      uint8x16_t tags_high = vld1q_u8((const uint8_t *)s + 16);
      // 0xFF where the shadow byte matches the pointer tag, 0x00 otherwise.
      uint8x16_t compare_low = vceqq_u8(mask, tags_low);
      uint8x16_t compare_high = vceqq_u8(mask, tags_high);
      // Pairwise min folds each 16-byte compare into 8 bytes (replicated in
      // both halves); taking the min of the two results keeps any mismatch
      // (0x00) visible in the low 64 bits.
      uint8x16_t paired_low = vpminq_u8(compare_low, compare_low);
      uint8x16_t paired_high = vpminq_u8(compare_high, compare_high);
      uint8x16_t paired = vminq_u8(paired_low, paired_high);
      uint64_t raw = vgetq_lane_u64(vreinterpretq_u64_u8(paired), 0);
      if (__builtin_expect(raw != -1ull, 0)) {
        // Locate the first mismatch: shift-right-narrow by 4 turns each byte
        // into a nibble, giving a 64-bit mask with 4 bits per shadow byte.
        uint8x8_t packed_low =
            vshrn_n_u16(vreinterpretq_u16_u8(compare_low), 4);
        uint8x16_t packed =
            vshrn_high_n_u16(packed_low, vreinterpretq_u16_u8(compare_high), 4);
        uint64_t raw_low = vgetq_lane_u64(vreinterpretq_u64_u8(packed), 0);
        uint64_t raw_high = vgetq_lane_u64(vreinterpretq_u64_u8(packed), 1);
        uint64_t real_raw = raw_low;
        if (raw_low == -1ull) {
          // Mismatch is in the second 16 bytes.
          real_raw = raw_high;
          s += 16;
        }
        unsigned mismatch = __builtin_ctzll(~real_raw) >> 2;
        s += mismatch;
        uptr short_size =
            ShortTagSize(*(tag_t *)s, AddTagToPointer(ShadowToMem(s), ptr_tag));
        sptr offset = ShadowToMem(s) - ptr_raw + short_size;
        return offset < 0 ? 0 : offset;
      }
      s += 32;
    }
  }
  // Word-at-a-time loop for the remainder (assumes unaligned 64-bit loads are
  // acceptable, which they are on AArch64).
  while (shadow_last - s >= 8) {
    uint64_t tags = *(const uint64_t *)s;
    if (__builtin_expect(tags != 0x0101010101010101ull * ptr_tag, 0))
      break;
    s += 8;
  }
  for (; s < shadow_last; ++s) {
    if (*(tag_t *)s != ptr_tag) {
      uptr short_size =
          ShortTagSize(*(tag_t *)s, AddTagToPointer(ShadowToMem(s), ptr_tag));
      sptr offset = ShadowToMem(s) - ptr_raw + short_size;
      return offset < 0 ? 0 : offset;
    }
  }
  // Short-granule handling for the tail, as in the original __hwasan_test_shadow.
  uptr end = ptr + sz;
  uptr tail_sz = end & (kShadowAlignment - 1);
  if (!tail_sz)
    return -1;
  uptr short_size =
      ShortTagSize(*(tag_t *)shadow_last, end & ~(kShadowAlignment - 1));
  if (__builtin_expect(tail_sz <= short_size, 1))
    return -1;
  sptr offset = sz - tail_sz + short_size;
  return offset < 0 ? 0 : offset;
}
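For reference, here is a rough, untested sketch of how the 32-byte inner comparison could be ported to SSE2, assuming the same surrounding variables as in the function above. _mm_movemask_epi8 takes the place of the NEON shrn nibble-mask trick and directly yields one bit per shadow byte.

// Requires <emmintrin.h>. Drop-in replacement for the NEON 32-byte loop (sketch only).
__m128i mask = _mm_set1_epi8((char)ptr_tag);
while (s + 32 <= shadow_last) {
  __m128i tags_low = _mm_loadu_si128((const __m128i *)s);
  __m128i tags_high = _mm_loadu_si128((const __m128i *)(s + 16));
  // One bit per byte: 1 where the shadow byte equals the pointer tag.
  uint32_t eq_low = (uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(mask, tags_low));
  uint32_t eq_high = (uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(mask, tags_high));
  uint32_t eq = eq_low | (eq_high << 16);
  if (__builtin_expect(eq != 0xFFFFFFFFu, 0)) {
    s += __builtin_ctz(~eq);  // first mismatching shadow byte in this block
    // ... return the ShortTagSize/offset result as in the NEON version ...
  }
  s += 32;
}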
On a Pixel 6a (Cortex-X1), the default implementation tests at most ~30 GB/s, while the double-wide NEON version peaks at ~650 GB/s and appears to run at the speed of cache (~40 GB/s of shadow memory, since each shadow byte covers a 16-byte granule).
So there is a lot of room for optimization even if there isn't an O(1) shortcut for HWASan.
I was experimenting on doing checks beforehand
Just confirming, you mean checking the whole region at once rather than every access?
Yes. Basically, what I was trying to do was use __asan_region_is_poisoned or __hwasan_test_shadow at the start, say "this pointer is trusted", and then run the rest of the code with sanitizer instrumentation off.
It resulted in classic ASan running at near full speed, since the range check was almost instant, but HWASan ran much slower because it still has to check every shadow byte one at a time.
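To make that concrete, here is a minimal sketch of the pattern I mean (process_unchecked is a hypothetical helper; __hwasan_test_shadow is declared in sanitizer/hwasan_interface.h and returns -1 when the whole range is accessible with the pointer's tag):

#include <sanitizer/hwasan_interface.h>
#include <stdlib.h>

// Hot loop compiled without per-access HWASan checks (hypothetical helper).
__attribute__((no_sanitize("hwaddress")))
static void process_unchecked(const char *buf, size_t n) {
  /* ... */
}

static void process(const char *buf, size_t n) {
  // One up-front range check instead of a shadow check per access.
  if (__hwasan_test_shadow(buf, n) != -1)
    abort();  // some byte in [buf, buf+n) doesn't match the pointer's tag
  process_unchecked(buf, n);
}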
Also, I checked ASan's code: mem_is_zero is vectorized to use pointer-sized loads (which LLVM further auto-vectorizes).
In addition, this issue also applies to the internal range checks, e.g. in __hwasan_loadN.
Actually, porting mem_is_zero and letting Clang autovectorize it gives results similar to the double-wide NEON version, at the cost of code size and of running unconditionally.
bool mem_is_all(const u8 *beg, uptr size, u8 value) {
  const u8 *end = beg + size;
  uptr *aligned_beg = (uptr *)RoundUpTo((uptr)beg, sizeof(uptr));
  uptr *aligned_end = (uptr *)RoundDownTo((uptr)end, sizeof(uptr));
  uptr all = 0;
  u8 all_narrow = 0;  // u8 avoids pointless autovectorization on prologue/epilogue
  uptr widemask = (uptr)0x0101010101010101 * value;
  // Prologue.
  for (const u8 *mem = beg; mem < (u8 *)aligned_beg && mem < end; mem++)
    all_narrow |= *mem ^ value;
  // Aligned loop.
  for (; aligned_beg < aligned_end; aligned_beg++)
    all |= *aligned_beg ^ widemask;
  // Epilogue.
  if ((u8 *)aligned_end >= beg) {
    for (const u8 *mem = (u8 *)aligned_end; mem < end; mem++)
      all_narrow |= *mem ^ value;
  }
  return (all | all_narrow) == 0;
}
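A hedged sketch of how this could slot into __hwasan_test_shadow (hypothetical wiring; the short-granule tail handling at the end of the existing function stays as-is):

// Hypothetical fast path: if every shadow byte in the range equals ptr_tag,
// only the short-granule tail (if any) still needs to be checked.
if (mem_is_all((const u8 *)shadow_first, shadow_last - shadow_first, ptr_tag)) {
  // ... existing tail_sz / ShortTagSize handling ...
} else {
  // Fall back to the byte loop to locate and report the first mismatch.
}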
For HWASan we just need to check the first and last granule; we don't have to bother with the stuff in the middle. So there is no reason why it isn't doing the same logic as __asan_region_is_poisoned?
Edit: ASan is actually O(n) as well, but because it has an optimized implementation it doesn't have nearly as much performance overhead as HWASan.
Yes, no reason, it's just not implemented. On average HWASan is faster in most cases, but there are still opportunities to optimize more, like this one.
This particular one is known, but we are busy with other stuff and hope to fix it sooner or later.
BTW, you're welcome to send patches as well!
I was experimenting with doing checks beforehand for HWASan in some high-performance code, and I realized that checking beforehand was much slower than with software ASan.
That seems to be because, while __asan_region_is_poisoned has a fast path that checks the start and end before doing a slow O(n) scan, the near equivalent __hwasan_test_shadow is always a slow O(n) byte-by-byte loop.
Since it is a simple byte loop, assuming page boundaries are accounted for, both could probably be converted to a fast SIMD loop like strlen or memchr. This would make checking bounds in advance a lot more viable.
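For comparison, the fast path in __asan_region_is_poisoned has roughly this shape (paraphrased from memory, not a verbatim copy of the upstream code): check the first and last application bytes, then run a vectorized all-zero check over the aligned shadow in between.

// Rough shape of ASan's check (paraphrased):
if (!AddressIsPoisoned(beg) && !AddressIsPoisoned(end - 1) &&
    (shadow_end <= shadow_beg ||
     mem_is_zero((const char *)shadow_beg, shadow_end - shadow_beg)))
  return 0;  // region is fully addressable
// Otherwise fall back to a slow scan to find the first poisoned byte.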