
__hwasan_test_shadow/CheckAddress performance issues #94124

Open easyaspi314 opened 5 months ago

easyaspi314 commented 5 months ago

I was experimenting with doing up-front checks for HWASan in some high-performance code, and I realized that checking the whole range beforehand was much slower than it is with software ASan.

That seems to be because, while __asan_region_is_poisoned has a fast path that checks the start and end before doing a slow O(n) scan, its near-equivalent __hwasan_test_shadow is always a slow O(n) byte-by-byte loop.

It is also a simple byte loop, so, assuming page boundaries are accounted for, both could probably be converted to a fast SIMD loop like strlen or memchr use. This would make checking bounds in advance a lot more viable.
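
For reference, here is roughly what that ASan fast path looks like (a simplified sketch of __asan_region_is_poisoned, not the verbatim source; it leans on ASan-internal helpers such as AddressIsPoisoned, MemToShadow, RoundUpTo/RoundDownTo, and mem_is_zero):

uptr asan_region_is_poisoned_sketch(uptr beg, uptr size) {
  if (!size)
    return 0;
  uptr end = beg + size;
  uptr aligned_b = RoundUpTo(beg, ASAN_SHADOW_GRANULARITY);
  uptr aligned_e = RoundDownTo(end, ASAN_SHADOW_GRANULARITY);
  uptr shadow_beg = MemToShadow(aligned_b);
  uptr shadow_end = MemToShadow(aligned_e);
  // Fast path: both endpoints are addressable and the shadow of the aligned
  // middle is all zero, checked with the word-at-a-time mem_is_zero helper.
  if (!AddressIsPoisoned(beg) && !AddressIsPoisoned(end - 1) &&
      (shadow_end <= shadow_beg ||
       mem_is_zero((const char *)shadow_beg, shadow_end - shadow_beg)))
    return 0;
  // Slow path: locate the first poisoned byte.
  for (uptr addr = beg; addr < end; addr++)
    if (AddressIsPoisoned(addr))
      return addr;
  return 0;
}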

easyaspi314 commented 5 months ago

I drafted a quick NEON version of __hwasan_test_shadow for benchmarking. Alignment handling and thorough testing haven't been done yet.

It's a simple double-wide NEON byte-search loop and it is easily portable to SSE2 (a sketch of the SSE2 detection step follows the function below).

sptr hwasan_test_neon2(const void *p, uptr sz)
{
  if (sz == 0)
    return -1;
  uptr ptr = reinterpret_cast<uptr>(p);
  tag_t ptr_tag = GetTagFromPointer(ptr);
  uptr ptr_raw = UntagAddr(ptr);
  uptr shadow_first = MemToShadow(ptr_raw);
  uptr shadow_last = MemToShadow(ptr_raw + sz);
  sptr shadow_len = shadow_last - shadow_first;
  uptr s = shadow_first;

  if (shadow_len >= 32) {
      uint8x16_t mask = vdupq_n_u8(ptr_tag);

      // Scan 32 shadow bytes (512 bytes of application memory) per iteration.
      while (s + 32 <= shadow_last) {
          uint8x16_t tags_low = vld1q_u8((const uint8_t *)s);
          uint8x16_t tags_high = vld1q_u8((const uint8_t *)s + 16);
          uint8x16_t compare_low = vceqq_u8(mask, tags_low);
          uint8x16_t compare_high = vceqq_u8(mask, tags_high);
          uint8x16_t paired_low = vpminq_u8(compare_low, compare_low);
          uint8x16_t paired_high = vpminq_u8(compare_high, compare_high);
          // Combine with AND (not OR) so a mismatch in either half survives
          // into the reduced 64-bit value.
          uint8x16_t paired = vandq_u8(paired_low, paired_high);
          uint64_t raw = vgetq_lane_u64(vreinterpretq_u64_u8(paired), 0);
          if (__builtin_expect(raw != -1ull, 0)) {
              // Narrow each compare vector to a 64-bit mask with one nibble
              // per shadow byte (the usual vshrn trick).
              uint8x8_t packed_low = vshrn_n_u16(vreinterpretq_u16_u8(compare_low), 4);
              uint8x16_t packed = vshrn_high_n_u16(packed_low, vreinterpretq_u16_u8(compare_high), 4);
              uint64_t raw_low = vgetq_lane_u64(vreinterpretq_u64_u8(packed), 0);
              uint64_t raw_high = vgetq_lane_u64(vreinterpretq_u64_u8(packed), 1);
              uint64_t real_raw = raw_low;
              if (raw_low == -1ull) {
                  real_raw = raw_high;
                  s += 16;
              }
              unsigned mismatch = __builtin_ctzll(~real_raw) >> 2;
              s += mismatch;
              uptr short_size =
                  ShortTagSize(*(tag_t *)s, AddTagToPointer(ShadowToMem(s), ptr_tag));
              sptr offset = ShadowToMem(s) - ptr_raw + short_size;
              return offset < 0 ? 0 : offset;
          }
          s += 32;
      }
  }

  while (shadow_last - s >= 8) {
    uint64_t tags = *(const uint64_t *)s;
    if (__builtin_expect(tags != 0x0101010101010101ull * ptr_tag, 0))
      break;
    s += 8;
  }
  for (; s < shadow_last; ++s) {
    if (*(tag_t *)s != ptr_tag) {
      uptr short_size =
          ShortTagSize(*(tag_t *)s, AddTagToPointer(ShadowToMem(s), ptr_tag));
      sptr offset = ShadowToMem(s) - ptr_raw + short_size;
      return offset < 0 ? 0 : offset;
    }
  }

  uptr end = ptr + sz;
  uptr tail_sz = end & (kShadowAlignment - 1);
  if (!tail_sz)
    return -1;

  uptr short_size =
      ShortTagSize(*(tag_t *)shadow_last, end & ~(kShadowAlignment - 1));
  if (__builtin_expect(tail_sz <= short_size, 1))
    return -1;

  sptr offset = sz - tail_sz + short_size;
  return offset < 0 ? 0 : offset;
}
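
For what it's worth, the SSE2 port of the detection step is straightforward; a hypothetical sketch (first_mismatch32_sse2 is an illustrative name, not an existing helper):

#include <emmintrin.h>

// Returns 32 if all 32 shadow bytes at s match tag, otherwise the index of
// the first mismatching byte.
static inline unsigned first_mismatch32_sse2(const unsigned char *s,
                                             unsigned char tag) {
  __m128i mask = _mm_set1_epi8((char)tag);
  __m128i lo = _mm_loadu_si128((const __m128i *)s);
  __m128i hi = _mm_loadu_si128((const __m128i *)(s + 16));
  // One mask bit per shadow byte: set where the byte equals the pointer tag.
  unsigned bits = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(lo, mask)) |
                  ((unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(hi, mask)) << 16);
  if (bits == 0xFFFFFFFFu)
    return 32;                            // all 32 bytes match
  return (unsigned)__builtin_ctz(~bits);  // index of the first mismatch
}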

On a Pixel 6a (Cortex-X1), the default implementation can test at most 30 GB/s, while the double-wide NEON version peaks at 650 GB/s and seems to run at the speed of cache (~40 GB/s on the shadow memory).

[Screenshot: benchmark results comparing the default and NEON implementations]

So there is a lot of room for optimization even if there isn't an O(1) shortcut on HWAsan.
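
For context, the numbers above come from a simple scan-throughput micro-benchmark; a minimal, hypothetical harness of that kind (not the exact one used here) would look something like:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <sanitizer/hwasan_interface.h>

// Measure how many bytes per second __hwasan_test_shadow can validate on a
// buffer that is expected to be fully accessible.
static double bytes_per_second(const void *p, size_t sz, int iters) {
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; i++)
    if (__hwasan_test_shadow(p, sz) != -1)
      __builtin_trap();
  auto stop = std::chrono::steady_clock::now();
  double secs = std::chrono::duration<double>(stop - start).count();
  return (double)sz * iters / secs;
}

int main() {
  size_t sz = 1 << 20;   // 1 MiB of application memory
  void *p = malloc(sz);  // heap memory is tagged under HWASan
  printf("%.1f GB/s\n", bytes_per_second(p, sz, 1000) / 1e9);
  free(p);
  return 0;
}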

fmayer commented 5 months ago

I was experimenting on doing checks beforehand

Just confirming, you mean checking the whole region at once rather than every access?

easyaspi314 commented 5 months ago

I was experimenting on doing checks beforehand

Just confirming, you mean checking the whole region at once rather than every access?

Yes. Basically, what I was trying to do was call __asan_region_is_poisoned or __hwasan_test_shadow once at the start, treat the pointer as trusted, and run the rest of the code with ASan checks off (roughly the pattern sketched below).

It resulted in classic ASan running at near full speed, since the range check was almost instant, but HWASan ran much slower because it still has to check every shadow byte one at a time.
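
As a hypothetical, simplified illustration of that pattern (checksum stands in for the real hot routine):

#include <cstdlib>
#include <sanitizer/hwasan_interface.h>

// Hot loop built without HWASan instrumentation; callers must have validated
// the whole range up front.
__attribute__((no_sanitize("hwaddress")))
static unsigned checksum_unchecked(const unsigned char *p, size_t n) {
  unsigned sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += p[i];
  return sum;
}

unsigned checksum(const unsigned char *p, size_t n) {
  // One range check up front instead of an instrumented check per access.
  if (__hwasan_test_shadow(p, n) != -1)
    abort();  // some byte of [p, p+n) is not accessible through this pointer
  return checksum_unchecked(p, n);
}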

Also, I checked ASan's code: mem_is_zero is vectorized to use pointer-sized loads (which LLVM further auto-vectorizes).

In addition, this issue also applies to the internal range checks, e.g. in __hwasan_loadN (sketched below).
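
Conceptually, the sized check reached from __hwasan_loadN is the same kind of byte-at-a-time tag loop; a simplified sketch, ignoring short-granule handling and error reporting:

static bool check_address_sized_sketch(uptr p, uptr sz) {
  if (sz == 0)
    return true;
  tag_t ptr_tag = GetTagFromPointer(p);
  uptr ptr_raw = UntagAddr(p);
  tag_t *shadow_first = (tag_t *)MemToShadow(ptr_raw);
  tag_t *shadow_last = (tag_t *)MemToShadow(ptr_raw + sz);
  for (tag_t *t = shadow_first; t < shadow_last; ++t)
    if (*t != ptr_tag)
      return false;  // mismatch: the real code reports/traps here
  return true;
}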

easyaspi314 commented 5 months ago

Actually, porting mem_is_zero and letting Clang auto-vectorize it gives results similar to the double-wide NEON version, at the cost of code size and of the vectorized loop running unconditionally.

bool mem_is_all(const u8 *beg, uptr size, u8 value) {
  const u8 *end = beg + size;
  uptr *aligned_beg = (uptr *)RoundUpTo((uptr)beg, sizeof(uptr));
  uptr *aligned_end = (uptr *)RoundDownTo((uptr)end, sizeof(uptr));
  uptr all = 0;
  u8 all_narrow = 0; // u8 avoids pointless autovectorization on prologue/epilogue
  uptr widemask = (uptr)0x0101010101010101 * value;

  // Prologue.
  for (const u8 *mem = beg; mem < (u8*)aligned_beg && mem < end; mem++)
    all_narrow |= *mem ^ value;
  // Aligned loop.
  for (; aligned_beg < aligned_end; aligned_beg++)
    all |= *aligned_beg ^ widemask;
  // Epilogue.
  if ((u8 *)aligned_end >= beg) {
    for (const u8 *mem = (u8 *)aligned_end; mem < end; mem++)
      all_narrow |= *mem ^ value;
  }
  return (all | all_narrow) == 0;
}
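
A quick standalone sanity check for the helper above (with plain stand-ins for the sanitizer-internal u8/uptr/RoundUpTo/RoundDownTo definitions):

#include <cassert>
#include <cstdint>
#include <cstring>

typedef uint8_t u8;
typedef uintptr_t uptr;
static uptr RoundUpTo(uptr x, uptr b) { return (x + b - 1) & ~(b - 1); }
static uptr RoundDownTo(uptr x, uptr b) { return x & ~(b - 1); }

// mem_is_all() as defined above goes here.

int main() {
  u8 buf[100];
  memset(buf, 0x2a, sizeof(buf));
  assert(mem_is_all(buf, sizeof(buf), 0x2a));   // uniform buffer
  buf[63] = 0;
  assert(!mem_is_all(buf, sizeof(buf), 0x2a));  // one mismatching byte
  assert(mem_is_all(buf, 0, 0x2a));             // empty range is trivially uniform
  return 0;
}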

vitalybuka commented 5 months ago

For hwasan we just need to check the first and last granules; we don't have to bother with the stuff in the middle.
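
If that is acceptable for the use case, a constant-time variant along those lines would only look at the two endpoint granules. An illustrative sketch using the same hwasan-internal helpers as the code above (short-granule handling via ShortTagSize is omitted for brevity, and mismatches strictly inside the range are intentionally not detected):

static bool hwasan_endpoints_match(const void *p, uptr sz) {
  if (sz == 0)
    return true;
  uptr ptr = reinterpret_cast<uptr>(p);
  tag_t ptr_tag = GetTagFromPointer(ptr);
  uptr ptr_raw = UntagAddr(ptr);
  // Only the first and last granules of [p, p+sz) are inspected.
  tag_t first_tag = *(tag_t *)MemToShadow(ptr_raw);
  tag_t last_tag = *(tag_t *)MemToShadow(ptr_raw + sz - 1);
  return first_tag == ptr_tag && last_tag == ptr_tag;
}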

easyaspi314 commented 5 months ago

So there is no reason why it doesn't use the same logic as __asan_region_is_poisoned?

Edit: ASan is actually O(n) as well, but because it has an optimized implementation it doesn't have nearly as much performance overhead as HWASan.

vitalybuka commented 4 months ago

Yes, no reason; it's just not implemented. On average HWASan is faster in most cases, but there are still opportunities to optimize more, like this one.

This one in particular is known, but we are busy with other stuff and hope to fix it sooner or later.

BTW, you're welcome to send patches as well!