What pext does is different from what the pext instruction does

When I tried certain inputs and did not enable -mbmi2, my mask was calculated as 163839. The compilation crashed because the nbits calculation produced an extremely large number. And I found that the issue was caused by an incorrect clz calculaiton.

constexpr auto clz = [](auto x) {
  decltype(x) n{};
  if (not x) return decltype(x)(sizeof(x) * __CHAR_BIT__);
  while (x >>= 1u) n++;
  return n;
};
constexpr auto mbits = __builtin_popcountl(mask);
constexpr auto nbits = size - mbits - clz(mask.value);
constexpr auto coefficient = [&] {
  auto set = false;
  auto dst = nbits;
  T result{};
  for (auto i = 0u; i < size; ++i) {
    const auto curr = ((T(1) << i) & mask) != T();
    if (curr and not set) result = result | (T(1) << (dst - i));
    dst += curr;
    set = curr;
  }
  return result;
}();
return (((a & mask) * coefficient) >> nbits) & ((T(1) << mbits) - T(1));

I fixed the error and submitted a pull request.

In addition, I found that the pext algorithm here is not correct. The article from the link you provided mentions that when the gaps between the bits we are interested in are too small, multiplying by a coefficient can cause addition that leads to a carry, which makes the result incorrect.

Assuming the mask is 21 and a is also 21, the coefficient will become (4 + 2 + 1) = 7 but the result will be 4 while it should be 7.

    0 0 0 1 0 1 0 1   (value << 0)
    0 0 1 0 1 0 1 0   (value << 1)
    0 1 0 1 0 1 0 0   (value << 2)    +
---------------------------------------

You can easily generate the mask as 21 using the following input and the lookup result will be incorrect.

static constexpr std::array inputs {
    std::pair {0u, 0u},    // 00000
    std::pair {1u, 1u},    // 00001
    std::pair {4u, 2u},    // 00100
    std::pair {5u, 3u},    // 00101
    std::pair {16u, 4u},   // 10000
    std::pair {17u, 5u},   // 10001
    std::pair {20u, 6u},   // 10100
    std::pair {21u, 7u},   // 10101
};

I also checked the Intel repository link you provided, and it has the same issue with pseudo_pext_t. However, their perfect hash lookup does not produce errors because they use pseudo_pext_t to ensure that hash values are unique when selecting masks. In contrast, MPH only checks the key & mask. I also posted an issue there, but I'm not sure if this was an intentional design choice.

Since I’m unsure how you would like the modification to be made, if the goal is to ensure mask selection like Intel's approach, the behavior of the pext function still appears inconsistent and somewhat strange. Ensuring adequate gaps between mask bits would increase the computation, which doesn’t seem like a very practical solution. Therefore, for now, I have only fixed the clz function.

qlibs / mph

What pext does is different from what the pext instruction does #8