This is a very small improvement, but nonetheless I think it's worth it. The improvement comes in two steps:
Inverting the count logic, so that we count up from zero and no longer need to subtract from the slice length.
For the non-SIMD case, using simple shifts and and/or operations instead of the rather complex equality check.
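
To illustrate the two steps, here is a minimal sketch of what such a non-SIMD count could look like. It is not the actual code from this change: the names `LSB`, `leading_byte_mask` and `count_chars`, and the choice of `u64` words, are assumptions made for the example. It counts bytes that are not UTF-8 continuation bytes (`10xxxxxx`), starting from zero, and uses only shifts, a NOT, an OR and an AND per word.

```rust
/// The lowest bit of each of the eight byte lanes in a u64 (name is hypothetical).
const LSB: u64 = 0x0101_0101_0101_0101;

/// For every byte lane, yields 1 if the byte is NOT a UTF-8 continuation
/// byte (bit pattern 10xxxxxx) and 0 otherwise.
/// A byte starts a char if its top bit is clear or its second-highest bit is set.
fn leading_byte_mask(word: u64) -> u64 {
    // `!word >> 7` moves the inverted top bit of each lane to the lane's bit 0,
    // `word >> 6` moves the second-highest bit there; bits that leak in from
    // neighbouring lanes are discarded by the final mask.
    ((!word >> 7) | (word >> 6)) & LSB
}

/// Counts chars in (assumed valid) UTF-8 by counting leading bytes up from
/// zero, so the slice length never needs to be subtracted from.
pub fn count_chars(bytes: &[u8]) -> usize {
    let mut count = 0;
    let mut chunks = bytes.chunks_exact(8);
    for chunk in &mut chunks {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        let word = u64::from_le_bytes(buf);
        // Each set bit in the mask marks one leading byte.
        count += leading_byte_mask(word).count_ones() as usize;
    }
    // Handle the trailing bytes that do not fill a whole word.
    for &b in chunks.remainder() {
        count += ((b & 0xC0) != 0x80) as usize;
    }
    count
}
```

For valid UTF-8 input, `count_chars(s.as_bytes())` should agree with `s.chars().count()`.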
In my benchmarks I see a consistent improvement in the non-SIMD case for all char counts where we win against naive, and a minor improvement in some cases for the SIMD and AVX paths. Even if it amounts to almost nothing, it will likely shrink the code by a few bytes, which may help with inlining and caching.