Veedrac closed 6 years ago
Great work as always! Even in this implementation, we win against naive count from ~80 chars upwards. Not too shabby for a proof of concept. :+1:
I made a bunch of graphs (the tabs are near the bottom of the page).
https://docs.google.com/spreadsheets/d/1MMPiAPwFZvW8_jfiz5qqM8dpDNGppRrGNxtNUNgUa3E/edit?usp=sharing
Given these results, I'm not sure much can be done to improve timings. There are a few things I could try, but none of them looks like a major benefit timing-wise.
So when/if you're happy with the code correctness-wise, might as well submit it as-is.
OK then, I can still benchmark things if I find the time. No need to delay the merge. I'll make a new release soon.
I have a non-SIMD version that shaves a few cycles. I need to clean it up and bring the other versions in line.
No tuning has been done whatsoever. I also have only the hastiest, barest of tests.
Without SIMD
SIMD
AVX