I'm trying to use compressedbytes to get the allocation size beforehand, but it turned out to be awfully slow: it takes more time than the (SIMD-accelerated) encoding itself.
This PR fixes that and does a few things:
It makes scalar compressedbytes branchless so that it's less awful.
It adds an SSE4.1 path for compressedbytes, reusing the mask calculation from the encode path.
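
For illustration, here is a minimal sketch of the branchless scalar idea. It is not the code in this PR. It assumes the usual Stream VByte layout (one control byte per four values, then 1 to 4 data bytes per value), and the names `sketch_value_bytes` and `sketch_compressed_bytes` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of data bytes (1..4) needed for one value.
 * Each comparison evaluates to 0 or 1, so the sum needs no branches. */
static inline size_t sketch_value_bytes(uint32_t v) {
    return 1 + (v > 0x000000FFu) + (v > 0x0000FFFFu) + (v > 0x00FFFFFFu);
}

/* Hypothetical stand-in for a scalar compressed-size function.
 * Assumes one control byte per group of four values (rounded up). */
static size_t sketch_compressed_bytes(const uint32_t *in, size_t length) {
    size_t total = (length + 3) / 4; /* control bytes */
    for (size_t i = 0; i < length; i++) {
        total += sketch_value_bytes(in[i]); /* data bytes */
    }
    return total;
}
```

The loop body has no data-dependent branch, so the cost per value should not depend on how the value sizes are distributed in the input.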
Sorry, but there is no implementation for:
ARM NEON (I don't have a machine to test on)
0124 path (don't have unit tests and the code structure is somewhat different).