Revisit UTF-8 validation

essen commented 6 months ago

The code in https://github.com/ninenines/cowlib/blob/cc04201c1d0e1d5603cd1cde037ab729b192634c/src/cow_ws.erl#L581-L588 was written a decade ago. The VM has changed a lot. The JSON PR in OTP has a different way of doing this that may be faster: https://github.com/erlang/otp/pull/8111

codeadict commented 6 months ago

For extra info, there is also this discussion about adding a C BIF to the BEAM using this algorithm: https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/

essen commented 6 months ago

OK, I took a long look at all that chatter about UTF-8 validation that I had missed (including https://github.com/erlang/otp/pull/6576, which is fairly interesting). Thank you.

As far as SIMD goes, I am open to believing it could be a better alternative, but that remains to be proven for use within Erlang. Note that some strings can be overly long, so the implementation would need to account for that. This might make it not as good as initially hoped.

The Elixir PR adding a fast_ascii option sounds good, but as far as Cowboy is concerned, users who want to skip this validation (because it will be done when decoding JSON, for example) should use a binary frame. Other users who do use text frames are more likely to use more than just ASCII. At least that's what I've experienced.
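
To make the trade-off concrete, here is a minimal sketch of what an ASCII fast path in front of a full validator might look like. It is written in Python purely for illustration; it is not the Elixir PR's code and not cowlib's implementation. The names `valid_utf8` and `_sequence_spec` and the 8-byte chunk size are assumptions, and the byte ranges follow the well-formed sequences table in RFC 3629.

```python
def _sequence_spec(lead):
    """Return (continuation_count, lo, hi) for a lead byte, or None if invalid.

    lo..hi is the allowed range of the FIRST continuation byte; it is
    narrowed for some leads to reject overlong forms, surrogates and
    code points above U+10FFFF.
    """
    if 0xC2 <= lead <= 0xDF:
        return 1, 0x80, 0xBF
    if lead == 0xE0:
        return 2, 0xA0, 0xBF  # excludes overlong 3-byte forms
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return 2, 0x80, 0xBF
    if lead == 0xED:
        return 2, 0x80, 0x9F  # excludes surrogates U+D800..U+DFFF
    if lead == 0xF0:
        return 3, 0x90, 0xBF  # excludes overlong 4-byte forms
    if 0xF1 <= lead <= 0xF3:
        return 3, 0x80, 0xBF
    if lead == 0xF4:
        return 3, 0x80, 0x8F  # excludes code points above U+10FFFF
    return None  # 0x80..0xC1 and 0xF5..0xFF can never start a sequence


def valid_utf8(data, chunk=8):
    """Validate UTF-8, skipping ahead whenever a whole chunk is ASCII."""
    i, n = 0, len(data)
    while i < n:
        # Fast path (hypothetical chunk size): all-ASCII chunks need no
        # further checks, so skip them in one step.
        if data[i:i + chunk].isascii():
            i += chunk
            continue
        lead = data[i]
        if lead < 0x80:
            i += 1
            continue
        spec = _sequence_spec(lead)
        if spec is None:
            return False
        count, lo, hi = spec
        if i + count >= n:
            return False  # sequence is truncated
        if not lo <= data[i + 1] <= hi:
            return False
        for j in range(i + 2, i + count + 1):
            if not 0x80 <= data[j] <= 0xBF:
                return False
        i += count + 1
    return True
```

The fast path only pays off when long runs of the input are ASCII. For text that mixes in multi-byte characters throughout, the chunk check mostly fails and adds overhead on top of the slow path, which is the concern raised above for typical text frames.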

So for now the ticket is about refreshing the algorithm implementation rather than switching to a different algorithm. But it's possible that I missed something; I didn't actually start working on this and it is not yet a priority.

Revisit UTF-8 validation #136