Open lemire opened 2 years ago
We've talked about this before. It would be interesting to have a transcoder for the general case “single byte ASCII based encoding.” I can try to do that once I'm done with the writeup.
Let me add that the idea should be credited to @clausecker
@clausecker If you assume good AVX-512 support, it seems that vpermi2b would go a long way on this problem.
Supporting it efficiently with AVX/NEON is a fun challenge.
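To make the vpermi2b remark concrete: the instruction picks each output byte from a 128-byte table held in two 64-byte registers, so the upper half of a single-byte encoding (128 entries) fits in one lookup per output byte plane. Below is a scalar model of that selection, for illustration only. The names `Vec64` and `permute2_model` are hypothetical, and this is not how simdutf implements anything.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec64 = std::array<std::uint8_t, 64>;

// Scalar model of the byte selection performed by vpermi2b / vpermt2b:
// `a` holds table entries 0..63, `b` holds entries 64..127.
Vec64 permute2_model(const Vec64& a, const Vec64& idx, const Vec64& b) {
    Vec64 out{};
    for (std::size_t i = 0; i < 64; i++) {
        const std::uint8_t pos = idx[i] & 0x3F;    // bits 0..5: position inside a table
        const bool use_b = (idx[i] & 0x40) != 0;   // bit 6: which table; bit 7 is ignored
        out[i] = use_b ? b[pos] : a[pos];
    }
    return out;
}
```

Because bit 7 of the index is ignored, an input byte in 0x80..0xFF can serve directly as the index into the 128-entry table for the non-ASCII half.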
Bun would use this. JavaScript strings are either Latin-1 or UTF-16. We frequently need to convert from UTF-8 (from disk/network) to either Latin-1 or UTF-16. Currently, we validate ASCII with errors. If the input is ASCII, we do a memcpy; if it is not, we convert to UTF-16 starting at the first non-ASCII character. This works okay.
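The flow described above hinges on finding the first non-ASCII byte. A minimal scalar sketch of that step follows; `first_non_ascii` is a hypothetical helper, not Bun's or simdutf's actual code, and a real implementation would scan many bytes at a time.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the index of the first byte with the high bit set,
// or `len` when the whole buffer is ASCII.
std::size_t first_non_ascii(const std::uint8_t* data, std::size_t len) {
    for (std::size_t i = 0; i < len; i++) {
        if (data[i] & 0x80) {
            return i;
        }
    }
    return len;
}
```

If the result equals `len`, the input can be copied as-is. Otherwise the ASCII prefix `[0, result)` can be widened directly, and only the tail starting at `result` needs a real UTF-8 to UTF-16 conversion.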
Feedback as to the motivation of a feature is important to us.
Computing the UTF-8 size of a Latin 1 string quickly (AVX edition) https://lemire.me/blog/2023/02/16/computing-the-utf-8-size-of-a-latin-1-string-quickly-avx-edition/
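For reference, the quantity in question has a simple scalar definition: a Latin-1 byte below 0x80 takes one byte in UTF-8, and any other byte takes two. A sketch, with the hypothetical name `latin1_utf8_size` (the vectorized versions discussed in the post count the high bits in bulk instead):

```cpp
#include <cstddef>
#include <cstdint>

// UTF-8 size of a Latin-1 buffer: one byte per input byte,
// plus one extra byte for every input byte >= 0x80.
std::size_t latin1_utf8_size(const std::uint8_t* data, std::size_t len) {
    std::size_t extra = 0;
    for (std::size_t i = 0; i < len; i++) {
        extra += data[i] >> 7;  // 1 when the high bit is set, else 0
    }
    return len + extra;
}
```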
We currently fully support Latin1 (ISO/IEC 8859-1), the most popular ISO format, in our main branch.
It is unclear whether we should extend to other European ISO formats. My suspicion is that it would see little use.
I am thinking about closing this issue.
Windows-1252 or CP-1252 would be nice for legacy support. That is what is behind MySQL's previous default character set of latin1. Also, the HTML5 standard says to treat ISO 8859-1 as Windows-1252.
Windows-1252 differs from ISO 8859-1 by using additional characters instead of control codes in the 0x80 to 0x9F range.
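To illustrate the difference, here is a scalar sketch of Windows-1252 to UTF-8 that only special-cases the 0x80 to 0x9F range. The names `kCp1252High` and `cp1252_to_utf8` are hypothetical. One assumption to flag: the five bytes that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to the C1 control of the same value, which is what the WHATWG Encoding Standard does; other decoders may reject them.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Code points for Windows-1252 bytes 0x80..0x9F.
// Assumption: undefined bytes map to the C1 control of the same value.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};

std::string cp1252_to_utf8(const std::uint8_t* data, std::size_t len) {
    std::string out;
    out.reserve(len);
    for (std::size_t i = 0; i < len; i++) {
        std::uint32_t cp = data[i];
        if (cp >= 0x80 && cp <= 0x9F) {
            cp = kCp1252High[cp - 0x80];  // the only range that differs from Latin-1
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {  // every table entry is below U+10000, so three bytes suffice
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Unlike Latin-1, where every non-ASCII byte becomes exactly two UTF-8 bytes, most bytes in this range expand to three.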
This is something I am interested in and I'll look into writing some code for the case "8-bit encoding to UTF-8/UTF-16/UTF-32". I don't have any time for it right now though.
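As a rough picture of what the general case could look like in scalar form: for an ASCII-based 8-bit encoding, only the upper 128 bytes need a table. The sketch below targets UTF-32 because it needs no variable-length output. `HighTable`, `single_byte_to_utf32`, and `latin1_high_table` are hypothetical names, and the Latin-1 table is just the identity mapping used as a stand-in.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Code points for bytes 0x80..0xFF of some ASCII-based single-byte encoding.
using HighTable = std::array<char32_t, 128>;

std::vector<char32_t> single_byte_to_utf32(const std::uint8_t* data,
                                           std::size_t len,
                                           const HighTable& high) {
    std::vector<char32_t> out(len);
    for (std::size_t i = 0; i < len; i++) {
        const std::uint8_t b = data[i];
        // ASCII passes through unchanged; everything else goes through the table.
        out[i] = (b < 0x80) ? static_cast<char32_t>(b) : high[b - 0x80];
    }
    return out;
}

// Example table: ISO/IEC 8859-1, where byte value and code point coincide.
HighTable latin1_high_table() {
    HighTable t{};
    for (std::size_t i = 0; i < t.size(); i++) {
        t[i] = static_cast<char32_t>(0x80 + i);
    }
    return t;
}
```

Supporting another encoding would then only mean supplying a different 128-entry table.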
RichardSteele @.***> wrote on Thu, Aug 29, 2024, 09:26:
Windows-1252 or CP-1252 would be nice for legacy support. That is what is behind MySQL's previous default character set of latin1. Also, the HTML5 standard says to treat ISO 8859-1 as Windows-1252.
Windows-1252 differs from ISO 8859-1 by using additional characters instead of control codes in the 0x80 to 0x9F range.
The different ISO encodings can be transcoded to/from UTF formats.
https://en.m.wikipedia.org/wiki/ISO/IEC_8859-1