simdutf / simdutf

Unicode routines (UTF8, UTF16, UTF32) and Base64: billions of characters per second using SSE2, AVX2, NEON, AVX-512, RISC-V Vector Extension. Part of Node.js, WebKit/Safari and Bun.
https://simdutf.github.io/simdutf/
Apache License 2.0

ISO <-> UTF transcoding #159

Open lemire opened 2 years ago

lemire commented 2 years ago

The different ISO encodings can be transcoded to/from UTF formats.
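For ISO/IEC 8859-1 specifically, each byte value equals the Unicode code point it encodes, so transcoding to UTF-8 needs no table. The following is a minimal scalar sketch of that mapping, not the library's implementation; the function name `latin1_to_utf8_scalar` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical scalar reference, not simdutf code.
// A Latin-1 byte c is the code point U+00cc, so:
//   c < 0x80  -> one UTF-8 byte (unchanged)
//   c >= 0x80 -> two UTF-8 bytes: 110000xx 10xxxxxx
std::string latin1_to_utf8_scalar(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte expands to two
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

For example, the Latin-1 byte 0xE9 (é) becomes the two UTF-8 bytes 0xC3 0xA9.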

https://en.m.wikipedia.org/wiki/ISO/IEC_8859-1

clausecker commented 2 years ago

We've talked about this before. It would be interesting to have a transcoder for the general case “single byte ASCII based encoding.” I can try to do that once I'm done with the writeup.
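To make the "single byte ASCII based encoding" case concrete, here is a rough scalar sketch: bytes below 0x80 pass through, and bytes at or above 0x80 go through a 128-entry table that defines the encoding. The names `HighTable`, `make_latin1_table` and `single_byte_to_utf16` are hypothetical, and the sketch assumes every target character fits in one UTF-16 code unit.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical types and names for illustration only.
// One UTF-16 code unit per high byte (0x80..0xFF); assumes BMP targets.
using HighTable = std::array<char16_t, 128>;

// ISO/IEC 8859-1 as a table: byte value == code point.
HighTable make_latin1_table() {
    HighTable t{};
    for (std::size_t i = 0; i < t.size(); ++i) {
        t[i] = static_cast<char16_t>(0x80 + i);
    }
    return t;
}

// Generic transcoder: the table alone selects the source encoding.
std::u16string single_byte_to_utf16(const std::string& in,
                                    const HighTable& high) {
    std::u16string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        out.push_back(c < 0x80 ? static_cast<char16_t>(c) : high[c - 0x80]);
    }
    return out;
}
```

Another encoding is obtained by editing table entries. For instance, starting from the Latin-1 table and setting entry 0xA4 - 0x80 to U+20AC gives the euro sign at byte 0xA4, as in ISO/IEC 8859-15.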

lemire commented 2 years ago

Let me add that the idea should be credited to @clausecker

lemire commented 2 years ago

@clausecker If you assume good AVX-512 support, it seems that vpermi2b would go a long way on this problem.

Supporting it efficiently with AVX/NEON is a fun challenge.

Jarred-Sumner commented 1 year ago

Bun would use this. JavaScript strings are either Latin-1 or UTF-16. We frequently need to convert from UTF-8 (from disk/network) to either Latin-1 or UTF-16. Currently, we validate ASCII with errors. If the input is ASCII, we do a memcpy; if it is not, we convert to UTF-16 starting at the first non-ASCII character. This works okay.
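The first step of the strategy described above, locating the first non-ASCII byte, can be sketched as follows. This is an illustrative scalar version that checks eight bytes at a time, not Bun's or simdutf's code; the name `first_non_ascii` is hypothetical. If the result equals `len`, the input is pure ASCII and can be copied as-is; otherwise the UTF-16 conversion (not shown) would start at the returned index.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: index of the first byte >= 0x80, or len if none.
std::size_t first_non_ascii(const char* data, std::size_t len) {
    std::size_t i = 0;
    // Test 8 bytes per step: any set high bit means a non-ASCII byte.
    for (; i + 8 <= len; i += 8) {
        std::uint64_t w;
        std::memcpy(&w, data + i, 8);  // avoids unaligned-access issues
        if (w & 0x8080808080808080ULL) break;
    }
    // Finish byte by byte: pins down the exact position inside the
    // word that triggered the break, and handles the tail.
    for (; i < len; ++i) {
        if (static_cast<unsigned char>(data[i]) >= 0x80) return i;
    }
    return len;
}
```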

lemire commented 1 year ago

Feedback as to the motivation of a feature is important to us.

lemire commented 1 year ago

Computing the UTF-8 size of a Latin 1 string quickly (AVX edition) https://lemire.me/blog/2023/02/16/computing-the-utf-8-size-of-a-latin-1-string-quickly-avx-edition/
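For reference, the quantity in question has a simple scalar definition: every Latin-1 byte yields one UTF-8 byte, plus one more if its high bit is set. The sketch below is a plain baseline under that definition, not the vectorized code from the blog post; the function name is hypothetical.

```cpp
#include <cstddef>

// Hypothetical scalar baseline.
// UTF-8 length = input length + number of bytes >= 0x80.
std::size_t utf8_length_from_latin1_scalar(const char* data, std::size_t len) {
    std::size_t extra = 0;
    for (std::size_t i = 0; i < len; ++i) {
        // The high bit is 1 exactly when the byte needs two UTF-8 bytes.
        extra += static_cast<unsigned char>(data[i]) >> 7;
    }
    return len + extra;
}
```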

lemire commented 11 months ago

We currently fully support Latin-1 (ISO/IEC 8859-1), the most popular ISO format, in our main branch.

It is unclear whether we should extend to other European ISO formats. My suspicion is that it would see little use.

I am thinking about closing this issue.

RichardSteele commented 2 weeks ago

Windows-1252 or CP-1252 would be nice for legacy support. That is what is behind MySQL's previous default character set of latin1. Also, the HTML5 standard says to treat ISO 8859-1 as Windows-1252.

Windows-1252 differs from ISO 8859-1 by using additional characters instead of control codes in the 0x80 to 0x9F range.
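A small sketch of what that difference means for a transcoder: outside 0x80 to 0x9F the byte value is the code point, and inside that range a 32-entry table applies. The names below are hypothetical. The five bytes Windows-1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped here to the C1 control of the same value, which is the choice made by the WHATWG Encoding Standard; other decoders may reject them instead.

```cpp
#include <cstdint>
#include <string>

// Hypothetical table: Windows-1252 bytes 0x80..0x9F as UTF-16 code units.
// Unassigned bytes map to the same-valued C1 control (WHATWG convention).
static const char16_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};

// Hypothetical scalar transcoder, Windows-1252 to UTF-16.
std::u16string windows1252_to_utf16(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (c >= 0x80 && c <= 0x9F) {
            out.push_back(kCp1252High[c - 0x80]);     // the differing range
        } else {
            out.push_back(static_cast<char16_t>(c));  // same as ISO 8859-1
        }
    }
    return out;
}
```

For example, byte 0x80 decodes to U+20AC (euro sign) rather than a control code, while byte 0xE9 still decodes to U+00E9.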

clausecker commented 2 weeks ago

This is something I am interested in and I'll look into writing some code for the case "8-bit encoding to UTF-8/UTF-16/UTF-32". I don't have any time for it right now though.
