riscv / riscv-bitmanip

Working draft of the proposed RISC-V Bitmanipulation extension
https://jira.riscv.org/browse/RVG-122
Creative Commons Attribution 4.0 International

UTF encoding/decoding instructions #178

Open CAFxX opened 2 years ago

CAFxX commented 2 years ago

Sorry for the drive-by question. I tried searching the mailing list and the existing issues but could not find any previous discussion about this. If this has already been answered, any pointers to the relevant resource(s) would be greatly appreciated.


Was it ever considered or discussed to add instructions for encoding/decoding UTF-8 (and ideally also UTF-16 and UTF-32), likely in a new dedicated subset of this extension? Most text processed today is in one of those encodings[^t], and nothing on the horizon suggests that this will change. Encoding and decoding UTF is not especially complicated without dedicated instructions (and the existing bitmanip instructions can already help). Still, given how ubiquitous these encodings are and how simple the coding process is (at their heart, UTF-8 and UTF-16 are simple-to-decode VLEs, i.e. variable-length encodings), dedicated support may bring efficiency benefits[^e].

Just for the sake of clarity: in its simplest form (covering only UTF-8 → codepoint decoding) this would require a single instruction that takes a 4-byte input (the maximum length of a UTF-8 encoded codepoint, likely obtained via an unaligned read from memory) and returns three things: the decoded Unicode codepoint (which fits in 3 bytes), how many bytes of the input were consumed (between 1 and 4, inclusive), and whether the decoding encountered an error. The need to return multiple values is probably the biggest roadblock to inclusion in the ISA, although I suspect there may be workarounds.
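To make the intended semantics concrete, here is a minimal software sketch of what such a decode instruction might compute. Everything in it is an assumption for illustration, not part of any RISC-V proposal: the names `utf8_decode_step`, `load_word` and `pack_result`, the little-endian byte order (first byte in bits 7:0), the choice to report U+FFFD with one byte consumed on error, and the packed layout (codepoint in bits 20:0, length in bits 26:24, error flag in bit 31), which is just one conceivable workaround for the multiple-return-values problem.

```python
REPLACEMENT = 0xFFFD  # assumption: codepoint reported when decoding fails


def load_word(data: bytes) -> int:
    """Model an unaligned 4-byte little-endian load, zero-padded at the end."""
    return int.from_bytes(data[:4].ljust(4, b"\0"), "little")


def utf8_decode_step(word: int) -> tuple[int, int, bool]:
    """Decode one codepoint from a 32-bit word (first byte in bits 7:0).

    Returns (codepoint, bytes_consumed, error).
    """
    b = [(word >> (8 * i)) & 0xFF for i in range(4)]
    lead = b[0]
    if lead < 0x80:                      # ASCII: 1 byte
        return lead, 1, False
    if 0xC2 <= lead <= 0xDF:             # 2-byte sequence
        length, cp, minimum = 2, lead & 0x1F, 0x80
    elif 0xE0 <= lead <= 0xEF:           # 3-byte sequence
        length, cp, minimum = 3, lead & 0x0F, 0x800
    elif 0xF0 <= lead <= 0xF4:           # 4-byte sequence
        length, cp, minimum = 4, lead & 0x07, 0x10000
    else:                                # stray continuation or invalid lead
        return REPLACEMENT, 1, True
    for cont in b[1:length]:
        if cont & 0xC0 != 0x80:          # not a continuation byte
            return REPLACEMENT, 1, True
        cp = (cp << 6) | (cont & 0x3F)
    # Reject overlong forms, values above U+10FFFF, and UTF-16 surrogates.
    if cp < minimum or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return REPLACEMENT, 1, True
    return cp, length, False


def pack_result(cp: int, consumed: int, error: bool) -> int:
    """Hypothetical single-register encoding of the three results."""
    return cp | (consumed << 24) | (int(error) << 31)


def decode_all(data: bytes) -> list[int]:
    """Usage example: walk a buffer one decode step at a time."""
    out, pos = [], 0
    while pos < len(data):
        cp, consumed, _error = utf8_decode_step(load_word(data[pos:]))
        out.append(cp)
        pos += consumed
    return out
```

With a packed result like this, a decoding loop would need one load, one decode, and a couple of shifts or masks per codepoint. Whether that layout or an alternative (for example, two separate instructions for codepoint and length) is preferable is an open design question.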

Extensions to the simplest form could include, as hinted at above:

Going further, one could even imagine an expansion (outside of this extension) to a packed SIMD version[^p] of the same operations, able to {de|en}code multiple codepoints at the same time.

[^b]: i.e. the ability to decode a codepoint knowing where the last byte of the encoded representation is (instead of knowing where the first byte of the encoded representation is); this is useful when iterating backwards over text

[^t]: and this includes resources with text representation even if not exclusively meant for direct human consumption, like JSON, CSV, HTML, and other source code

[^e]: while the English-speaking world may have historically been fine assuming that most text would be quickly parseable in the ASCII subset of UTF-8, so the need for efficient handling of non-ASCII codepoints was lesser, this has never been true in the rest of the world

[^p]: or even a vector version, although this would possibly require a prohibitively high gate count for any reasonable VLEN

svobodnik commented 1 year ago

CISC-V when?