IIUC, the problem is that the ranges '\xe0' .. '\xef' and '\xf0' .. '\xf7' contain invalid characters (namely values above U+10FFFF and the surrogates U+D800 to U+DFFF, from the description of the PR). Instead of converting the characters in these ranges and then checking their validity, is it possible to change the ranges in the pattern match so that the invalid characters are extracted into separate match cases?
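To make the suggestion concrete, here is a hypothetical sketch (not the actual Zed_string.of_utf8 code) of what splitting the ranges could look like: the lead bytes whose sequences can decode to non-scalar values get their own match cases, so the generic 3- and 4-byte ranges only ever produce valid code points. The classify_lead name and the variant labels are made up for illustration.

```ocaml
(* Hypothetical sketch: the lead bytes whose sequences can decode to
   non-scalar values ('\xed' and '\xf4' .. '\xf7') are pulled out of
   the generic 3- and 4-byte ranges into separate match cases. *)
let classify_lead (c : char) =
  match c with
  | '\x00' .. '\x7f' -> `One          (* ASCII, always valid *)
  | '\xc0' .. '\xdf' -> `Two          (* U+0080 .. U+07FF, always valid *)
  | '\xe0' .. '\xec'
  | '\xee' .. '\xef' -> `Three        (* can never encode a surrogate *)
  | '\xed'           -> `Three_check  (* U+D000 .. U+DFFF: may be a surrogate *)
  | '\xf0' .. '\xf3' -> `Four         (* at most U+FFFFF, always valid *)
  | '\xf4'           -> `Four_check   (* U+100000 .. U+13FFFF: may exceed U+10FFFF *)
  | '\xf5' .. '\xf7' -> `Invalid      (* always above U+10FFFF *)
  | _                -> `Invalid      (* continuation byte or \xf8 .. \xff *)
```

With a split like this, only the `Three_check` and `Four_check` cases would need to look at the continuation bytes (or at the decoded value) before accepting the character.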
I think it is. Is it better than using Uchar.is_valid?
We currently have four ranges (\x00-\x7f, \xc0-\xdf, \xe0-\xef and \xf0-\xf7) representing the four UTF-8 encoding lengths (respectively 1 byte, 2 bytes, 3 bytes, and 4 bytes). Would adding new ranges to avoid the use of Uchar.is_valid make Zed_string.of_utf8 less understandable?
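For reference, here is a rough, self-contained sketch of the "four ranges plus Uchar.is_valid" shape being discussed. It is illustrative only: decode_one and Malformed are made-up names, and a real decoder also has to validate continuation bytes and reject overlong encodings.

```ocaml
exception Malformed of string

(* Illustrative only: decode one UTF-8 sequence starting at offset i,
   using the four lead-byte ranges, then validate the resulting code
   point with Uchar.is_valid. Continuation bytes and overlong
   encodings are not checked here. *)
let decode_one (s : string) (i : int) : Uchar.t * int =
  let byte k = Char.code s.[k] in
  let code, len =
    match s.[i] with
    | '\x00' .. '\x7f' -> byte i, 1
    | '\xc0' .. '\xdf' ->
      ((byte i land 0x1f) lsl 6) lor (byte (i + 1) land 0x3f), 2
    | '\xe0' .. '\xef' ->
      ((byte i land 0x0f) lsl 12)
      lor ((byte (i + 1) land 0x3f) lsl 6)
      lor (byte (i + 2) land 0x3f), 3
    | '\xf0' .. '\xf7' ->
      ((byte i land 0x07) lsl 18)
      lor ((byte (i + 1) land 0x3f) lsl 12)
      lor ((byte (i + 2) land 0x3f) lsl 6)
      lor (byte (i + 3) land 0x3f), 4
    | _ -> raise (Malformed "not a UTF-8 lead byte")
  in
  if Uchar.is_valid code then Uchar.of_int code, len
  else raise (Malformed "not a Unicode scalar value")
```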
> I think it is. Is it better than using Uchar.is_valid?
I'm not sure. Having a pattern match on the range and an additional validity check feels a bit strange: I would expect to see one or the other, but why check for the range if, in the end, the range contains invalid characters?
On the other hand, having a final check will be more robust and prevent similar issues in the future.
Unless someone else has an opinion on this (cc @emillon), I'll let you decide what's better here.
I think that there's something that's causing confusion but is not immediately clear from reading the code: on the one hand, there are byte ranges (what we're pattern matching on, the char type); on the other hand, there are codepoint ranges (based on Uchar.t). The set of valid Uchar.t values is 2 disjoint ranges (or, put differently, a range with a range-shaped hole in it). To convert from one to the other, some bit operations are necessary, so it would be awkward to do that at the byte level.
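Written out, that "range with a hole" is just the following predicate. This mirrors what Uchar.is_valid is specified to check; the function below is only an illustration, not the stdlib implementation.

```ocaml
(* A Unicode scalar value is any code point except the surrogates:
   U+0000 .. U+D7FF or U+E000 .. U+10FFFF. *)
let is_scalar_value (code : int) : bool =
  (0x0000 <= code && code <= 0xD7FF)
  || (0xE000 <= code && code <= 0x10FFFF)
```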
To make an analogy, decimal IPv4 addresses look like \d+\.\d+\.\d+\.\d+. You can try to restrict the regexp so that each component has at most 3 digits, or restrict the leading digit to {1,2}, but an explicit step with int is easier to reason about.
So yes, I think these changes are good.
According to the documentation (https://ocaml.org/p/zed/2.0.6/doc/Zed_string/index.html#val-of_utf8), Zed_string.of_utf8 should not raise Stdlib.Invalid_argument if the input is not valid. Uchar.is_valid returns false if the value is bigger than Uchar.max (U+10FFFF) or belongs to the invalid Unicode range (from U+D800 to U+DFFF). In this case, Zed_string.of_utf8 raises Zed_utf8.Invalid (input, "not a Unicode scalar value").
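As a small usage sketch of that behaviour (assuming of_utf8 : string -> Zed_string.t and the two-string Zed_utf8.Invalid exception cited above): the byte sequence \xed\xa0\x80 would decode to the surrogate U+D800, so it would be expected to be rejected with Zed_utf8.Invalid rather than with Stdlib.Invalid_argument.

```ocaml
(* Usage sketch under the assumptions above: "\xed\xa0\x80" encodes the
   surrogate U+D800, which is not a Unicode scalar value. *)
let () =
  match Zed_string.of_utf8 "\xed\xa0\x80" with
  | _ -> print_endline "decoded"
  | exception Zed_utf8.Invalid (a, b) ->
    Printf.printf "rejected: %s / %s\n" a b
```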