`of_utf8`: add `Uchar.is_valid` to check the input

ocaml-community / zed

Abstract engine for text edition in OCaml

Other

122 stars 16 forks source link

`of_utf8`: add `Uchar.is_valid` to check the input #51

Closed Lucccyo closed 1 year ago

Lucccyo commented 1 year ago

According to the documentation (https://ocaml.org/p/zed/2.0.6/doc/Zed_string/index.html#val-of_utf8), Zed_string.of_utf8 should not raise Stdlib.Invalid_argument if the input is not valid. Uchar.is_valid returns false if the value is bigger than Uchar.max (U+10FFFF) or belongs to the invalid Unicode range (from U+D800 to U+DFFF). In this case, Zed_string.of_utf8 raises Zed_utf8.Invalid (input, "not a Unicode scalar value").

tmattio commented 1 year ago

IIUC, the problem is that the ranges '\xe0' .. '\xef' and '\xf0' .. '\xf7' contain invalid characters (namely U+10FFFF and U+D800 to U+DFFF from the description of the PR).

Instead of converting the characters in these ranges and checking their validity, is it possible to change the ranges in the pattern match to extract the invalid characters into separate match cases?

Lucccyo commented 1 year ago

IIUC, the problem is that the ranges '\xe0' .. '\xef' and '\xf0' .. '\xf7' contain invalid characters (namely U+10FFFF and U+D800 to U+DFFF from the description of the PR).

Instead of converting the characters in these ranges and checking their validity, is it possible to change the ranges in the pattern match to extract the invalid characters into separate match cases?

I think it is. Is it better than using Uchar.is_valid?

Lucccyo commented 1 year ago

We currently have four ranges (\x00-\x7f, \xc0-\xdf,\xe0-\xef and \xf0-\xf7) representing the four UTF8 encoding lengths (respectively 1 byte, 2 bytes, 3 bytes, and 4 bytes). May adding new ranges to avoid the use of Uchar.is_valid make Zed_string.of_utf8 less understandable?

tmattio commented 1 year ago

I think it is. Is it better than using Uchar.is_valid?

I'm not sure. Having a pattern match on the range and an additional validity check feels a bit strange: I would expect to see one or the other, but why check for the range if, in the end, the range contains invalid characters?

On the other hand, having a final check will be more robust and prevent similar issues in the future.

Unless someone else has an opinion on this (cc @emillon), I'll let you decide what's better here

emillon commented 1 year ago

I think that there's something that's causing confusion but is not immediately clear from reading the code: on the one hand, there are byte ranges (what we're pattern matching on, the char type); on the other hand, there are codepoint ranges (based on Uchar.t). The set of valid Uchar.t values is 2 disjoint ranges (or, put differently, a range with a range-shaped hole in it). To convert from one to the other, some bit operations are necessary, so it would be awkward to do that at the byte level.

To make an analogy, decimal IPv4 addresses look like \d+\.\d+.\d+.\d+. You can somehow restrict the regexp to accept a maximum of 3 ints, or restrict the leading int to {1,2} but an explicit step with int is easier to reason about.

emillon commented 1 year ago

So yes I think these changes are good.