unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Decide expressiveness of UnicodeSet parsing errors #3558

Open skius opened 1 year ago

skius commented 1 year ago

How expressive should UnicodeSet parse errors be? Does it suffice to show which character at which position caused the problem, or should we also say precisely what we expected? (e.g., "\xag<-- error: was parsing an \x-escape, expected precisely two hex-characters, got 'g'")

EDIT: Examples of current parse errors: https://github.com/unicode-org/icu4x/pull/3547/files#diff-1ab141f559ba2ebd644683b4cf5255a30d1e0a7f949b9cf950522f1a8b0cbcc5R1146
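
To make the two options concrete, here is a minimal sketch of an error that carries the position and offending character, plus an optional expectation. All names (`SketchParseError`, `Expected`, `parse_x_escape`) are hypothetical and are not taken from the icu4x code base; the message wording is likewise only an assumption.

```rust
use std::fmt;

/// Hypothetical: what the parser was expecting at the failing position.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Expected {
    HexDigit,
    ClosingBracket,
}

/// Hypothetical error type. With `expected: None` it is the "position only"
/// variant; with `Some(..)` it is the more expressive variant.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SketchParseError {
    offset: usize,       // byte offset of the offending character
    found: Option<char>, // None means end of input
    expected: Option<Expected>,
}

impl fmt::Display for SketchParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "error at offset {}: ", self.offset)?;
        match self.found {
            Some(c) => write!(f, "unexpected '{}'", c)?,
            None => write!(f, "unexpected end of input")?,
        }
        match self.expected {
            Some(Expected::HexDigit) => write!(f, ", expected a hex digit"),
            Some(Expected::ClosingBracket) => write!(f, ", expected ']'"),
            None => Ok(()),
        }
    }
}

/// Parses exactly two hex digits of an `\x` escape, starting at byte `start`
/// (the position right after `\x`).
fn parse_x_escape(src: &str, start: usize) -> Result<u32, SketchParseError> {
    let mut value = 0u32;
    let mut chars = src[start..].char_indices();
    for _ in 0..2 {
        match chars.next() {
            Some((i, c)) => match c.to_digit(16) {
                Some(d) => value = value * 16 + d,
                None => {
                    return Err(SketchParseError {
                        offset: start + i,
                        found: Some(c),
                        expected: Some(Expected::HexDigit),
                    })
                }
            },
            None => {
                return Err(SketchParseError {
                    offset: src.len(),
                    found: None,
                    expected: Some(Expected::HexDigit),
                })
            }
        }
    }
    Ok(value)
}
```

Calling `parse_x_escape(r"\xag", 2)` fails at the `g`, and the error records both where it failed and what was expected. Dropping the `expected` field gives the cheaper "character and position only" variant.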

skius commented 1 year ago

@sffc is considering holding a reference to the source string in the ParseError itself: https://github.com/unicode-org/icu4x/pull/3547#discussion_r1234726864
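
A rough sketch of what an error that borrows the source string could look like follows. `BorrowedParseError` and `snippet` are hypothetical names used only for illustration; the design actually discussed in the linked review comment may differ.

```rust
/// Hypothetical error that borrows the pattern it was produced from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BorrowedParseError<'a> {
    source: &'a str,
    /// Byte offset of the offending character; equals `source.len()` when
    /// the error is an unexpected end of input.
    offset: usize,
}

impl<'a> BorrowedParseError<'a> {
    /// Renders the input up to and including the offending character,
    /// followed by an error marker.
    fn snippet(&self) -> String {
        let end = match self.source[self.offset..].chars().next() {
            Some(c) => self.offset + c.len_utf8(),
            None => self.source.len(),
        };
        format!("{}<-- error", &self.source[..end])
    }
}
```

The trade-off in this sketch is that the lifetime `'a` ties the error to the input, so it cannot outlive the source string. An error that stores only the offset stays `'static` but leaves rendering the snippet to the caller.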

sffc commented 1 year ago

Discuss with:

Optional:

skius commented 1 year ago

#3670 introduces a MainToken-based main parse loop. This means that in cases like [a-{hello\ world}] we have all the data we need to say "error: unexpected string, expected single code point", so improving these cases would be relatively simple.

It also introduces an edge case with an objectively bad error message:

This is bad because [a-\x{62}] is valid; in other words, \ is not actually unexpected. What actually causes the error is that a multi-code-point escape is used as part of a range.
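
As a sketch of how a token-based loop could report both cases precisely, assume a token type that distinguishes single code points, string literals, and escapes that yield several code points. `SketchToken`, `RangeEndError`, and `range_end` are hypothetical; the real MainToken may be shaped differently, and the second message is only a suggestion.

```rust
/// Hypothetical token type for illustration; the actual `MainToken` may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
enum SketchToken {
    /// A single code point, written literally or as a single-code-point escape.
    CodePoint(char),
    /// A string literal inside braces.
    StringLiteral(String),
    /// An escape that produced more than one code point.
    MultiCodePointEscape(Vec<char>),
}

/// Hypothetical errors for the token that follows `-` in a range.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RangeEndError {
    UnexpectedString,
    MultiCodePointEscape,
}

impl RangeEndError {
    fn message(&self) -> &'static str {
        match self {
            RangeEndError::UnexpectedString => {
                "unexpected string, expected single code point"
            }
            RangeEndError::MultiCodePointEscape => {
                "multi-code-point escape cannot be used as a range bound"
            }
        }
    }
}

/// Resolves the upper bound of a range from the token after `-`.
fn range_end(token: &SketchToken) -> Result<char, RangeEndError> {
    match token {
        SketchToken::CodePoint(c) => Ok(*c),
        SketchToken::StringLiteral(_) => Err(RangeEndError::UnexpectedString),
        SketchToken::MultiCodePointEscape(_) => Err(RangeEndError::MultiCodePointEscape),
    }
}
```

Because the error is decided at the token level in this sketch, the message can name the real cause (a multi-code-point escape in a range) instead of blaming the leading \ of the escape.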