Support for UTF-16 surrogate pairs is not interoperable

gregsdennis commented 1 year ago

~Unicode characters~ Surrogate pairs are not supported in .Net. See example here.

The main goal of this specification as stated is to provide an interoperable subset of major Regex implementations (from the abstract, emphasis mine):

This document specifies I-Regexp, a flavor of regular expressions that is limited in scope with the goal of interoperation across many different regular-expression libraries.

Mention of full unicode support needs to be ~removed~ downgraded to UTF-8 support.

Introduction

I-Regexp supports the entire repertoire of Unicode characters.
I-Regexp Syntax

In particular, full Unicode support is REQUIRED; the implementation MUST NOT limit itself to 7- or 8-bit character sets such as ASCII and MUST support the Unicode character property set in character classes.

Ref: https://github.com/jsonpath-standard/jsonpath-compliance-test-suite/pull/30/files#r1162157209

I'm not sure the reasoning behind adding this requirement.

timbray commented 1 year ago

Umm… https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding-introduction

It looks like the underlying APIs support Unicode just fine, although UTF-16 is annoying. I think your reference is some particular regexp implementation being buggy?

The chances of the IETF shipping an RFC dealing with textual data that excludes Unicode is more or less exactly zero. That would exclude the languages of a very large majority of Earth's population.

gregsdennis commented 1 year ago

Yes, I believe the problem is that specifically UTF-16 is not supported (Ref, OtherRef). This is a common question among .Net users. (Example)

I think your reference is some particular regexp implementation being buggy?

I have only ever used the built-in Regex class for regular expressions.

gregsdennis commented 1 year ago

It also appears that Rust doesn't support UTF-16 either.

gregsdennis commented 1 year ago

I've updated the title and opening issue for more clarity.

timbray commented 1 year ago

Hmm… the network serialization format and the programming-language runtime format are often different things; for example, Java & .NET are UTF-16 internally, while Go is UTF-8. It would be very surprising (almost certainly a bug) for a UTF-16 surrogate to appear in stored or network-transmitted text. For that reason, I-JSON forbids UTF-16 surrogates in member names or string values.

So, the idea of JSONPath also forbidding surrogate codepoints strikes me as sensible. Full Unicode support does not require surrogate support, so there is no contradiction between forbidding surrogates and requiring full Unicode support.

Pardon me for over-reacting, I spent several years of my career forcing people who didn't want to to support Unicode.

gregsdennis commented 1 year ago

Pardon me for over-reacting, I spent several years of my career forcing people who didn't want to to support Unicode.

I have a preconception that UTF-8 = extended ASCII (256 chars), so I default "Unicode" to being surrogate pairs in my mind. I wasn't specific enough.

cabo commented 1 year ago

the problem is that specifically UTF-16 is not supported

Nobody needs UTF-16.

Both JSON and JSONPath are UTF-8. Surrogate pairs do not exist in UTF-8 (and definitely not individual surrogates).

Can you explain the problem in a way that somebody familiar with Unicode could understand?

cabo commented 1 year ago

Note that much of the confusion in the referenced articles is about regexes that operate on UTF-16 code units and not on Unicode scalar values. That kind of regexp is irrelevant for iregexp. As a simple indicator, note that there is no \p{Cs} in iregexp, because surrogates do not occur in the inputs; any discussion that uses \p{Cs} probably is in the confused set.

gregsdennis commented 1 year ago

Both JSON and JSONPath are UTF-8. Surrogate pairs do not exist in UTF-8 (and definitely not individual surrogates).

But the test in question uses an escaped pair sequence. That escaped sequence itself is UTF-8.

That test I referenced says the escaped pair sequence denotes a single UTF-16 code point that apparently the regex is supposed to consider as a single character, matched by ..

cabo commented 1 year ago

Discussion of UTF-16 is off-topic for this specification. (I also don't know what the "test in question" is; several tests have been offered to lay bare the implementation limitations involved in certain platforms.)

gregsdennis commented 1 year ago

Discussion of UTF-16 is off-topic for this specification.

How so? The spec clearly states "full Unicode." I take that to mean UTF-16 is included.

I also don't know what the "test in question" is

The test in question is the test I linked to in the opening comment (the "Ref").

timbray commented 1 year ago

Carsten, if UTF-16 surrogates "do not exist" in UTF-8, why would I-JSON explicitly forbid them? It is perfectly easy to imagine a scenario where such things are accidentally generated. Lots of people write their own UTF-8 encoders because it's "easy". Also, because of Java's annoying 16-bit "char" data type, it has been well-known for people to do stupid things like "just take the first ten characters of this string for display in a fixed size column via s.substring(0, 10) - oops!

On the face of it, I think it's perfectly reasonable to say that both the JSONpath expression and the JSON value to which it's applied MUST NOT contain any surrogate codepoints.

On Mon, Apr 10, 2023 at 10:27 PM cabo @.***> wrote:

the problem is that specifically UTF-16 is not supported

Nobody needs UTF-16.

Both JSON and JSONPath are UTF-8. Surrogate pairs do not exist in UTF-8 (and definitely not individual surrogates).

Can you explain the problem in a way that somebody familiar with Unicode could understand?

— Reply to this email directly, view it on GitHub https://github.com/ietf-wg-jsonpath/iregexp/issues/22#issuecomment-1502707727, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAEJE7XEDUCTELBBOB5HW3XATTVXANCNFSM6AAAAAAWZQ3W4Q . You are receiving this because you commented.Message ID: @.***>

cabo commented 1 year ago

Carsten, if UTF-16 surrogates "do not exist" in UTF-8, why would I-JSON explicitly forbid them?

I don't know why you did this, but I expect the reason was "Because people don't read".

https://datatracker.ietf.org/doc/html/rfc3629#page-5 Note that this is a full Internet Standard from 2003.

It is perfectly easy to imagine a scenario where such things are accidentally generated. Lots of people write their own UTF-8 encoders because it's "easy". Also, because of Java's annoying 16-bit "char" data type, it has been well-known for people to do stupid things like "just take the first ten characters of this string for display in a fixed size column via s.substring(0, 10) - oops!

I am fully aware of the ways to generate non-Unicode in languages with legacy 16-bit text string models. (You can do the same with UTF-8, but then it is way more obvious when you do, so most people don't.)

On the face of it, I think it's perfectly reasonable to say that both the JSONpath expression and the JSON value to which it's applied MUST NOT contain any surrogate codepoints.

See, this is how I-JSON confused you. This is called the restatement antipattern. Restating a normative statement from a reference as if it were a new statement makes people believe the reference didn't already say this, or, worse, the new document says something new and different from the reference.

When you need to restate, you need to qualify the restatement very explicitly as such. I don't think page 5 of RFC 3629 needs a lot of restating, though.

cabo commented 1 year ago

Discussion of UTF-16 is off-topic for this specification.

How so? The spec clearly states "full Unicode." I take that to mean UTF-16 is included.

The UTFs are Unicode transformation formats, there are weird ones you don't need to support. If people think that "full Unicode support" needs to include every single specification of the Unicode organization, then maybe we need to be more explicit.

I also don't know what the "test in question" is

The test in question is the test I linked to in the opening comment (the "Ref").

Ah. You said something about surrogate pairs. There are no surrogate pairs here. (You might confuse JSON's abominable hex escape syntax for non-BMP characters with surrogate pairs, but they aren't.)

Interestingly, ChatGPT tells me C# doesn't have any problem with non-BMP characters.

It offers code like

Regex regex = new Regex(@"[\x{1F914}\x{1F602}]");

to match either a Thinking Face or a Face With Tears of Joy, which is not possible if these aren't characters in C#. Is ChatGPT hallucinating (as it something does)?

cabo commented 1 year ago

I created PR #23 to clarify "full Unicode support". Anything else that needs to be clarified from this issue?

gregsdennis commented 1 year ago

Interestingly, ChatGPT tells me C# doesn't have any problem with non-BMP characters.

It offers code like
Regex regex = new Regex(@"[\x{1F914}\x{1F602}]");
to match either a Thinking Face or a Face With Tears of Joy, which is not possible if these aren't characters in C#. Is ChatGPT hallucinating (as it something does)?

First, that regex isn't valid.

It compiles, but running that line produces an exception:

System.Text.RegularExpressions.RegexParseException : Invalid pattern '[\x{1F914}\x{1F602}]' at offset 4. Insufficient hexadecimal digits.
   at System.Text.RegularExpressions.RegexParser.ScanHex(Int32 c)
   at System.Text.RegularExpressions.RegexParser.ScanCharEscape()
   at System.Text.RegularExpressions.RegexParser.ScanCharClass(Boolean caseInsensitive, Boolean scanOnly)
   at System.Text.RegularExpressions.RegexParser.CountCaptures()
   at System.Text.RegularExpressions.RegexParser.Parse(String pattern, RegexOptions options, CultureInfo culture)
   at System.Text.RegularExpressions.Regex..ctor(String pattern, RegexOptions options, TimeSpan matchTimeout, Boolean addToCache)
   at System.Text.RegularExpressions.Regex..ctor(String pattern)
   at ...

Yes, non-BMP chars are supported in C#. I can include that char directly in a regex and it works:

var regex = new Regex(@"😁");
var text = @"😁";
Assert.IsTrue(regex.IsMatch(text));

It even matches it as a single char (for . anyway):

var regex = new Regex(@".");
var text = @"😁";
Assert.IsTrue(regex.IsMatch(text));

It looks like .Net's Regex engine might just be buggy.

This passes

var regex = new Regex(@"^🐲*$");
var jsonText = JsonNode.Parse("\"🐲\"");
var text = jsonText.GetValue<string>();
Assert.IsTrue(regex.IsMatch(text));

but this doesn't

var regex = new Regex(@"^🐲*$");
var jsonText = JsonNode.Parse("\"🐲🐲\"");
var text = jsonText.GetValue<string>();
Assert.IsTrue(regex.IsMatch(text));

Basically, the * doesn't match on the multiple bytes correctly. The behavior is the same whether the 🐲 is explicit as I have above or hex-encoded with a surrogate pair.

Probably just a .Net issue. Still I can't fully support i-regexp for this limitation. I expect it's probablly sufficient to call that out in my docs.

cabo commented 1 year ago

var regex = new Regex(@"^🐲*$");
var jsonText = JsonNode.Parse("\"🐲🐲\"");

That can be fixed by replacing 🐲 with (🐲). However, a fix is not quite as easy with characters in character classes, [abc🐲] would need to be replaced by ([abc]|🐲). Doing negative character classes probably requires using lookahead assertions. This all can be done by taking apart and putting back together the RE, but is a far cry from the relatively simple textual substitutions that PCRE or ECMAScript require.

gregsdennis commented 1 year ago

That can be fixed by replacing 🐲 with (🐲).

It's fine that there is a workaround, but if just 🐲 is expected to work, then I can't expect my users to know to use (🐲).

gregsdennis commented 1 year ago

Didn't mean to close.

cabo commented 1 year ago

That can be fixed by replacing 🐲 with (🐲).

It's fine that there is a workaround, but if just 🐲 is expected to work, then I can't expect my users to know to use (🐲).

This is not for specifiers of iregexp REs -- they do not have to know that certain platforms have certain problems. It would be something that an iregexp to dotnet RE translator would do (just like how an iregexp to ECMAscript RE translator would translate unescaped dots).

gregsdennis commented 1 year ago

That can be fixed by replacing 🐲 with (🐲).

It can't be fixed for the problem in question, though.

The test is checking that a.b matches (e.g.) a🐲b. .Net's Regex can't do this because the . won't match on the 🐲. There's no workaround that I can think of for this. There's no translation of . that will make .Net's Regex accept a non-BMP char. (It seems to work fine when the . is on its own, though, which is weird.)

cabo commented 1 year ago

.Net's Regex can't do this because the . won't match on the 🐲

So your iregexp to dotnet translator needs to turn (unescaped, outside character classes) . into (\P{Cs}|\p{Cs}\p{Cs}) with a little bit of lookahead assertion added to remove \r and \n.

I still can't believe dotnet doesn't have a /re/u equivalent.

gregsdennis commented 1 year ago

So your iregexp to dotnet translator needs to turn (unescaped, outside character classes) . into (\P{Cs}|\p{Cs}\p{Cs}) with a little bit of lookahead assertion added to remove \r and \n.

I don't think that most developers (specifically implementors of JSON Path) are going to be well-versed enough in regular expressions to be able to figure this kind of thing out. It works, but I don't understand what \P{Cs} is. This seems like a rather advanced regex use to me.

cabo commented 1 year ago

I don't understand what \P{Cs} is

You are excused, because that is not part of iregexp (surrogate code points are irrelevant to iregexp).

(\p{..} and \P{..} are important when regular expressions are used with Unicode, as they express concepts such as numbers, letters, symbols etc.)

I'm afraid there is no way to remove the complexity of basing an iregexp implementation on a regex flavor with limited Unicode support. Fortunately, iregexp is easy to parse (= take apart), and a tool to put back together the iregexp in another flavor is not that complex either. I would prefer to be able to point to mine here, but I haven't done the work yet...

ietf-wg-jsonpath / iregexp

Support for UTF-16 surrogate pairs is not interoperable #22