Closed gregsdennis closed 1 year ago
Umm… https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding-introduction
It looks like the underlying APIs support Unicode just fine, although UTF-16 is annoying. I think your reference is some particular regexp implementation being buggy?
The chances of the IETF shipping an RFC dealing with textual data that excludes Unicode is more or less exactly zero. That would exclude the languages of a very large majority of Earth's population.
It also appears that Rust doesn't support UTF-16 either.
I've updated the title and opening issue for more clarity.
Hmm… the network serialization format and the programming-language runtime format are often different things; for example, Java & .NET are UTF-16 internally, while Go is UTF-8. It would be very surprising (almost certainly a bug) for a UTF-16 surrogate to appear in stored or network-transmitted text. For that reason, I-JSON forbids UTF-16 surrogates in member names or string values.
So, the idea of JSONPath also forbidding surrogate codepoints strikes me as sensible. Full Unicode support does not require surrogate support, so there is no contradiction between forbidding surrogates and requiring full Unicode support.
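The distinction between the runtime format and the serialization format can be seen with the standard .NET encoding APIs. The following is only a sketch (the variable names are illustrative, not from this thread): a non-BMP character occupies two UTF-16 code units in a .NET string but becomes one four-byte sequence in UTF-8, and a strict UTF-8 encoder refuses a lone surrogate.

```csharp
using System;
using System.Text;

// Strict encoder: no BOM, and throw on invalid input instead of substituting U+FFFD.
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

string dragon = "\uD83D\uDC32"; // U+1F432 written as a UTF-16 surrogate pair
byte[] bytes = strictUtf8.GetBytes(dragon);
Console.WriteLine(dragon.Length);                // 2 (UTF-16 code units)
Console.WriteLine(BitConverter.ToString(bytes)); // F0-9F-90-B2 (one UTF-8 sequence)

bool loneSurrogateRejected = false;
try
{
    strictUtf8.GetBytes("\uD83D"); // high surrogate without its partner
}
catch (EncoderFallbackException)
{
    loneSurrogateRejected = true;
}
Console.WriteLine(loneSurrogateRejected); // True
```

No surrogate appears in the encoded bytes; the pair exists only in the in-memory UTF-16 representation.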
Pardon me for over-reacting, I spent several years of my career forcing people who didn't want to to support Unicode.
Pardon me for over-reacting, I spent several years of my career forcing people who didn't want to to support Unicode.
I have a preconception that UTF-8 = extended ASCII (256 chars), so I default "Unicode" to being surrogate pairs in my mind. I wasn't specific enough.
the problem is that specifically UTF-16 is not supported
Nobody needs UTF-16.
Both JSON and JSONPath are UTF-8. Surrogate pairs do not exist in UTF-8 (and definitely not individual surrogates).
Can you explain the problem in a way that somebody familiar with Unicode could understand?
Note that much of the confusion in the referenced articles is about regexes that operate on UTF-16 code units and not on Unicode scalar values. That kind of regexp is irrelevant for iregexp. As a simple indicator, note that there is no \p{Cs} in iregexp, because surrogates do not occur in the inputs; any discussion that uses \p{Cs} probably is in the confused set.
Both JSON and JSONPath are UTF-8. Surrogate pairs do not exist in UTF-8 (and definitely not individual surrogates).
But the test in question uses an escaped pair sequence. That escaped sequence itself is UTF-8.
That test I referenced says the escaped pair sequence denotes a single UTF-16 code point that apparently the regex is supposed to consider as a single character, matched by the . wildcard.
Discussion of UTF-16 is off-topic for this specification. (I also don't know what the "test in question" is; several tests have been offered to lay bare the implementation limitations involved in certain platforms.)
Discussion of UTF-16 is off-topic for this specification.
How so? The spec clearly states "full Unicode." I take that to mean UTF-16 is included.
I also don't know what the "test in question" is
The test in question is the test I linked to in the opening comment (the "Ref").
Carsten, if UTF-16 surrogates "do not exist" in UTF-8, why would I-JSON explicitly forbid them? It is perfectly easy to imagine a scenario where such things are accidentally generated. Lots of people write their own UTF-8 encoders because it's "easy". Also, because of Java's annoying 16-bit "char" data type, people are well known to do stupid things like "just take the first ten characters of this string for display in a fixed-size column" via s.substring(0, 10). Oops!
On the face of it, I think it's perfectly reasonable to say that both the JSONpath expression and the JSON value to which it's applied MUST NOT contain any surrogate codepoints.
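The substring accident described above is easy to reproduce in C#, whose strings use the same 16-bit code unit model as Java. This is a hedged sketch: `TruncateSafely` is a hypothetical helper written for this illustration, not a .NET API.

```csharp
using System;

string text = "abc\uD83D\uDC32def"; // "abc🐲def": 7 scalar values, 8 UTF-16 code units

// A naive cut after four code units leaves a dangling high surrogate.
string naive = text.Substring(0, 4);
Console.WriteLine(char.IsHighSurrogate(naive[naive.Length - 1])); // True

// Hypothetical helper: back off by one code unit if the cut would split a pair.
string TruncateSafely(string s, int maxUnits)
{
    if (maxUnits >= s.Length) return s;
    int end = maxUnits;
    if (end > 0 && char.IsHighSurrogate(s[end - 1]) && char.IsLowSurrogate(s[end])) end--;
    return s.Substring(0, end);
}

string safe = TruncateSafely(text, 4);
Console.WriteLine(safe); // abc
```

Serializing `naive` as it stands is how an unpaired surrogate can end up in stored or transmitted text.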
Carsten, if UTF-16 surrogates "do not exist" in UTF-8, why would I-JSON explicitly forbid them?
I don't know why you did this, but I expect the reason was "Because people don't read".
https://datatracker.ietf.org/doc/html/rfc3629#page-5 Note that this is a full Internet Standard from 2003.
It is perfectly easy to imagine a scenario where such things are accidentally generated. Lots of people write their own UTF-8 encoders because it's "easy". Also, because of Java's annoying 16-bit "char" data type, people are well known to do stupid things like "just take the first ten characters of this string for display in a fixed-size column" via s.substring(0, 10). Oops!
I am fully aware of the ways to generate non-Unicode in languages with legacy 16-bit text string models. (You can do the same with UTF-8, but then it is way more obvious when you do, so most people don't.)
On the face of it, I think it's perfectly reasonable to say that both the JSONpath expression and the JSON value to which it's applied MUST NOT contain any surrogate codepoints.
See, this is how I-JSON confused you. This is called the restatement antipattern. Restating a normative statement from a reference as if it were a new statement makes people believe the reference didn't already say this, or, worse, the new document says something new and different from the reference.
When you need to restate, you need to qualify the restatement very explicitly as such. I don't think page 5 of RFC 3629 needs a lot of restating, though.
Discussion of UTF-16 is off-topic for this specification.
How so? The spec clearly states "full Unicode." I take that to mean UTF-16 is included.
The UTFs are Unicode transformation formats; there are weird ones you don't need to support. If people think that "full Unicode support" needs to include every single specification of the Unicode organization, then maybe we need to be more explicit.
I also don't know what the "test in question" is
The test in question is the test I linked to in the opening comment (the "Ref").
Ah. You said something about surrogate pairs. There are no surrogate pairs here. (You might confuse JSON's abominable hex escape syntax for non-BMP characters with surrogate pairs, but they aren't.)
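To make the point about JSON's escape syntax concrete, here is a small sketch using System.Text.Json (assuming .NET 6 or later; the variable names are illustrative). The JSON text contains only ASCII characters, and after parsing the two escapes denote a single Unicode scalar value.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.Json.Nodes;

// The JSON text is the ASCII sequence "\uD83D\uDC32", including the quotes.
string jsonText = "\"\\uD83D\\uDC32\"";
string value = JsonNode.Parse(jsonText)!.GetValue<string>();

// Rune enumerates Unicode scalar values rather than UTF-16 code units.
Rune[] runes = value.EnumerateRunes().ToArray();
Console.WriteLine(runes.Length);            // 1
Console.WriteLine($"U+{runes[0].Value:X}"); // U+1F432
Console.WriteLine(value == "🐲");           // True
```

That `value.Length` is 2 reflects the .NET string representation, not the JSON data model.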
Interestingly, ChatGPT tells me C# doesn't have any problem with non-BMP characters.
It offers code like
Regex regex = new Regex(@"[\x{1F914}\x{1F602}]");
to match either a Thinking Face or a Face With Tears of Joy, which is not possible if these aren't characters in C#. Is ChatGPT hallucinating (as it sometimes does)?
I created PR #23 to clarify "full Unicode support". Anything else that needs to be clarified from this issue?
Interestingly, ChatGPT tells me C# doesn't have any problem with non-BMP characters.
It offers code like
Regex regex = new Regex(@"[\x{1F914}\x{1F602}]");
to match either a Thinking Face or a Face With Tears of Joy, which is not possible if these aren't characters in C#. Is ChatGPT hallucinating (as it sometimes does)?
First, that regex isn't valid.
It compiles, but running that line produces an exception:
System.Text.RegularExpressions.RegexParseException : Invalid pattern '[\x{1F914}\x{1F602}]' at offset 4. Insufficient hexadecimal digits.
at System.Text.RegularExpressions.RegexParser.ScanHex(Int32 c)
at System.Text.RegularExpressions.RegexParser.ScanCharEscape()
at System.Text.RegularExpressions.RegexParser.ScanCharClass(Boolean caseInsensitive, Boolean scanOnly)
at System.Text.RegularExpressions.RegexParser.CountCaptures()
at System.Text.RegularExpressions.RegexParser.Parse(String pattern, RegexOptions options, CultureInfo culture)
at System.Text.RegularExpressions.Regex..ctor(String pattern, RegexOptions options, TimeSpan matchTimeout, Boolean addToCache)
at System.Text.RegularExpressions.Regex..ctor(String pattern)
at ...
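The exception arises because .NET's regex syntax has no \x{...} form: \x takes exactly two hexadecimal digits and \u exactly four. One possible way to express the intended "either face" match, sketched here under the assumption that an alternation is acceptable in place of a character class, is to build each character with char.ConvertFromUtf32 (the variable names are illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

string thinkingFace = char.ConvertFromUtf32(0x1F914); // "\uD83E\uDD14"
string tearsOfJoy = char.ConvertFromUtf32(0x1F602);   // "\uD83D\uDE02"

// Alternation instead of a character class, so each surrogate pair stays together.
var eitherFace = new Regex($"^(?:{Regex.Escape(thinkingFace)}|{Regex.Escape(tearsOfJoy)})$");

Console.WriteLine(eitherFace.IsMatch(thinkingFace)); // True
Console.WriteLine(eitherFace.IsMatch(tearsOfJoy));   // True
Console.WriteLine(eitherFace.IsMatch("x"));          // False
```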
Yes, non-BMP chars are supported in C#. I can include that char directly in a regex and it works:
var regex = new Regex(@"😁");
var text = @"😁";
Assert.IsTrue(regex.IsMatch(text));
It even matches it as a single char (for . anyway):
var regex = new Regex(@".");
var text = @"😁";
Assert.IsTrue(regex.IsMatch(text));
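A caveat on the test above: IsMatch with an unanchored . succeeds as soon as any one code unit matches, so it does not show that the whole character was consumed. A quick sketch that inspects what was actually matched (variable names are illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

string grin = "\uD83D\uDE01"; // 😁 U+1F601 as a surrogate pair

Match m = Regex.Match(grin, ".");
Console.WriteLine(m.Value.Length);                   // 1: only one code unit was consumed
Console.WriteLine(char.IsHighSurrogate(m.Value[0])); // True: the match is half of the pair
Console.WriteLine(Regex.IsMatch(grin, "^.$"));       // False
Console.WriteLine(Regex.IsMatch(grin, "^..$"));      // True
```

Anchoring the pattern makes the code-unit behavior visible.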
It looks like .Net's Regex engine might just be buggy.
This passes
var regex = new Regex(@"^🐲*$");
var jsonText = JsonNode.Parse("\"🐲\"");
var text = jsonText.GetValue<string>();
Assert.IsTrue(regex.IsMatch(text));
but this doesn't
var regex = new Regex(@"^🐲*$");
var jsonText = JsonNode.Parse("\"🐲🐲\"");
var text = jsonText.GetValue<string>();
Assert.IsTrue(regex.IsMatch(text));
Basically, the * doesn't match on the multiple bytes correctly. The behavior is the same whether the 🐲 is explicit as I have above or hex-encoded with a surrogate pair.
Probably just a .Net issue. Still, I can't fully support i-regexp because of this limitation. I expect it's probably sufficient to call that out in my docs.
var regex = new Regex(@"^🐲*$"); var jsonText = JsonNode.Parse("\"🐲🐲\"");
That can be fixed by replacing 🐲 with (🐲). However, a fix is not quite as easy with characters in character classes, [abc🐲] would need to be replaced by ([abc]|🐲). Doing negative character classes probably requires using lookahead assertions. This all can be done by taking apart and putting back together the RE, but is a far cry from the relatively simple textual substitutions that PCRE or ECMAScript require.
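The two rewrites described above can be checked directly. This is a sketch only; it uses non-capturing groups (?:...) rather than the capturing groups in the comment, an assumption made here so that the translation does not alter capture numbering.

```csharp
using System;
using System.Text.RegularExpressions;

string two = "🐲🐲";

// Quantifier applied to the bare character binds only to the low surrogate.
bool bare = Regex.IsMatch(two, "^🐲*$");
// Grouping keeps the surrogate pair together under the quantifier.
bool grouped = Regex.IsMatch(two, "^(?:🐲)*$");

// Inside a character class the pair becomes two separate code-unit members.
bool inClass = Regex.IsMatch("🐲", "^[abc🐲]$");
// Pulling the non-BMP character out into an alternation restores the intent.
bool alternation = Regex.IsMatch("🐲", "^(?:[abc]|🐲)$");

Console.WriteLine($"{bare} {grouped} {inClass} {alternation}"); // False True False True
```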
That can be fixed by replacing 🐲 with (🐲).
It's fine that there is a workaround, but if just 🐲 is expected to work, then I can't expect my users to know to use (🐲).
Didn't mean to close.
That can be fixed by replacing 🐲 with (🐲).
It's fine that there is a workaround, but if just 🐲 is expected to work, then I can't expect my users to know to use (🐲).
This is not for specifiers of iregexp REs -- they do not have to know that certain platforms have certain problems. It would be something that an iregexp to dotnet RE translator would do (just like how an iregexp to ECMAScript RE translator would translate unescaped dots).
That can be fixed by replacing 🐲 with (🐲).
It can't be fixed for the problem in question, though.
The test is checking that a.b matches (e.g.) a🐲b. .Net's Regex can't do this because the . won't match on the 🐲. There's no workaround that I can think of for this. There's no translation of . that will make .Net's Regex accept a non-BMP char. (It seems to work fine when the . is on its own, though, which is weird.)
.Net's Regex can't do this because the . won't match on the 🐲
So your iregexp to dotnet translator needs to turn (unescaped, outside character classes) . into (\P{Cs}|\p{Cs}\p{Cs}) with a little bit of lookahead assertion added to remove \r and \n.
I still can't believe dotnet doesn't have a /re/u equivalent.
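A minimal sketch of the dot translation described above follows. `TranslateDots` is a hypothetical helper written for this illustration, and it assumes the input is already a valid iregexp (no nested brackets, every backslash followed by the character it escapes). It handles only the dot; character classes containing non-BMP characters would need separate treatment.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: rewrite each unescaped dot outside a character class.
string TranslateDots(string iregexp)
{
    // Any non-surrogate code unit except CR and LF, or two surrogate code units.
    const string anyScalarExceptNewline = @"(?:(?![\r\n])\P{Cs}|\p{Cs}\p{Cs})";
    var sb = new StringBuilder();
    bool inClass = false;
    for (int i = 0; i < iregexp.Length; i++)
    {
        char c = iregexp[i];
        if (c == '\\' && i + 1 < iregexp.Length)
        {
            sb.Append(c).Append(iregexp[++i]); // copy the escape unchanged
        }
        else if (c == '.' && !inClass)
        {
            sb.Append(anyScalarExceptNewline);
        }
        else
        {
            if (c == '[') inClass = true;
            else if (c == ']') inClass = false;
            sb.Append(c);
        }
    }
    return sb.ToString();
}

string dotnetPattern = "^" + TranslateDots("a.b") + "$";
Console.WriteLine(Regex.IsMatch("a🐲b", dotnetPattern)); // True
Console.WriteLine(Regex.IsMatch("a\nb", dotnetPattern)); // False
Console.WriteLine(Regex.IsMatch("a🐲b", "^a.b$"));       // False: untranslated dot
```

The sketch trusts that surrogates in the input are correctly paired; it does not check that a high surrogate is followed by a low one.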
So your iregexp to dotnet translator needs to turn (unescaped, outside character classes) . into (\P{Cs}|\p{Cs}\p{Cs}) with a little bit of lookahead assertion added to remove \r and \n.
I don't think that most developers (specifically implementors of JSON Path) are going to be well-versed enough in regular expressions to be able to figure this kind of thing out. It works, but I don't understand what \P{Cs} is. This seems like a rather advanced regex use to me.
I don't understand what \P{Cs} is
You are excused, because that is not part of iregexp (surrogate code points are irrelevant to iregexp).
(\p{..} and \P{..} are important when regular expressions are used with Unicode, as they express concepts such as numbers, letters, symbols etc.)
I'm afraid there is no way to remove the complexity of basing an iregexp implementation on a regex flavor with limited Unicode support. Fortunately, iregexp is easy to parse (= take apart), and a tool to put back together the iregexp in another flavor is not that complex either. I would prefer to be able to point to mine here, but I haven't done the work yet...
~Unicode characters~ Surrogate pairs are not supported in .Net. See example here.
The main goal of this specification as stated is to provide an interoperable subset of major Regex implementations (from the abstract, emphasis mine):
Mention of full unicode support needs to be ~removed~ downgraded to UTF-8 support.
Ref: https://github.com/jsonpath-standard/jsonpath-compliance-test-suite/pull/30/files#r1162157209
I'm not sure of the reasoning behind adding this requirement.