jsdom / tr46

An implementation of the Unicode UTS #46: Unicode IDNA Compatibility Processing.
MIT License
32 stars 14 forks source link

Fix bidi behaviour #13

Closed Sebmaster closed 6 years ago

Sebmaster commented 7 years ago

Bidi rules are hard.

There's a definition of a "bidi domain" in the UTS46 spec, however adding a check for that, breaks some IdnaTest.txt tests. @TimothyGu Any idea what this could be?

TimothyGu commented 7 years ago

Interesting, the definition for Bidi domain name was added in the published rev. 19 but not the rev. 18 draft I implemented. I'll take a look at this, though IdnaTest.txt is hard to get through.

TimothyGu commented 7 years ago

Okay, some serious WTF going on here. One of the error cases is:

B;  .f; [A4_2]; [A4_2]

which basically means "under Both transitional and nontransitional processing, both ToASCII and ToUnicode should error out at ToASCII step 4.2". This makes no sense. Why the heck would ToUnicode error out on a step it doesn't even have?

For reference, ToASCII step 4 is:

  1. If VerifyDnsLength flag is true, then verify DNS length restrictions. This may record an error. For more information, see [STD13] and [STD3].
    1. The length of the domain name, excluding the root label and its dot, is from 1 to 253.
    2. The length of each label is from 1 to 63.

The reason why ToUnicode doesn't have this step is because we can't get the length of the domain name that actually matters w/o first Punycode-encode it. But, I suppose when the length is 0 even in Unicode mode, we know that the label will be 0-length in ASCII as well. I'll look into creating such a work-around which will hopefully fix all the failing test cases, since all of them either have empty labels or don't have any labels (i.e. the empty string).

The code worked before, since the Bidi Rule has:

The following rule, consisting of six conditions, applies to labels in Bidi domain names. ...

  1. The first character must be a character with Bidi property L, R, or AL.

which implies that there must be at least one character in every label.

We should definitely look into filing an erratum to Unicode.

TimothyGu commented 7 years ago

Patch I had in mind, that fixes all the tests:

From f5342a0bd656b702a905b0e16045423ea4382c41 Mon Sep 17 00:00:00 2001
From: Timothy Gu <timothygu99@gmail.com>
Date: Fri, 28 Jul 2017 23:29:06 +0800
Subject: [PATCH] Add verifyDNSLength to toUnicode

Non-spec. Only checks if the domain or the label is empty, as that's the
only thing we *can* check. Needed to get IdnaTest.txt passing fully.
---
 index.js        | 20 ++++++++++++++++++--
 test/unicode.js |  3 ++-
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/index.js b/index.js
index 1872435..40d76a6 100644
--- a/index.js
+++ b/index.js
@@ -265,7 +265,8 @@ function toUnicode(domainName, {
   checkHyphens = false,
   checkBidi = false,
   checkJoiners = false,
-  useSTD3ASCIIRules = false
+  useSTD3ASCIIRules = false,
+  verifyDNSLength = false
 } = {}) {
   const result = processing(domainName, {
     processingOption: "nontransitional",
@@ -274,10 +275,25 @@ function toUnicode(domainName, {
     checkJoiners,
     useSTD3ASCIIRules
   });
+  let hasError = result.error;
+
+  if (!hasError && verifyDNSLength) {
+    const total = result.string.length;
+    if (total === 0) {
+      hasError = true;
+    } else {
+      for (const label of result.string.split(".")) {
+        if (label.length === 0) {
+          hasError = true;
+          break;
+        }
+      }
+    }
+  }

   return {
     domain: result.string,
-    error: result.error
+    error: hasError
   };
 }

diff --git a/test/unicode.js b/test/unicode.js
index 72d1aab..ce9d9d2 100644
--- a/test/unicode.js
+++ b/test/unicode.js
@@ -59,7 +59,8 @@ function testConversion(test) {
       checkHyphens: true,
       checkBidi: true,
       checkJoiners: true,
-      useSTD3ASCIIRules: true
+      useSTD3ASCIIRules: true,
+      verifyDNSLength: true
     });
     if (test[2][0] === "[") { // Error code
       assert.ok(res.error, "ToUnicode should result in an error");
-- 
2.11.0

Also re "Bidi domain name", RFC 5893 provides a much more satisfactory definition of it that IMO would've prevented the errors in 1a08bf3c3cd430a6da21bdd4be79ef0344d44ce6:

An RTL label is a label that contains at least one character of type R, AL, or AN.

An LTR label is any label that is not an RTL label.

A "Bidi domain name" is a domain name that contains at least one RTL label. (Note: This definition includes domain names containing only dots and right-to-left characters. Providing a separate category of "RTL domain names" would not make this specification simpler, so it has not been done.)

Compare with UTS#46:

A Bidi domain name is a domain name containing at least one character with Bidi_Class R, AL, or AN. See [IDNA2008] RFC 5893, Section 1.4.

which even though is "correct" doesn't make it clear that the check should be label-based.

TimothyGu commented 7 years ago

Wrote a corrigendum to Unicode through their feedback form:

Issues with UTS #46's conformance test file > To whoever it may concern, > > While developing a product conforming to "UTS#46: Unicode IDNA Compatibility Processing, Version 10.0.0" [[UTS46]], we noticed a few issues with the provided conformance testing file (IdnaTest.txt). These issues are preventing us from implementing UTS#46 in tr46.js [[TR46JS-ISSUE]]. > > The IdnaTest.txt file is formatted as a list of semicolon-separated values. The meanings of the specific columns are given in UTS#46 Section 8.1, an excerpt of which is hereby reproduced [[UTS46]]: > > > No | Field | Description > > ----|---------|---------- > > ... > > 3 | toUnicode | The result of applying toUnicode to the source, using "nontransitional".
A blank value means the same as the source value; a value in [...] is a set of error codes. > > 4 | toASCII | The result of applying toASCII to the source, using the specified type: T, N, or B.
A blank value means the same as the toUnicode value; a value in [...] is a set of error codes. > > > > ... > > > > An error in toUnicode or toASCII is indicated by an error list of the form [...]. In such a case, the contents of that list are error codes based on the step numbers in UTS46 and IDNA2008: > > > > ... > > > > > An for Section 4.2 ToASCII, step n > > > > ... > > Given that "An" applies only to the ToASCII algorithm, not the ToUnicode algorithm, it seems appropriate for field "toUnicode" in IdnaTest.txt to never have an error code of form An. Yet, in the published IdnaTest.txt file corresponding to version 10.0.0 [[IDNA-TEST]], there exist 305 entries in IdnaTest.txt where an "An" error code appears under "toUnicode". In particular, there exist 36 entries with _only_ an "An" error code under "toUnicode" -- which, in other words, means that the only justification for erroring on those entries from ToUnicode is not actually in ToUnicode. > > This is particularly troubling, since while the Standard allows for ADDITIONAL error cases than ones already specified in IdnaTest.txt, a product conforming to UTS#46 must produce an error on ALL error cases in IdnaTest.txt, per lines 68-72 of IdnaTest.txt, again reproduced below: > > > ... Thus to then verify conformance for the toASCII and toUnicode columns: > > > > - If the file indicates an error, the implementation must also have an error. > > - If the file does not indicate an error, then the implementation must either have an error, or must have a matching result. > > A close examination of the 36 entries mentioned above reveals that: > > - 9 of the 36 entries have only "[A3]" error code under ToUnicode, which corresponds to the Punycode-encoding step in ToASCII. The source domains all have one label with invalid Punycode-encoding though, so they would in fact have already recorded an error in no. 4 of Processing Steps, which is called upon by ToUnicode as well. In other words, these entries merely have a faulty error code; ToUnicode would still record an error for these entries, just one at a different step than advertised. > > Some samples from these 9 entries are: > > Line 313: B; xn--0.pt; [A3]; [A3] > Line 315: B; xn--a-Ä.pt; [A3]; [A3] > Line 316: B; xn--a-A\u0308.pt; [A3]; [A3] > > - The other 27 entries have only a "[A4_2]" error code under ToUnicode, which corresponds to the DNS length verification step under ToASCII. Some of them are: > > Line 201: B; 。; [A4_2]; [A4_2] > Line 202: B; .; [A4_2]; [A4_2] > Line 434: B; a..c; [A4_2]; [A4_2] > Line 439: B; ä.\u00AD.c; [A4_2]; [A4_2] > > While these domain names are all rather unlikely to be allowed by real-world UTS#46 implementations, most (if not all) of them are still strictly allowed by ToUnicode as defined in UTS#46. > > Take line 201, for example. Step 1 of ToUnicode call into the Processing Steps, whose step 1 will map '。' to '.', and which will then pass through the rest of Processing Steps without recording an error. Step 2 of ToUnicode will then produce a "converted Unicode string" of '.', and signal there was no error. > > The 27 entries in IdnaTest.txt with [A4_2] are the real worrying ones, since they seem to go against the algorithms defined in UTS#46, and prevent us from creating a strict implementation of UTS#46 without passing its own conformance tests. > > To resolve these issues, I would like to see the following: > > - A clarification whether the aforementioned 27 entries should record an error in ToUnicode. > - Corresponding changes to IdnaTest.txt or UTS#46 that accompany that clarification. > - There be no entries in IdnaTest.txt with a ToUnicode error code that point to steps in ToASCII. > > Sincerely, > > Timothy Gu > > [UTS46]: http://www.unicode.org/reports/tr46/tr46-19.html > [IDNA-TEST]: http://www.unicode.org/Public/idna/10.0.0/IdnaTest.txt > [TR46JS-ISSUE]: https://github.com/Sebmaster/tr46.js/pull/13

The feedback page says:

Each report is reviewed by a staff member in our office. You can expect an acknowledgement of your report within 2-3 business days.

Sebmaster commented 6 years ago

@TimothyGu Added skipping of toUnicode tests if they fail due to VerifyDnsLength errors. You okay with this?

TimothyGu commented 6 years ago

I had an earlier patch:

2017-08-22 23:47:48     TimothyGu       Sebmaster: hey uh, I don't quite have commit access to tr46.js just yet, but I've created a patch that skips the buggy ToUnicode tests here: http://sprunge.us/fTGD

I would prefer that patch over the current one.

Sebmaster commented 6 years ago

Ah, I like this one much better as well.