Closed by Sebmaster 6 years ago
Interesting, the definition for Bidi domain name was added in the published rev. 19 but not the rev. 18 draft I implemented. I'll take a look at this, though IdnaTest.txt is hard to get through.
Okay, some serious WTF going on here. One of the error cases is:
B; .f; [A4_2]; [A4_2]
which basically means "under Both transitional and nontransitional processing, both ToASCII and ToUnicode should error out at ToASCII step 4.2". This makes no sense. Why the heck would ToUnicode error out on a step it doesn't even have?
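To make the column layout of that line concrete, here is a minimal sketch of splitting one IdnaTest.txt data line into its fields. The helper names `parseIdnaTestLine` and `expectsError` are made up for illustration and are not part of tr46.js; the sketch also ignores details of the real file such as `\uXXXX` escapes in the source column and blank columns.

```javascript
// Hypothetical helper (not from tr46.js): split one IdnaTest.txt data line
// into its columns, read as "type; source; toUnicode; toASCII".
function parseIdnaTestLine(line) {
  // Drop a trailing "# comment" and surrounding whitespace.
  const data = line.split("#")[0].trim();
  if (data === "") {
    return null; // blank or comment-only line
  }
  const [type, source, toUnicode, toASCII] = data.split(";").map(s => s.trim());
  return { type, source, toUnicode, toASCII };
}

// An expected value written as "[...]" is a list of error codes, not a domain.
function expectsError(column) {
  return column.startsWith("[");
}

const parsed = parseIdnaTestLine("B; .f; [A4_2]; [A4_2]");
console.log(parsed.type, parsed.source, expectsError(parsed.toUnicode), expectsError(parsed.toASCII));
```

For the line quoted above, both the ToUnicode and the ToASCII columns are error lists, which is exactly the surprising part.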
For reference, ToASCII step 4 is:
- If VerifyDnsLength flag is true, then verify DNS length restrictions. This may record an error. For more information, see [STD13] and [STD3].
- The length of the domain name, excluding the root label and its dot, is from 1 to 253.
- The length of each label is from 1 to 63.
ToUnicode doesn't have this step because we can't get the length of the domain name that actually matters without first Punycode-encoding it. But I suppose that when the length is 0 even in Unicode form, we know the label will be 0-length in ASCII as well. I'll look into creating such a workaround, which will hopefully fix all the failing test cases, since all of them either have empty labels or have no labels at all (i.e. the empty string).
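As a rough illustration of the difference, the sketch below puts the full step 4 check (only meaningful on an already-ASCII domain) next to the one part that can be decided before Punycode encoding. Both function names are hypothetical and this is not the tr46.js code. Note that `hasEmptyLabel` treats a trailing dot as an empty label; whether the root label should be excluded there is a choice the sketch does not try to settle.

```javascript
// Sketch of ToASCII step 4 on an already-ASCII (Punycode-encoded) domain.
// Returns true when the DNS length restrictions are violated.
function hasDnsLengthError(asciiDomain) {
  // Exclude the root label and its dot: "example.com." counts as "example.com".
  const name = asciiDomain.endsWith(".") ? asciiDomain.slice(0, -1) : asciiDomain;
  if (name.length < 1 || name.length > 253) {
    return true;
  }
  // Every label must be 1 to 63 characters long.
  return name.split(".").some(label => label.length < 1 || label.length > 63);
}

// The only part that can be checked on the Unicode form: emptiness.
// An empty Unicode label is still empty after Punycode encoding.
function hasEmptyLabel(unicodeDomain) {
  return unicodeDomain.length === 0 || unicodeDomain.split(".").includes("");
}

console.log(hasDnsLengthError(".f"), hasEmptyLabel(".f"));
console.log(hasDnsLengthError("example.com"), hasEmptyLabel("example.com"));
```

The upper bounds (63 and 253) are left out of `hasEmptyLabel` on purpose: a short Unicode label can still grow past 63 characters once encoded, so only the lower bound is safe to check early.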
The code worked before, since the Bidi Rule has:
The following rule, consisting of six conditions, applies to labels in Bidi domain names. ...
- The first character must be a character with Bidi property L, R, or AL.
which implies that there must be at least one character in every label.
We should definitely look into filing an erratum to Unicode.
The patch I had in mind, which fixes all the tests:
From f5342a0bd656b702a905b0e16045423ea4382c41 Mon Sep 17 00:00:00 2001
From: Timothy Gu <timothygu99@gmail.com>
Date: Fri, 28 Jul 2017 23:29:06 +0800
Subject: [PATCH] Add verifyDNSLength to toUnicode

Non-spec. Only checks if the domain or the label is empty, as that's the
only thing we *can* check. Needed to get IdnaTest.txt passing fully.
---
 index.js        | 20 ++++++++++++++++++--
 test/unicode.js |  3 ++-
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/index.js b/index.js
index 1872435..40d76a6 100644
--- a/index.js
+++ b/index.js
@@ -265,7 +265,8 @@ function toUnicode(domainName, {
   checkHyphens = false,
   checkBidi = false,
   checkJoiners = false,
-  useSTD3ASCIIRules = false
+  useSTD3ASCIIRules = false,
+  verifyDNSLength = false
 } = {}) {
   const result = processing(domainName, {
     processingOption: "nontransitional",
@@ -274,10 +275,25 @@ function toUnicode(domainName, {
     checkJoiners,
     useSTD3ASCIIRules
   });
+  let hasError = result.error;
+
+  if (!hasError && verifyDNSLength) {
+    const total = result.string.length;
+    if (total === 0) {
+      hasError = true;
+    } else {
+      for (const label of result.string.split(".")) {
+        if (label.length === 0) {
+          hasError = true;
+          break;
+        }
+      }
+    }
+  }
 
   return {
     domain: result.string,
-    error: result.error
+    error: hasError
   };
 }
 
diff --git a/test/unicode.js b/test/unicode.js
index 72d1aab..ce9d9d2 100644
--- a/test/unicode.js
+++ b/test/unicode.js
@@ -59,7 +59,8 @@ function testConversion(test) {
       checkHyphens: true,
       checkBidi: true,
       checkJoiners: true,
-      useSTD3ASCIIRules: true
+      useSTD3ASCIIRules: true,
+      verifyDNSLength: true
     });
     if (test[2][0] === "[") { // Error code
       assert.ok(res.error, "ToUnicode should result in an error");
--
2.11.0
Also re "Bidi domain name", RFC 5893 provides a much more satisfactory definition of it that IMO would've prevented the errors in 1a08bf3c3cd430a6da21bdd4be79ef0344d44ce6:
An RTL label is a label that contains at least one character of type R, AL, or AN.
An LTR label is any label that is not an RTL label.
A "Bidi domain name" is a domain name that contains at least one RTL label. (Note: This definition includes domain names containing only dots and right-to-left characters. Providing a separate category of "RTL domain names" would not make this specification simpler, so it has not been done.)
Compare with UTS#46:
A Bidi domain name is a domain name containing at least one character with Bidi_Class R, AL, or AN. See [IDNA2008] RFC 5893, Section 1.4.
which, even though it is "correct", doesn't make it clear that the check should be label-based.
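The RFC 5893 wording maps almost directly onto code. Below is a minimal sketch, assuming a drastically simplified stand-in for a real Bidi_Class lookup: the `RTL_CHAR` regex only covers Hebrew letters (R), basic Arabic letters (AL) and Arabic-Indic digits (AN), whereas a real implementation would need the full Unicode data. The function names are hypothetical and not taken from tr46.js.

```javascript
// Assumption: a tiny subset of the characters with Bidi_Class R, AL, or AN.
// U+05D0..U+05EA Hebrew letters (R), U+0620..U+064A Arabic letters (AL),
// U+0660..U+0669 Arabic-Indic digits (AN). Not a complete table.
const RTL_CHAR = /[\u05D0-\u05EA\u0620-\u064A\u0660-\u0669]/;

// RFC 5893: an RTL label contains at least one character of type R, AL, or AN.
function isRTLLabel(label) {
  return RTL_CHAR.test(label);
}

// RFC 5893: a Bidi domain name contains at least one RTL label.
function isBidiDomainName(domain) {
  return domain.split(".").some(isRTLLabel);
}

console.log(isBidiDomainName("example.com"));
console.log(isBidiDomainName("\u05D0\u05D1.com"));
```

Since a dot is never an RTL character, this returns the same boolean as scanning the whole string the way the UTS#46 sentence reads; the label-based form just makes the unit of the check explicit.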
Wrote a corrigendum to Unicode through their feedback form:
The feedback page says:
Each report is reviewed by a staff member in our office. You can expect an acknowledgement of your report within 2-3 business days.
@TimothyGu Added skipping of toUnicode tests if they fail due to VerifyDnsLength errors. You okay with this?
I had an earlier patch:
2017-08-22 23:47:48 TimothyGu Sebmaster: hey uh, I don't quite have commit access to tr46.js just yet, but I've created a patch that skips the buggy ToUnicode tests here: http://sprunge.us/fTGD
I would prefer that patch over the current one.
Ah, I like this one much better as well.
Bidi rules are hard.
There's a definition of a "bidi domain" in the UTS46 spec; however, adding a check for that breaks some IdnaTest.txt tests. @TimothyGu Any idea what this could be?