jsdom / tr46

An implementation of the Unicode UTS #46: Unicode IDNA Compatibility Processing.
MIT License

Split out findStatus() to separate file (for browserify) #9

Closed stevenvachon closed 7 years ago

stevenvachon commented 7 years ago

...so that it can be aliased to always return "valid" for lighter browser builds. I can PR this if you agree with the decision.

Sebmaster commented 7 years ago

That doesn't really make sense, I think. At that point the processing function degenerates to a string.normalize() call, the toUnicode function is basically a no-op, and you can shim the toASCII method really easily by just splitting the string into labels and applying punycode manually.

stevenvachon commented 7 years ago

toASCII() uses punycode.toASCII(), and toUnicode() appears to use punycode.toUnicode().

Sebmaster commented 7 years ago

> toUnicode() appears to use punycode.toUnicode()

It doesn't.

> toASCII() uses punycode.toASCII()

Yeah, and that's (basically) the only thing it does if you always return valid from findStatus. You might as well write a shim JS file that just calls into punycode yourself, which removes the need for tr46 entirely.

stevenvachon commented 7 years ago

That doesn't make sense to me. toASCII() converts to punycode and toUnicode() never converts back?

What does this do?: https://github.com/Sebmaster/tr46.js/blob/master/index.js#L109 (my npm installed file uses toUnicode() instead, by the way)

Sebmaster commented 7 years ago

Ah yeah, I overlooked that. Point still stands though. Just shim both methods with a split('.').map(l => punycode.en/decode(l))?

stevenvachon commented 7 years ago

I end up with strange results: suffixed "-" characters and a missing "xn--" prefix.

Edit: Hmm, when switching back from encode/decode to toASCII/toUnicode, it works.

Thank you, kindly.
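
For anyone landing here later, a minimal sketch of why encode/decode gave those results while toASCII/toUnicode worked. It assumes Node's built-in punycode module (deprecated; the userland package is loaded with require("punycode/")), and the shim object is a hypothetical stand-in, not tr46's API. It performs no mapping or validation.

```javascript
// Assumption: Node's built-in (deprecated) punycode module is available.
const punycode = require("punycode");

// encode()/decode() work on one label's raw code points and know nothing
// about the "xn--" ACE prefix. An all-ASCII label just gets the delimiter.
const rawAscii = punycode.encode("example");   // "example-"
const rawUnicode = punycode.encode("mañana");  // "maana-pta"

// toASCII()/toUnicode() take a whole domain, convert only the labels that
// need it, and add or strip "xn--" themselves.
// Hypothetical shim object; it skips all UTS #46 processing.
const shim = {
  toASCII: (domain) => punycode.toASCII(domain),
  toUnicode: (domain) => punycode.toUnicode(domain),
};

const ascii = shim.toASCII("mañana.com");            // "xn--maana-pta.com"
const unicode = shim.toUnicode("xn--maana-pta.com"); // "mañana.com"
```

So there is no need to split on "." by hand: the domain-level functions already do that.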

stevenvachon commented 7 years ago

How different would toLowerCase() and punycode's toASCII() be from the TR46 spec? I'm looking for specific examples to document. The spec mentions that case-folding is different than lowercasing, "particularly for Cherokee characters". What would be an example of this, and are there other particulars besides Cherokee?

Sebmaster commented 7 years ago

Check out the actual mapping table. Anything that has "mapped" as its type is transformed to another character (there are some special cases for std3/transitional processing, but "mapped" is a safe bet). You gotta figure out when it does not match up with toLowerCase though.
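
One way that comparison could be sketched: take each "mapped" row and check it against the engine's own toLowerCase()/toUpperCase(). The function name classifyMapping and the three sample rows are illustrative assumptions written by hand; a real run would parse every row of IdnaMappingTable.txt, and the results depend on the Unicode version the JavaScript engine ships with.

```javascript
// Hypothetical helper: compare one mapping-table row with JavaScript casing.
function classifyMapping(codePoint, mappedTo) {
  const source = String.fromCodePoint(codePoint);
  if (source.toLowerCase() === mappedTo) return "lowercase";
  if (source.toUpperCase() === mappedTo) return "uppercase";
  return "other"; // neither casing operation reproduces the mapping
}

// Hand-written sample rows (assumed for illustration, not parsed from the table).
const sampleRows = [
  [0x0041, "\u0061"], // LATIN CAPITAL LETTER A -> a
  [0x00b5, "\u03bc"], // MICRO SIGN -> GREEK SMALL LETTER MU
  [0xab7d, "\u13ad"], // CHEROKEE SMALL LETTER HA -> CHEROKEE LETTER HA
];

const classified = sampleRows.map(([cp, mapped]) => ({
  codePoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
  kind: classifyMapping(cp, mapped),
}));

console.log(classified);
```

Rows that come back as anything other than "lowercase" are the ones a plain toLowerCase() would get wrong.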

stevenvachon commented 7 years ago

Awesome. There're only 86 mappings that don't correspond to lowercasing. We could maybe lighten this library by using an array (or memoized map, for performance) of characters:

const specialCases = ["Ꭽ"];

if (specialCases.includes( domain_name[i] )) {
  processed += domain_name[i];
} else {
  processed += domain_name[i].toLowerCase();
}

Iterating through a long mapping table wouldn't be necessary.
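
A slightly fuller sketch of the same idea, assuming a Set for constant-time lookups and for...of iteration so that characters outside the BMP are not split into surrogate halves. The function name mapDomain is hypothetical, and the exception list holds only the one character from the snippet above; a real list would be generated from the mapping table. Std3 and transitional processing are not handled.

```javascript
// Hypothetical exception list: characters that must NOT be lowercased.
// Only "Ꭽ" (U+13AD) is taken from the snippet above.
const specialCases = new Set(["Ꭽ"]);

function mapDomain(domainName) {
  let processed = "";
  // for...of walks code points, unlike domainName[i], which walks UTF-16 units.
  for (const ch of domainName) {
    processed += specialCases.has(ch) ? ch : ch.toLowerCase();
  }
  return processed;
}

console.log(mapDomain("EXAMPLE.com")); // "example.com"
console.log(mapDomain("Ꭽ"));           // "Ꭽ" (left as-is)
```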

Sebmaster commented 7 years ago

We will also need to check if there are any uppercase chars which should not be mapped and we'll have to think about the std3 and transitional processing as well.

stevenvachon commented 7 years ago

Oops, I had the numbers wrong. There're 992 lowercasings, 86 uppercasings and 4513 foldings.