fix: hostname corruption of PSL data

dlindberg commented 6 years ago

Introduction

Issue #216 issues with the parser failing when idnToAscii(); encounters illegal characters from an apparently good download of public_suffix_list.dat.

Proposal

Use multibyte strtolower();

Describe the new/upated/fixed feature

Change from strtolower() to mb_strtolower for the parser.

Backward Incompatible Changes.

None known, function is recommended in the php manual for working with multibyte unicode strings.

Targeted release version

Next patch release.

PR Impact

Possibility that it will resolve some other edge behaviors where strings may be mangled.

Open issues

No open issues, relates to closed question #216. Possible future project, update all string functions to their multibyte variants.

jeremykendall commented 6 years ago

Will need to also add ext-mbstring as a requirement to composer.json. Whether or not that requires a major version bump is not entirely clear, although I tend to favor major version bump as requiring ext-mbstring will cause upgrades to fail if users do not have that installed/enabled in their environment(s).

nyamsprod commented 6 years ago

@dlindberg I strongly disagree with this addition did you test the script I gave you in #216 . Again the bug is not on the package side but on how you are downloading the PSL. using mb_strtolower won't change the fact that the file is "corrupt" to begin with.

nyamsprod commented 6 years ago

I'll add my snippet here for illustration

<?php

$str = <<<EOF
// ===BEGIN ICANN DOMAINS===
aéroport.ci
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
// ===END PRIVATE DOMAINS===
EOF;

$str_lower = str_replace("aéroport.ci", strtolower("a\xE3\xA9roport.ci"), $str);
$str_mb_lower = str_replace("aéroport.ci", mb_strtolower("a\xE3\xA9roport.ci"), $str);

$converter = new Converter();
echo '<pre>', PHP_EOL;
var_export([
    'raw' => $converter->convert($str),
//    'strtolower' => $converter->convert($str_lower),
    'mb_strtolower' => $converter->convert($str_mb_lower),
]);

I've commented the 'strtolower' result as it throw an Exception on the dev branch which is the correct behaviour . Using mb_strtolower does not fix the issue it only add more problem by replacing é by ?

On the other hand I can improve conversion by using strtolower after idn_to_ascii conversion as the function already convert to lower case label containing UTF-8 characters but it will never fix a corrupt PSL .

dlindberg commented 6 years ago

Hmm, I had thought that ext-mbstring was a default extension, but it appears it is not. There is a potential logic issue on strtolower() after encoding, punnycode is case sensitive so doing that could cause false negatives.

However, I really don't think the file is corrupted, I think it is becoming corrupted when the mb string hits the non mb strtolower() function which is before the IDN converter that is throwing an exception. The php manual recommends against using strtolower() because it can have unexpected results and that the mb equivalents should be used when working with a multibyte character format. And recommends using the ext-mbstring if you are working with mb strings, or even if you are't, It usually doesn't matter much, but there are so many mb characters in the PSL that it seems like moving towards a fully multibyte code path would make sense. My read on the warning in the manual is that it seems like using non multibyte strtolower could have the same potential logic problem of false negatives as waiting until after the text is already US-ASCII and or "unpredictable" behavior.

Still, using mb_ does seem like something for the next major revision, as it is a more drastic change than I had though.

I ran across another way to lowercase a multibyte string in php while I was triple checking character by character the encoding of the file I was using. I'll take a closer look at that and see if it might be a better option.

nyamsprod commented 6 years ago

@dlindberg I think you are missing some key elements here.

IDNA conversion done using idn_to_ascii takes care of all the case sensitive issue on non-ASCII characters see https://3v4l.org/iTRPJ for instance after all, domain name are case insensitive and are preferred lowercased for maximum interoperability. That's why strtolower or mb_strtolower are not needed for IDN.
mb_strtolower has another thing going against its usage, it has a hidden second parameter that when omitted can modify the function output depending on the client server configuration see mb_strtolower.
Last but not least, the Pdp\Converter expects a PSL encoded in utf-8 which is simply not the case in your issue as I demonstrated earlier. That's why I believe that your patch is trying to fix a non existing issue.

dlindberg commented 6 years ago

In that case my concern on false negatives was unfounded.

The hidden parameter is a good point, the other function I was thinking of is also in the ext-mbstring extension, so does not resolve the dependency issue. But it is a much more explicit / better in the un-expected behaviors department. mb_convert_case($string, MB_CASE_LOWER, 'UTF-8') where all three parameters are required.

It seems like you have a better solution; but, I don't think I'm imagining things. Using strtolower() on a multibyte characters can be unpredictable. Solving that issue is the intent of the ext-mbstring extension..

To your example, from the command line:

#!/usr/bin/env php
<?php

$string = 'éééABcdéÉÁກ大众汽车';
echo $string . PHP_EOL;
echo strtolower($string) . PHP_EOL;
echo mb_convert_case($string, MB_CASE_LOWER, 'UTF-8') . PHP_EOL;

echo idn_to_ascii(strtolower($string), 0, INTL_IDNA_VARIANT_UTS46, $arr) . PHP_EOL;
var_dump($arr);
echo idn_to_ascii(mb_strtolower($string), 0, INTL_IDNA_VARIANT_UTS46, $arr) . PHP_EOL;
var_dump($arr);

Outputs:

éééABcdéÉÁກ大众汽车
���abcd���ກ大众汽车
éééabcdééáກ大众汽车

/Users/example:10:
array(3) {
  'result' =>
  string(37) "���abcd���ກ大众汽车"
  'isTransitionalDifferent' =>
  bool(false)
  'errors' =>
  int(128)
}
xn--abcd-8na7baaea5939a273m1l1abu0dfp8f
/Users/example:12:
array(3) {
  'result' =>
  string(39) "xn--abcd-8na7baaea5939a273m1l1abu0dfp8f"
  'isTransitionalDifferent' =>
  bool(false)
  'errors' =>
  int(0)
}

nyamsprod commented 6 years ago

@dlindberg where you are correct is in the usage of strtolower on non-ASCII characters that may introduce some type of error. But again this is tangent to your issue. your main issue is getting the PSL in proper utf-8 format. I've already address the strtolower issue by making it sure it is only used on ASCII characters throughout the code (see the develop branch). So again there is no need to add mbstring dependency when it is not needed at all.

dlindberg commented 6 years ago

Yes, you are correct, the new build works, all tests passing. I sometimes get a little stuck on the why of things in a not particularly helpful way. As in I really do want to know why strtolower is mangling multibyte strings on my development machine and not a vm I spun up and literally diffed the output of php -i. I even thought for a little while that mbstring.func_overload got flipped on someplace by accident, but no... Anyway, thats really not important right now. I really do want to thank both you and Jeremy, the code quality on this project is really top notch, and I really appreciate both your responsiveness and patience in helping me with this.

jeremykendall / php-domain-parser