Closed dlindberg closed 6 years ago
Will need to also add ext-mbstring
as a requirement to composer.json
. Whether or not that requires a major version bump is not entirely clear, although I tend to favor major version bump as requiring ext-mbstring
will cause upgrades to fail if users do not have that installed/enabled in their environment(s).
@dlindberg I strongly disagree with this addition did you test the script I gave you in #216 . Again the bug is not on the package side but on how you are downloading the PSL. using mb_strtolower
won't change the fact that the file is "corrupt" to begin with.
I'll add my snippet here for illustration
<?php
$str = <<<EOF
// ===BEGIN ICANN DOMAINS===
aéroport.ci
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
// ===END PRIVATE DOMAINS===
EOF;
$str_lower = str_replace("aéroport.ci", strtolower("a\xE3\xA9roport.ci"), $str);
$str_mb_lower = str_replace("aéroport.ci", mb_strtolower("a\xE3\xA9roport.ci"), $str);
$converter = new Converter();
echo '<pre>', PHP_EOL;
var_export([
'raw' => $converter->convert($str),
// 'strtolower' => $converter->convert($str_lower),
'mb_strtolower' => $converter->convert($str_mb_lower),
]);
I've commented the 'strtolower' result as it throw an Exception on the dev branch which is the correct behaviour . Using mb_strtolower
does not fix the issue it only add more problem by replacing é
by ?
On the other hand I can improve conversion by using strtolower
after idn_to_ascii
conversion as the function already convert to lower case label containing UTF-8 characters but it will never fix a corrupt PSL .
Hmm, I had thought that ext-mbstring
was a default extension, but it appears it is not. There is a potential logic issue on strtolower()
after encoding, punnycode is case sensitive so doing that could cause false negatives.
However, I really don't think the file is corrupted, I think it is becoming corrupted when the mb string hits the non mb strtolower()
function which is before the IDN converter that is throwing an exception. The php manual recommends against using strtolower()
because it can have unexpected results and that the mb equivalents should be used when working with a multibyte character format. And recommends using the ext-mbstring
if you are working with mb strings, or even if you are't, It usually doesn't matter much, but there are so many mb characters in the PSL that it seems like moving towards a fully multibyte code path would make sense. My read on the warning in the manual is that it seems like using non multibyte strtolower could have the same potential logic problem of false negatives as waiting until after the text is already US-ASCII
and or "unpredictable" behavior.
Still, using mb_
does seem like something for the next major revision, as it is a more drastic change than I had though.
I ran across another way to lowercase a multibyte string in php while I was triple checking character by character the encoding of the file I was using. I'll take a closer look at that and see if it might be a better option.
@dlindberg I think you are missing some key elements here.
idn_to_ascii
takes care of all the case sensitive issue on non-ASCII characters see https://3v4l.org/iTRPJ for instance after all, domain name are case insensitive and are preferred lowercased for maximum interoperability. That's why strtolower
or mb_strtolower
are not needed for IDN. mb_strtolower
has another thing going against its usage, it has a hidden second parameter that when omitted can modify the function output depending on the client server configuration see mb_strtolower.Pdp\Converter
expects a PSL encoded in utf-8
which is simply not the case in your issue as I demonstrated earlier.
That's why I believe that your patch is trying to fix a non existing issue.In that case my concern on false negatives was unfounded.
The hidden parameter is a good point, the other function I was thinking of is also in the ext-mbstring
extension, so does not resolve the dependency issue. But it is a much more explicit / better in the un-expected behaviors department. mb_convert_case($string, MB_CASE_LOWER, 'UTF-8')
where all three parameters are required.
It seems like you have a better solution; but, I don't think I'm imagining things. Using strtolower()
on a multibyte characters can be unpredictable. Solving that issue is the intent of the ext-mbstring
extension..
To your example, from the command line:
#!/usr/bin/env php
<?php
$string = 'éééABcdéÉÁກ大众汽车';
echo $string . PHP_EOL;
echo strtolower($string) . PHP_EOL;
echo mb_convert_case($string, MB_CASE_LOWER, 'UTF-8') . PHP_EOL;
echo idn_to_ascii(strtolower($string), 0, INTL_IDNA_VARIANT_UTS46, $arr) . PHP_EOL;
var_dump($arr);
echo idn_to_ascii(mb_strtolower($string), 0, INTL_IDNA_VARIANT_UTS46, $arr) . PHP_EOL;
var_dump($arr);
Outputs:
éééABcdéÉÁກ大众汽车
���abcd���ກ大众汽车
éééabcdééáກ大众汽车
/Users/example:10:
array(3) {
'result' =>
string(37) "���abcd���ກ大众汽车"
'isTransitionalDifferent' =>
bool(false)
'errors' =>
int(128)
}
xn--abcd-8na7baaea5939a273m1l1abu0dfp8f
/Users/example:12:
array(3) {
'result' =>
string(39) "xn--abcd-8na7baaea5939a273m1l1abu0dfp8f"
'isTransitionalDifferent' =>
bool(false)
'errors' =>
int(0)
}
@dlindberg where you are correct is in the usage of strtolower
on non-ASCII characters that may introduce some type of error. But again this is tangent to your issue. your main issue is getting the PSL in proper utf-8 format.
I've already address the strtolower
issue by making it sure it is only used on ASCII characters throughout the code (see the develop branch). So again there is no need to add mbstring
dependency when it is not needed at all.
Yes, you are correct, the new build works, all tests passing. I sometimes get a little stuck on the why of things in a not particularly helpful way. As in I really do want to know why strtolower
is mangling multibyte strings on my development machine and not a vm I spun up and literally diffed the output of php -i
. I even thought for a little while that mbstring.func_overload
got flipped on someplace by accident, but no... Anyway, thats really not important right now. I really do want to thank both you and Jeremy, the code quality on this project is really top notch, and I really appreciate both your responsiveness and patience in helping me with this.
Introduction
Issue #216 issues with the parser failing when
idnToAscii();
encounters illegal characters from an apparently good download of public_suffix_list.dat.Proposal
Use multibyte strtolower();
Describe the new/upated/fixed feature
Change from
strtolower()
tomb_strtolower
for the parser.Backward Incompatible Changes.
None known, function is recommended in the php manual for working with multibyte unicode strings.
Targeted release version
Next patch release.
PR Impact
Possibility that it will resolve some other edge behaviors where strings may be mangled.
Open issues
No open issues, relates to closed question #216. Possible future project, update all string functions to their multibyte variants.