jeremykendall / php-domain-parser

Public Suffix List based domain parsing implemented in PHP
MIT License

Chinese or other foreign domains get seen as invalid #349

Closed CollectionAgency closed 1 year ago

CollectionAgency commented 1 year ago

Issue summary

The domain 漢字4.bbtest.net is being reported as an invalid domain.

Error I'm getting: The host ??4.bbtest.net is invalid: it contains invalid characters.

System information

(In case of a bug report Please complete the table below)

Information Description
Pdp version 6.1
PHP version 7.4
OS Platform Ubuntu 22.04

Standalone code, or other way to reproduce the problem

<?php
    use Pdp\Rules;
    use Pdp\Domain;

    class DomainParser {
        private $publicSuffixList;
        private $domainInfo;

        function __construct() {
            file_put_contents("public_suffix_list.dat", file_get_contents("https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat"));
            $this->publicSuffixList = Rules::fromPath('public_suffix_list.dat');
        }

        function parseDomain($domain) {

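            // Parse the host under IDNA2008 rules, then resolve the parsed
            // domain (not the raw string) against the Public Suffix List.
            $parsedDomain = Domain::fromIDNA2008($domain);

            $this->domainInfo   = $this->publicSuffixList->resolve($parsedDomain);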
            $rootDomain         = $this->domainInfo->registrableDomain()->toString();
            $subDomain          = $this->domainInfo->subDomain()->toString();

            return array(
                'rootdomain' => $rootDomain,
                'subdomain' => $subDomain,
                'domain' => $this->domainInfo->domain()->toString(),
                'suffix' => $this->domainInfo->suffix()->toString(),
                'secondleveldomain' => $this->domainInfo->secondLevelDomain()->toString(),
                'registrabledomain' => $this->domainInfo->registrableDomain()->toString(),
                'suffix_iana' => $this->domainInfo->suffix()->isIANA()
            );

        }
    }
?>

The class is used as follows to get the result; the call is wrapped in a try/catch block, of course.

<?php
    require "vendor/autoload.php";
    use Pdp\Rules;
    use Pdp\Domain;

    include("app/classes/DomainParser.php");

    $domainParser   = new DomainParser();
    $domain = "漢字4.bbtest.net";

    try {
        $parsedDomain = $domainParser->parseDomain(mb_convert_encoding($domain, 'ISO-8859-1','utf-8'));
    } catch(Exception $e) {
        echo $e->getMessage();
    }

?>

Expected result

parseDomain() returns the array of domain parts for 漢字4.bbtest.net.

Actual result

No array is returned; the call throws the "invalid characters" error quoted in the issue summary.

Is this fixed in the latest version?

CollectionAgency commented 1 year ago

Alright, never mind. It seems that converting the domain from UTF-8 to ISO-8859-1 is the issue. I used to import the data into Mongo, and Mongo was the one complaining about strange characters.

So once the mb_convert_encoding call is removed, it works.
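
For anyone landing here with the same error, the following is a minimal sketch of why the conversion breaks the host. It uses only the mbstring extension, not the library itself, and assumes mbstring's default substitute character is in effect and the file is saved as UTF-8. The variable name `$converted` is just for illustration.

```php
<?php
// The host exactly as used in the report (UTF-8 encoded source file assumed).
$domain = "漢字4.bbtest.net";

// ISO-8859-1 cannot represent CJK characters, so mbstring replaces each
// unrepresentable character with its substitute character ("?" by default).
$converted = mb_convert_encoding($domain, 'ISO-8859-1', 'UTF-8');

// This prints the same mangled host that appears in the error message.
echo $converted, PHP_EOL;
```

With the default settings this should print ??4.bbtest.net, which is the host named in the exception. The parser never sees the original characters, so passing the untouched UTF-8 string to parseDomain() avoids the problem.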