jeremykendall / php-domain-parser

Public Suffix List based domain parsing implemented in PHP
MIT License
1.16k stars 128 forks source link

Incorrectly parses long suffixes with missing levels #321

Closed ash-m closed 2 years ago

ash-m commented 2 years ago

Issue summary

There are a few long suffixes defined by publicsuffix.org which do not list smaller suffixes. Specifically:

As an example, users.scale.virtualcloud.com.br but virtualcloud.com.br is not. So the suffix for it should be parsed as just com.br.

It looks like to me (per my brief look at the cached data) that the data is parsed as a simple label map where each array contains a nested array of labels at a deeper level.

If things should work the way I expect, then I think the way to approach this is temporarily hold a flat array of values from PSL, and for each suffix look for it's parent level suffix via shifting off each label; the resulting array for the above case would look something like:

[ 'com' => [ 'br' => [ 'users.scale.virtualcloud' ]]]

Although that might complicate the actual comparison logic, so it might be better to map the entire suffix rather than just the labels which would result in something like this:

[ 'com' => [ 'com.br' => [ 'users.scale.virtualcloud.com.br' ]]]

System informations

Information Description
Pdp version 6.1
PHP version 8.1
OS Platform Alpine

Standalone code, or other way to reproduce the problem

(Please complete the text below to help us fix the issue)

Expected result

Actual result

nyamsprod commented 2 years ago

Hi @ash-m thanks for using the library. Since I did not see any code snippet that would have helped me understand which piece of code you are using I am going to guess that you are using Rules:resolve. As explain in the documentation

The resolve method will always return a ResolvedDomain even if the domain syntax is invalid or if there is no match found in the resource data. To work around this limitation, the library exposes more strict methods, namely:

Rules::getCookieDomain Rules::getICANNDomain Rules::getPrivateDomain

So running the following snippet should make you understand what is going on:

$domain = 'users.scale.virtualcloud.com.br';
foreach (['resolve', 'getCookieDomain', 'getICANNDomain', 'getPrivateDomain'] as $method) {
    try {
        echo "Resolution of the domain name `$domain` by `Rules::$method`", PHP_EOL;
       /** @var Pdp\Rules publicSuffixList */
        $result = $publicSuffixList->$method($domain);
        var_export([
            'domain' => $result->value(),
            'registrableDomain' => $result->registrableDomain()->value(),
            'subDomain' => $result->subDomain()->value(),
            'publicSuffix' => $result->suffix()->value(),
            'isKnown' => $result->suffix()->isKnown(),
            'isICANN' => $result->suffix()->isICANN(),
            'isPrivate' => $result->suffix()->isPrivate(),
        ]);
    } catch (Throwable $exception) {
        echo "Unable to resolve `$domain` using `Rules::$method`: {$exception->getMessage()}";
    }
    echo PHP_EOL;
}

The same should be true for all the other domains in your list.

ash-m commented 2 years ago

Hi @nyamsprod sorry about the missing snippets, I'll fix that later.

I presume you're suggesting to use the ICANN method which parses that domain correctly, but I think the resolve method isn't working correctly.

Here's the output for node857-gelofesta.users.scale.virtualcloud.com.br:

array (
  'domain' => 'node857-gelofesta.users.scale.virtualcloud.com.br',
  'registrableDomain' => 'node857-gelofesta.users.scale.virtualcloud.com.br',
  'subDomain' => NULL,
  'publicSuffix' => 'users.scale.virtualcloud.com.br',
  'isKnown' => true,
  'isICANN' => false,
  'isPrivate' => true,
)

users.scale.virtualcloud.com.br is listed in the PSL.

Here is the output for clientportal.virtualcloud.com.br:

array (
  'domain' => 'clientportal.virtualcloud.com.br',
  'registrableDomain' => 'clientportal.virtualcloud.com.br',
  'subDomain' => NULL,
  'publicSuffix' => 'virtualcloud.com.br',
  'isKnown' => true,
  'isICANN' => false,
  'isPrivate' => true,
)

However virtualcloud.com.br is not listed in the PSL. Note that ICANN is the only method that parses this correctly which presents a problem if you were parsing both of these domains unless you're only interested in ICANN.

I don't know if this is intended behavior based on what typically happens when a suffix is not listed (it just returns the last label).

nyamsprod commented 2 years ago

@ash-m it is intended behaviour. that is why the package exposes sereval methods. the Rules::resolve method is built against a standard test suite (https://github.com/publicsuffix/list/blob/master/tests/test_psl.txt) which does not allow failure so the resulting "error" is expected by it. To be more strict or to know the correct error I would suggest the use of one of the get* method.

ash-m commented 2 years ago

Given the desired suffix output for: node857-gelofesta.users.scale.virtualcloud.com.br => users.scale.virtualcloud.com.br (not ICANN) and clientportal.virtualcloud.com.br => com.br (ICANN only)

then how would you design an adequate test for this?

Both desired suffixes are in the PSL.

nyamsprod commented 2 years ago

I may have found a way to fix the issue but I will probably not release it before next week because of time constraint.

ash-m commented 2 years ago

Neat! If you don't get around to it before I get some free time I'll submit a PR.

ash-m commented 2 years ago

I was writing this before I came across your library to use PSL. I don't know if it'll be helpful:

<?php

$data = <<<'EOF'
br
com.br
users.scale.virtualcloud.com.br
EOF;

final class SuffixList
{
    private array $index = [];

    public function __construct(public readonly array $items) {
        $this->parse();
    }

    private function parse(): void
    {
        foreach ($this->items as $item) {
            $this->indexItem($item);
        }
    }

    private function indexItem(string $item): void
    {
            $parents[] = $parent = $item;
            while ($parent = $this->getParent($parent)) {
                $parents[] = $parent;
            }

            $this->index = $this->mapParents($parents, $this->index);
    }

    private function mapParents(array $parents, array $index): array
    {
        $parent = array_pop($parents);
        if (!empty($parents)) {
            if (isset($index[$parent])) {
                $node = $index[$parent];
            } else {
                $node = [];
            }
            $index[$parent] = $this->mapParents($parents, $node);
        } else {
            $index[$parent] = [];
        }

        return $index;
    }

    private function getParent(string $item): ?string
    {
        $item .= '.';
        while (!empty($item)) {
            $item = substr($item, strpos($item, '.') + 1);
            $parent = rtrim($item, '.');
            if (in_array($parent, $this->items)) {
                return $parent;
            }
        }

        return null;
    }

    public function resolve(Domain $domain): string
    {
        $node = $this->index;
        foreach ($domain->getSuffixes() as $suffix) {
            if (isset($node[$suffix])) {
                $listedSuffix = $suffix;
                $node = $node[$suffix];
            }
        }

        return $listedSuffix;
    }
}

final class Domain
{
    private array $labels;

    public function __construct(public readonly string $value) 
    {
        $this->labels = explode('.', $value);
    }

    public function getSuffixes(): Generator
    {
        $labels = $this->labels;
        while ($label = array_pop($labels)) {
            $suffix = isset($suffix) ? "$label.$suffix" : $label;
            yield $suffix;
        }    
    }

    public function __toString(): string
    {
        return $this->value;
    }
}

// $list = new SuffixList(file($file, FILE_IGNORE_NEW_LINES|FILE_SKIP_EMPTY_LINES);
$list = new SuffixList(explode("\n", $data));

$domains = [
    'node857-gelofesta.users.scale.virtualcloud.com.br',
    'clientportal.virtualcloud.com.br'
];

foreach ($domains as $domain) {
    $domain = new Domain($domain);
    $result = $list->resolve($domain);
    echo "The suffix for '$domain' is '$result'.\n";
}

The basic idea is store the entire PSL in an array, then loop each item and find it's parent recursively. Then check against a domain using a Generator.

nyamsprod commented 2 years ago

The fix is in the PR maybe you can try it by using the branch but as I said I won't be releasing it today. Probably monday.

IF you reuse the script I share with the current PR Rules::resolve and the other Rules::get* methods behave as expected and your issue is fixed 😉

nyamsprod commented 2 years ago

version 6.1.1 is released with the fix.

blaaat commented 1 year ago

HI @nyamsprod, I still have issues with domain tim.it, empty suffix. Seems to related to this issue

nyamsprod commented 1 year ago

@blaaat could you open a new issue with the corresponding script that produces the issue ?