layershifter / TLDExtract

[DEPRECATED] Library for extraction of domain parts e.g. TLD. Domain parser that uses Public Suffix List
Apache License 2.0

TLDExtract not properly parsing hostname #47

Open leem32 opened 5 years ago

leem32 commented 5 years ago

I'm running some domain names through TLDExtract and came across a domain not being properly parsed.

The domain in question is blogspot.com:

$url = 'blogspot.com';
$domain = tld_extract($url);
var_dump($domain);

This returns:
object(LayerShifter\TLDExtract\Result)[9]
  private 'subdomain' => null
  private 'hostname' => string 'blogspot.com' (length=12)
  private 'suffix' => null

Weirdly, the URL 'flogspot.com' works fine and returns:

object(LayerShifter\TLDExtract\Result)[9]
  private 'subdomain' => null
  private 'hostname' => string 'flogspot' (length=8)
  private 'suffix' => string 'com' (length=3)

The URL logspot.com also works and returns:

object(LayerShifter\TLDExtract\Result)[9]
  private 'subdomain' => null
  private 'hostname' => string 'logspot' (length=7)
  private 'suffix' => string 'com' (length=3)

Any idea why the TLD in 'blogspot.com' is not being added to the suffix? Is this a bug?

leem32 commented 5 years ago

I see that blogspot.com is listed in public_suffix_list.dat. What's going on here? Can't TLDExtract parse any of the entries in that list? Are there any workarounds?

https://github.com/publicsuffix/list/blob/6f2b9e75eaf65bb75da83677655a59110088ebc5/public_suffix_list.dat#L5884
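
For context, because blogspot.com appears in the Public Suffix List as a suffix in its own right, the whole input matches a suffix rule and no label is left over to serve as the registrable name. The sketch below illustrates that longest-match behaviour in plain PHP. It is not TLDExtract's implementation: the `$rules` array is a hand-picked, hypothetical two-entry subset of the list, and `split_host()` is a made-up helper name. How the library reports this case in its Result object (here: the full string as hostname and a null suffix) may well differ from what the sketch returns.

```php
<?php
// Hypothetical two-entry subset of the Public Suffix List (not the real list).
$rules = ['com', 'blogspot.com'];

/**
 * Split a host into a registrable label and a suffix by longest-suffix match.
 * Illustrative helper only; the name and return shape are assumptions.
 */
function split_host(string $host, array $rules): array
{
    $labels = explode('.', strtolower($host));
    $count = count($labels);

    // Try the longest candidate first (the whole host), then drop leading labels.
    for ($i = 0; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $rules, true)) {
            // The label immediately left of the suffix, if any, is the registrable name.
            $name = $i > 0 ? $labels[$i - 1] : null;
            return ['name' => $name, 'suffix' => $candidate];
        }
    }

    // No rule matched at all.
    return ['name' => null, 'suffix' => null];
}

// 'blogspot.com' is consumed entirely by the suffix rule, so no name remains.
echo json_encode(split_host('blogspot.com', $rules)), "\n";
// 'flogspot.com' only matches 'com', leaving 'flogspot' as the name.
echo json_encode(split_host('flogspot.com', $rules)), "\n";
// A label in front of the private suffix becomes the name.
echo json_encode(split_host('foo.blogspot.com', $rules)), "\n";
```

Under these assumptions, only a host with at least one label in front of blogspot.com (such as foo.blogspot.com) yields a non-empty name, which would explain why flogspot.com and logspot.com parse as expected while blogspot.com does not.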