[BUG] - Parsing error on URLs ending in ca.com (e.g: geteduca.com)

john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).

BSD 3-Clause "New" or "Revised" License


[BUG] - Parsing error on URLs ending in ca.com (e.g: geteduca.com) #295

Closed mdolr closed 1 year ago

mdolr commented 1 year ago

Hello,

My scraper encountered a bug with what seemed like a normal URL.

Actual behavior

URL: https://www.geteduca.com failed to be parsed by tldextract.extract(url), with the result being: ExtractResult(subdomain='', domain='', suffix='').

Expected behavior

I would expect to receive ExtractResult(subdomain='www', domain='geteduca', suffix='com')

Thank you for looking into it, I'll try to submit a fix if I have the time 😄

elliotwutingfeng commented 1 year ago

I'm getting the correct results on CPython 3.10.11 and PyPy 3.9.16.

import tldextract; tldextract.extract("https://www.geteduca.com")

Can you let us know your Python version and verify if you are using tldextract >=3.4.4?
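For reference, one way to gather both pieces of information is the sketch below. It uses only the standard library; the helper name environment_report is hypothetical and not part of tldextract, and it reports None if the package is not installed in the active environment.

```python
import sys
from importlib import metadata


def environment_report(package: str = "tldextract") -> dict:
    """Collect the interpreter and installed package version (hypothetical helper)."""
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        package_version = None  # package is not installed in this environment
    return {
        "implementation": sys.implementation.name,  # e.g. "cpython" or "pypy"
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        package: package_version,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running this in the same virtual environment as the scraper matters, since a different interpreter may have a different tldextract version installed.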

mdolr commented 1 year ago

Hey, sorry, I've updated my packages and Python and can no longer reproduce it. I don't know what happened :/