john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License

Get the port from the provided URL to extract function #272

Closed cgr71ii closed 1 year ago

cgr71ii commented 1 year ago

Hi!

I was wondering whether it is possible to get the port from a URL when the extract function is invoked (or via some other function). I suspect it isn't, since I didn't see it in the documentation, and after digging a little into the code I didn't find anything related. I'm using this library to obtain URLs from a large list and then crawl them, so I need the port whenever one is defined. If it isn't currently possible to obtain the port, is there any intention to implement this functionality? For example:

>>> tldextract.extract('http://127.0.0.1:8080/deployed/')
ExtractResult(subdomain='', domain='127.0.0.1', suffix='', port='8080')

Thank you!

john-kurkowski commented 1 year ago

I took a stab at this in #273. I'm not sold on the solution as is. Feel free to chime in there. In the meantime, I suggest parsing the port with the standard library. Example:

import urllib.parse

import tldextract

# Parse the URL once with the standard library, then extract the domain parts
split_url = urllib.parse.urlsplit("https://foo.bar.com:8080")
split_suffix = tldextract.extract(split_url.netloc)
url_to_crawl = f"{split_url.scheme}://{split_suffix.registered_domain}:{split_url.port}"
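One caveat with the above: `urlsplit(...).port` is `None` when the URL carries no explicit port, so the f-string would produce a literal `:None`. A minimal stdlib-only sketch of handling that case, using a hypothetical helper (not part of tldextract) that falls back to the scheme's default port and uses `hostname` in place of `registered_domain` so it stays self-contained:

```python
import urllib.parse

# Assumed default ports for the schemes we care about
DEFAULT_PORTS = {"http": 80, "https": 443}

def base_url_with_port(url: str) -> str:
    """Rebuild a crawlable base URL, always including a port."""
    split_url = urllib.parse.urlsplit(url)
    # urlsplit().port is an int, or None when the URL has no explicit port
    port = split_url.port or DEFAULT_PORTS.get(split_url.scheme)
    return f"{split_url.scheme}://{split_url.hostname}:{port}"

print(base_url_with_port("https://foo.bar.com:8080"))  # https://foo.bar.com:8080
print(base_url_with_port("https://foo.bar.com"))       # https://foo.bar.com:443
```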
john-kurkowski commented 1 year ago

As of #274, the above workaround can be tweaked slightly to avoid parsing the string twice:

split_url = urllib.parse.urlsplit("https://foo.bar.com:8080")
- split_suffix = tldextract.extract(split_url.netloc)
+ split_suffix = tldextract.extract_urllib(split_url)
url_to_crawl = f"{split_url.scheme}://{split_suffix.registered_domain}:{split_url.port}"
john-kurkowski commented 1 year ago

After thinking about it, this library is focused on domain names, not every component of a URL. I defer URL parsing to Python's standard library. I hope the workaround in the previous comment helps!