meyt / linkpreview

Get link preview in python
MIT License

Fetching domain name #5

Closed: asmaier closed 3 years ago

asmaier commented 3 years ago

At the moment, linkpreview returns a preview object with the title, description, and image. I suggest also returning the real domain name. This is useful information, especially when previewing short URLs.

I have found a workaround for now. The disadvantage is that it makes two requests to the URL to get all the information:

import re
from urllib.parse import urlparse

import requests
from linkpreview import Link, LinkPreview, LinkGrabber

url = 'http://g.co/blob-opera'  # placeholder: any short URL
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}

# First request: follow redirects to find the final URL and its domain.
r = requests.get(url, headers=headers)
uri = urlparse(r.url)
domain = re.sub(r'^www\.', '', uri.netloc)

# Second request: fetch the content again to build the preview.
grabber = LinkGrabber()
link = Link(url, grabber.get_content(url, headers=headers))
preview = LinkPreview(link)

It would be much nicer if the preview object held the domain directly, e.g. in a field preview.domain.

meyt commented 3 years ago

@asmaier You can access the domain through preview.link.netloc, so it is not necessary to add a new property to the LinkPreview object. As for short URLs, you need to extend LinkGrabber and pass the final redirected URL to LinkPreview; there is no need for an extra request.

Fetching a URL has more scenarios to handle. LinkPreview focuses on parsing the results, and the grabber is just a helper for common use cases. (I have to mention this in the README 😄)

You may need something like this for now:

import requests
from linkpreview import Link, LinkPreview

url = 'http://g.co/blob-opera'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
# A single request: requests follows redirects, so req.url is the final URL.
req = requests.get(url, headers=headers)
preview = LinkPreview(Link(req.url, req.text))

print(preview.link.netloc)  # output: artsandculture.google.com
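
If you also want the leading "www." stripped, as in the original workaround, something like the following might do. It is only a sketch using the standard library: the helper name bare_domain is hypothetical and not part of linkpreview, and it assumes you pass it the final URL after redirects.

```python
from urllib.parse import urlparse


def bare_domain(final_url: str) -> str:
    """Return the host of final_url without a leading 'www.'.

    Hypothetical helper, not part of linkpreview. The port is dropped and
    the host is lowercased, because urlparse(...).hostname does both.
    """
    host = urlparse(final_url).hostname or ""
    return host[4:] if host.startswith("www.") else host
```

With the snippet above, calling bare_domain(req.url) should give the domain of the redirected page without a second request.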