progval / Limnoria

A robust, full-featured, and user/programmer-friendly Python IRC bot, with many existing plugins.
https://docs.limnoria.net/

OpenGraph / og:title in Web plugin #1421

Open mweinelt opened 4 years ago

mweinelt commented 4 years ago

Some sites lack a title tag in their plain HTML and instead provide Open Graph metadata. The Web plugin does not currently work for those.

The opengraph title attribute looks like so:

<meta property="og:title" content="Mozilla Developer Network">

It could be used when the title tag is missing.

I don't currently have an example site for this behaviour; if I come across one, I'll add it.

Anyway, I first noticed this when Twitter posts stopped being looked up, but they went one step further and even load the og:title attribute via JS. Sad.

lodriguez commented 4 years ago

Just to provide a link, YouTube is doing it: https://www.youtube.com/watch?v=5PmHRSeA2c8

A bit off-topic: what do you guys think about adding some kind of hook to the Web plugin, so that a third-party plugin could register a URL pattern and a function that gets called if the pattern matches?
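
To make the hook idea concrete, here is a minimal sketch of what such a registry could look like. Every name in it (`register_url_handler`, `dispatch`, `youtube_handler`) is hypothetical; nothing like this exists in Limnoria's Web plugin as far as this thread shows.

```python
import re
from typing import Callable, List, Optional, Pattern, Tuple

# Hypothetical registry: none of these names exist in Limnoria's Web plugin.
UrlHandler = Callable[[str], Optional[str]]
_handlers: List[Tuple[Pattern[str], UrlHandler]] = []


def register_url_handler(pattern: str, handler: UrlHandler) -> None:
    """Let a third-party plugin claim URLs matching `pattern`."""
    _handlers.append((re.compile(pattern), handler))


def dispatch(url: str) -> Optional[str]:
    """Return the first non-empty handler result, or None so the caller
    can fall back to the regular title lookup."""
    for pattern, handler in _handlers:
        if pattern.search(url):
            result = handler(url)
            if result:
                return result
    return None


# Example: a made-up handler for YouTube watch URLs.
def youtube_handler(url: str) -> Optional[str]:
    return "handled by the YouTube hook: " + url


register_url_handler(r"^https?://(www\.)?youtube\.com/watch", youtube_handler)
```

Returning `None` from `dispatch` leaves the default behaviour untouched for every URL no plugin has claimed, which keeps the hook opt-in.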

allixx commented 4 years ago

A band-aid solution to have something to start with (sorry for the probably outdated Limnoria version):

```diff
--- plugins/Web/plugin.py.orig
+++ plugins/Web/plugin.py
@@ -33,6 +33,8 @@
 import string
 import socket
 
+from html.parser import HTMLParser
+
 import supybot.conf as conf
 import supybot.utils as utils
 from supybot.commands import *
@@ -81,6 +83,27 @@
         if self.inHtmlTitle:
             super(Title, self).append(data)
 
+class TitleMeta(HTMLParser):
+    entitydefs = entitydefs.copy()
+    entitydefs['nbsp'] = ' '
+    entitydefs['apos'] = '\''
+
+    def __init__(self):
+        self.data = []
+        super(TitleMeta, self).__init__()
+
+    def handle_starttag(self, tag, attrs):
+        if tag == 'meta':
+            has_title = False
+
+            for attrname, attrvalue in attrs:
+                if attrname == 'property' and attrvalue == 'og:title':
+                    has_title = True
+                elif attrname == 'content':
+                    if has_title:
+                        self.data.append(attrvalue)
+                        break
+
 class DelayedIrc:
     def __init__(self, irc):
         self._irc = irc
@@ -163,19 +186,24 @@
                              'installing python-charade.)'), Raise=True)
             else:
                 return None
-        try:
-            parser = Title()
-            parser.feed(text)
-        except UnicodeDecodeError:
-            # Workaround for Python 2
-            # https://github.com/ProgVal/Limnoria/issues/1359
-            parser = Title()
-            parser.feed(text.encode('utf8'))
-        parser.close()
-        title = utils.str.normalizeWhitespace(''.join(parser.data).strip())
-        if title:
-            return (target, title)
-        elif raiseErrors:
+
+        for p in [TitleMeta, Title]:
+            try:
+                parser = p()
+                parser.feed(text)
+            except UnicodeDecodeError:
+                # Workaround for Python 2
+                # https://github.com/ProgVal/Limnoria/issues/1359
+                parser = p()
+                parser.feed(text.encode('utf8'))
+            parser.close()
+
+            title = utils.str.normalizeWhitespace(''.join(parser.data).strip())
+
+            if title:
+                return (target, title)
+
+        if raiseErrors:
             if len(text) < size:
                 irc.error(_('That URL appears to have no HTML title.'),
                           Raise=True)
```

Note that supybot.protocols.http.peekSize will probably need to be increased from its default of 8192; in YouTube's case it definitely does.

progval commented 4 years ago

I'd rather have a single parser that fetches both instead of parsing the document twice, but it's a reasonable way to do it, yeah. Feel free to send a PR :)
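
For reference, a single-pass variant could look roughly like the sketch below. It is standalone and uses only the standard library; the class name `TitleAndOgTitle` and the helper `extract_title` are made up for illustration and are not part of Limnoria. It prefers og:title whenever one was seen, regardless of where the `<title>` tag appeared in the document.

```python
from html.parser import HTMLParser


class TitleAndOgTitle(HTMLParser):
    """Collects both the <title> text and og:title content in one pass."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False
        self.title_parts = []
        self.og_title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and self.og_title is None:
            # A dict makes the lookup independent of attribute order.
            attrs = dict(attrs)
            if attrs.get("property") == "og:title" and attrs.get("content"):
                self.og_title = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)

    def best_title(self):
        """Prefer og:title; fall back to the whitespace-normalized <title>."""
        title = " ".join("".join(self.title_parts).split())
        return self.og_title or title or None


def extract_title(html):
    parser = TitleAndOgTitle()
    parser.feed(html)
    parser.close()
    return parser.best_title()


print(extract_title(
    '<title>YouTube</title>'
    '<meta property="og:title" content="Real video title">'
))
```

Because the decision is deferred to `best_title()`, the order in which the two tags appear does not matter, at the cost of always reading the whole downloaded chunk.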

allixx commented 4 years ago

The problem with YouTube is that it includes <title>YouTube</title> early in the document, and <meta property="og:title" content="Real video title"> is encountered much later, so to make things less complicated (and less optimal, as you noted) I ended up with two ordered parsers.

It feels hacky, it's probably ok for personal bandaid use, but I feel more thought is needed for this to be included in Limnoria.

progval commented 4 years ago

ugh :(

jlu5 commented 3 years ago

Some months ago I read about microbrowsers and about changing the URL snarfer's user agent so that sites send easily parseable metadata: https://24ways.org/2019/microbrowsers-are-everywhere/

The Lounge has implemented this since https://github.com/thelounge/thelounge/pull/3602, which appears to fix Amazon URLs, for example.

Perhaps not so coincidentally, these are also exposed as <meta property="og:title" content="..."> tags.
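
As a rough illustration of the user-agent idea, the sketch below sends a custom User-Agent using only the standard library. The user agent string is a placeholder, not the one The Lounge or Limnoria uses, and `build_preview_request` and `fetch_head` are hypothetical helper names.

```python
import urllib.request

# Placeholder value: substitute whatever user agent string you want to test.
# It is NOT the string used by The Lounge or by Limnoria.
PREVIEW_USER_AGENT = "ExampleLinkPreviewBot/1.0"


def build_preview_request(url, user_agent=PREVIEW_USER_AGENT):
    """Build (but do not send) a GET request carrying a custom User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_head(url, size=65536, timeout=10.0):
    """Download at most `size` bytes of the page with the custom User-Agent."""
    request = build_preview_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read(size)
```

Whether a given site actually serves simpler markup to a particular user agent has to be checked per site; this only shows where the header would be set.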

lodriguez commented 2 months ago

Youtube now also removes og:title if it detects you're a bot and not logged in.

What I do right now to get a title is:

    # Fragment from a plugin command: `irc` and `url` come from the
    # surrounding method, and `re` must be imported.
    try:
        content = utils.web.getUrl(url).decode('utf-8')
        pattern = re.compile(r'title\\x22:\\x7b\\x22runs\\x22:\\x5b\\x7b\\x22text\\x22:\\x22(.*?)\\x22\\x7d\\x5d\\x7d')
        match = pattern.search(content)
        if match:
            title = match.group(1)
            domain = utils.web.getDomain(url)
            s = format('Title: %s (at %s)', title, domain)
            irc.reply(s, prefixNick=False)
        else:
            irc.reply("ytInitialData not found")
    except Exception as e:
        irc.reply(f"Error: {e}")

The source is something like:

…\x7b\x22slimVideoMetadataSectionRenderer\x22:\x7b\x22contents\x22:\x5b\x7b\x22slimVideoInformationRenderer\x22:\x7b\x22title\x22:\x7b\x22runs\x22:\x5b\x7b\x22text\x22:\x22THIS-IS-THE-TITLE\x22\x7d\x5d\x7d…
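
One thing the regex alone does not handle is that the captured title may itself contain `\xNN` escapes (for example `\x26` for `&`). The standalone sketch below applies the same pattern to a made-up sample string and decodes such escapes afterwards. The helper names are hypothetical, and it assumes each escape stands for a single ASCII character; how the page encodes non-ASCII titles has not been checked here.

```python
import re

# Same pattern as in the snippet above; it matches the literal "\x22"-style
# escapes as they appear in the page source.
TITLE_RE = re.compile(
    r'title\\x22:\\x7b\\x22runs\\x22:\\x5b\\x7b\\x22text\\x22:\\x22'
    r'(.*?)\\x22\\x7d\\x5d\\x7d'
)
HEX_ESCAPE_RE = re.compile(r'\\x([0-9a-fA-F]{2})')


def decode_hex_escapes(text):
    """Replace each literal backslash-x-NN sequence with its character.

    Assumption: every escape is a single ASCII character.
    """
    return HEX_ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


def extract_escaped_title(page):
    """Return the decoded title, or None if the pattern is not found."""
    match = TITLE_RE.search(page)
    return decode_hex_escapes(match.group(1)) if match else None


# Made-up sample in the same shape as the source excerpt above.
sample = (
    r'\x22title\x22:\x7b\x22runs\x22:\x5b\x7b\x22text\x22:\x22'
    r'Tom \x26 Jerry\x22\x7d\x5d\x7d'
)
print(extract_escaped_title(sample))  # prints: Tom & Jerry
```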

I'm not sure if there is any better solution. I don't get ytInitialData and ytInitialPlayerResponse says something like "login to make sure you aren't a bot".

allixx commented 2 months ago

Weird. The classic solution works fine for me so far, maybe because I don't bombard YouTube with requests. I haven't changed the default Limnoria useragent either.

Here's the classic patch against current Limnoria; it retrieves the video length along with the title. https://gist.github.com/allixx/d42fdae04d59db38768c6dc2e9e4b68d

lodriguez commented 2 months ago

By now, I have seen other bots running into this problem. I have had this issue for a couple of weeks, and the Limnoria bot used in the Arch channel is running into the same problem. The title is just "Youtube", the only og tag set is og:url, and the variables are empty.

Here is the website; I just let the bot post it to the log: https://github.com/user-attachments/files/16325049/ythtml.log

allixx commented 2 months ago

What's the useragent value? I vaguely remember that providing a desktop useragent string was not really helpful in YouTube's case.

I don't know how customized the Arch channel's Limnoria is; I think stock Limnoria does not handle og:title. <title>Youtube</title> is pretty normal for a YouTube page; og:title follows much later in the HTML document (the max download size limit must be increased to reach the actual og:title value).