Open mweinelt opened 4 years ago
Just to provide a link, youtube is doing it https://www.youtube.com/watch?v=5PmHRSeA2c8
A bit off-topic: what do you guys think about adding some kind of hook to the web plugin, so that a third-party plugin could register a URL pattern and a function to be called when the pattern matches?
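To make the hook idea concrete, here is a minimal sketch of what such a registry could look like. Everything in it (`register_url_hook`, `dispatch_url`, the example pattern and handler) is hypothetical and not existing Limnoria API; it only illustrates the "pattern plus callback" shape.

```python
import re
from typing import Callable, List, Optional, Pattern, Tuple

# Hypothetical registry; none of these names exist in Limnoria.
Handler = Callable[[str], Optional[str]]
_url_hooks: List[Tuple[Pattern[str], Handler]] = []


def register_url_hook(pattern: str, handler: Handler) -> None:
    """Register a handler to be called for URLs matching the regex pattern."""
    _url_hooks.append((re.compile(pattern), handler))


def dispatch_url(url: str) -> Optional[str]:
    """Return the first non-None handler result, or None if no hook applies."""
    for pattern, handler in _url_hooks:
        if pattern.search(url):
            result = handler(url)
            if result is not None:
                return result
    return None


# Example: a third-party plugin claims YouTube watch URLs.
register_url_hook(
    r'^https?://(www\.)?youtube\.com/watch\?',
    lambda url: 'handled by youtube hook',
)
```

A `None` result from `dispatch_url` would mean "no plugin claimed this URL", so the default title snarfer could run as before.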
Note that supybot.protocols.http.peekSize will probably need to be increased from its default of 8192; in YouTube's case it definitely does.
I'd rather have a single parser that fetches both instead of parsing the document twice, but it's a reasonable way to do it, yeah. Feel free to send a PR :)
The problem with YouTube is that it includes <title>YouTube</title> early in the document, and <meta property="og:title" content="Real video title"> is encountered much later, so to make things less complicated (and less optimal, as you noted) I ended up with two ordered parsers.
It feels hacky, it's probably ok for personal bandaid use, but I feel more thought is needed for this to be included in Limnoria.
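For comparison, the single-parser approach suggested above could look roughly like the sketch below. It uses only the standard library's `html.parser`; the class and function names are made up for illustration, and preferring og:title over <title> whenever both are present is an assumption, not a decision made in this thread.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TitleParser(HTMLParser):
    """Collects both the <title> text and og:title in a single pass."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.in_title = False
        self.title_parts: List[str] = []
        self.og_title: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
        elif tag == 'meta':
            attributes = dict(attrs)
            # Keep only the first og:title encountered.
            if attributes.get('property') == 'og:title' and self.og_title is None:
                self.og_title = attributes.get('content')

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def best_title(html: str) -> Optional[str]:
    """Return og:title if present, else the <title> text, else None."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = ''.join(parser.title_parts).strip()
    return parser.og_title or title or None
```

Because both values are collected before choosing, the order in which <title> and the meta tag appear in the document does not matter, as long as enough of the page has been fetched to include both.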
ugh :(
Some months ago I read about microbrowsers and about changing the URL snarfer's user agent so that sites send easily parseable metadata: https://24ways.org/2019/microbrowsers-are-everywhere/
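As a rough illustration of the user agent idea, the sketch below builds a request with a custom User-Agent header using only the standard library. The user agent string is an invented placeholder (the thread does not name one), and `build_request` is a hypothetical helper, not part of Limnoria.

```python
import urllib.request

# Assumed example value; which string actually makes a given site
# serve simpler metadata would have to be found by experiment.
MICROBROWSER_UA = 'Mozilla/5.0 (compatible; ExampleBot/1.0; +https://example.org/bot)'


def build_request(url: str, user_agent: str = MICROBROWSER_UA) -> urllib.request.Request:
    """Build a request that announces itself with the given user agent."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})
```

The request can then be passed to `urllib.request.urlopen(build_request(url), timeout=10)` to fetch the page.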
Thelounge has implemented this since https://github.com/thelounge/thelounge/pull/3602, which appears to fix Amazon URLs for example.
Perhaps not so coincidentally, these are also exposed as <meta property="og:title" content="..."> tags.
Youtube now also removes og:title if it detects you're a bot and not logged in.
What I do right now to get a title is:
    try:
        content = utils.web.getUrl(url).decode('utf-8')
        pattern = re.compile(r'title\\x22:\\x7b\\x22runs\\x22:\\x5b\\x7b\\x22text\\x22:\\x22(.*?)\\x22\\x7d\\x5d\\x7d')
        match = pattern.search(content)
        if match:
            title = match.group(1)
            domain = utils.web.getDomain(url)
            s = format('Title: %s (at %s)', title, domain)
            irc.reply(s, prefixNick=False)
        else:
            irc.reply("ytInitialData not found")
    except Exception as e:
        irc.reply(f"Error: {e}")
The source is something like:
…\x7b\x22slimVideoMetadataSectionRenderer\x22:\x7b\x22contents\x22:\x5b\x7b\x22slimVideoInformationRenderer\x22:\x7b\x22title\x22:\x7b\x22runs\x22:\x5b\x7b\x22text\x22:\x22THIS-IS-THE-TITLE\x22\x7d\x5d\x7d…
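The same extraction can be tried outside the bot with a standalone sketch like the one below, run against a shortened version of the source excerpt above. The function name and the extra step of decoding leftover \xNN escapes in the captured title are additions for illustration; the decoding assumes each escape is a single ASCII character, which may not hold for every title.

```python
import re
from typing import Optional

# Matches the literal, backslash-escaped text from the page source,
# e.g. \x22text\x22:\x22...\x22 (backslash, 'x', two hex digits).
TITLE_RE = re.compile(
    r'title\\x22:\\x7b\\x22runs\\x22:\\x5b\\x7b\\x22text\\x22:\\x22(.*?)\\x22\\x7d\\x5d\\x7d'
)
HEX_ESCAPE_RE = re.compile(r'\\x([0-9a-fA-F]{2})')


def extract_title(content: str) -> Optional[str]:
    """Return the escaped title embedded in the page source, or None."""
    match = TITLE_RE.search(content)
    if match is None:
        return None
    # Assumption: escapes left inside the title are single ASCII characters.
    return HEX_ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), match.group(1))


# Shortened version of the excerpt quoted in this comment.
sample = (r'\x7b\x22slimVideoInformationRenderer\x22:\x7b\x22title\x22:'
          r'\x7b\x22runs\x22:\x5b\x7b\x22text\x22:\x22THIS-IS-THE-TITLE\x22\x7d\x5d\x7d')
```

Calling `extract_title(sample)` returns the placeholder title from the excerpt. Since this depends on the exact escaped layout of the page, it is likely to break whenever YouTube changes its markup.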
I'm not sure if there is any better solution. I don't get ytInitialData and ytInitialPlayerResponse says something like "login to make sure you aren't a bot".
Weird. Classic solution works fine for me, so far. Maybe because I don't bombard youtube with requests. I haven't changed default Limnoria useragent either.
Here's the classic patch against current Limnoria, retrieves video length along with title. https://gist.github.com/allixx/d42fdae04d59db38768c6dc2e9e4b68d
I have now seen other bots running into this problem. I have had this issue for a couple of weeks, and now I've seen that the Limnoria bot used in the Arch channel is running into the same problem.
Here is the page source; I just had the bot write it to a log: ythtml.log
What's the useragent value? I vaguely remember that providing a desktop useragent string was not really helpful in YouTube's case.
I don't know how customized the Arch channel's Limnoria is; I think stock Limnoria does not handle og:title.
Some sites lack a <title> tag in their plain HTML and provide OpenGraph metadata instead. The URL plugin does not currently work for those.
The OpenGraph title attribute looks like so: <meta property="og:title" content="...">
It could be used when the <title> tag is missing.
I'm currently lacking an example site for this behaviour; if I come across one, I'll add it.
Anyway this first got noticed when twitter posts wouldn't get looked up any longer, but they went one step further and are even loading the og:title attribute via JS. Sad.