matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Youtube captions (link previews) are useless #9733

Closed eras closed 3 years ago

eras commented 3 years ago

Description

At some point YouTube updated its site, and now all (?) captions that Synapse generates for it are:

Before you continue to YouTube Sign in a Google company Before you continue to YouTube Google uses cookies and data to: Deliver and maintain services, like tracking outages and protecting against spam, fraud, and abuse Measure audience engagement and site statistics to understand how our services are used

This is basically useless given the primary purpose of the feature, especially for such a popular website.

Steps to reproduce

Expected results:

Authentic recordings from inside Hetzner Online's data center park Just like birds and insects, each server sings its own unique song.

Version information

ShadowJonathan commented 3 years ago

(FTR: This is about link previews)

This is not necessarily a problem with Synapse. Synapse is doing its job perfectly by previewing the URL exactly as fetched. Because matrix.org's server is located within the EU, Google has a tendency (heh) to present users with the cookie page before letting them access any part of the site, by law.

ShadowJonathan commented 3 years ago

https://meta.discourse.org/t/youtube-embeddings-have-stopped-working-for-servers-in-europe/185128

:thinking:

eras commented 3 years ago

I agree that it's not particularly a bug in Synapse; however the only parties able to resolve this issue are Google and Synapse (or the 3rd party component it's using), and I have my doubts about Google doing anything about it :).

IIRC Slack, for example, doesn't have this issue, so it's resolvable, even if only with special handling.

eras commented 3 years ago

For one plausible solution consider the following session:

% curl -s -A Mozilla -I https://www.youtube.com/watch?v=RzJf02TIqxk | grep -e '^HTTP' -e '^location'
HTTP/2 302 
location: https://consent.youtube.com/m?continue=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DRzJf02TIqxk&gl=FI&m=0&pc=yt&uxe=23983172&hl=fi&src=1

% curl -s -I https://www.youtube.com/watch?v=RzJf02TIqxk | grep -e '^HTTP' -e '^location'        
HTTP/2 200 
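
The two responses above differ only in the redirect, so a fetcher could in principle recognise the consent interstitial from the status code and the location header before parsing any page content. A minimal sketch, assuming the interstitial is always served from a host whose first label is `consent`; the function name is hypothetical and this is not something Synapse is known to do:

```python
from typing import Optional
from urllib.parse import urlsplit

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def is_consent_redirect(status: int, location: Optional[str]) -> bool:
    """Return True if the response redirects to a consent interstitial.

    Assumption: the interstitial lives on a host whose first label is
    "consent", as in the session above (consent.youtube.com).
    """
    if status not in REDIRECT_STATUSES or not location:
        return False
    host = urlsplit(location).hostname or ""
    return host.split(".")[0] == "consent"
```

A check like this only detects the problem; it still needs a fallback for what to preview instead.
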
richvdh commented 3 years ago

ohh bother. we had this with twitter (https://github.com/matrix-org/synapse/issues/7643).

It looks like we should do the same trick as we did with them (hardcode a mapping to the oembed api):

$ curl -A Mozilla 'https://www.youtube.com/oembed?url=https%3A//www.youtube.com/watch%3Fv%3DRzJf02TIqxk&format=json' 
{"title":"PURE RELAXATION - SERVER SOUNDS","author_name":"Hetzner","author_url":"https://www.youtube.com/c/HetznerOnline","type":"video","height":113,"width":200,"version":"1.0","provider_name":"YouTube","provider_url":"https://www.youtube.com/","thumbnail_height":360,"thumbnail_width":480,"thumbnail_url":"https://i.ytimg.com/vi/RzJf02TIqxk/hqdefault.jpg","html":"\u003ciframe width=\u0022200\u0022 height=\u0022113\u0022 src=\u0022https://www.youtube.com/embed/RzJf02TIqxk?feature=oembed\u0022 frameborder=\u00220\u0022 allow=\u0022accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\u0022 allowfullscreen\u003e\u003c/iframe\u003e"}
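
For illustration, the same lookup could be sketched in Python roughly as follows. The endpoint and the field names come from the curl output above; the function names are hypothetical, and no network access is involved, the sketch only builds the lookup URL and picks fields out of a response body:

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the curl command above.
YOUTUBE_OEMBED_ENDPOINT = "https://www.youtube.com/oembed"


def build_oembed_url(page_url: str) -> str:
    """Return the oEmbed lookup URL for a YouTube page URL."""
    query = urlencode({"url": page_url, "format": "json"})
    return YOUTUBE_OEMBED_ENDPOINT + "?" + query


def summarise_oembed(body: str) -> dict:
    """Pick out the fields of an oEmbed JSON body that a preview would show."""
    data = json.loads(body)
    return {
        "title": data.get("title"),
        "author": data.get("author_name"),
        "thumbnail": data.get("thumbnail_url"),
    }
```
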
alturiak commented 3 years ago

I guess this will affect an increasing number of (less high-profile) sites as well, such as https://www.golem.de (a German news portal). Hardcoding exceptions for YouTube is certainly warranted, but in the long run it might be nice to be able to specify custom hooks in Synapse's configuration, although I'm not sure if that's really worth the effort.

clokep commented 3 years ago

it might be nice to be able to specify custom hooks in synapse's configuration, although I'm not sure if that's really worth the effort.

This shouldn't be too hard, it would also be nice to default to using the documented providers (https://oembed.com/providers.json).

ShadowJonathan commented 3 years ago

This shouldn't be too hard, it would also be nice to default to using the documented providers (https://oembed.com/providers.json).

Oooo, thanks for mentioning that, shouldn't that just be preloaded and used directly when URL previews are enabled?

clokep commented 3 years ago

This shouldn't be too hard, it would also be nice to default to using the documented providers (oembed.com/providers.json).

Oooo, thanks for mentioning that, shouldn't that just be preloaded and used directly when URL previews are enabled?

It should probably be tried. I don't know if it will regress other previews. 🤷

licentiapoetica commented 3 years ago

also on Hetzner, experiencing the same issue

ItsCinnabar commented 3 years ago

If anyone wants a temporary client-side fix for themselves, I made this Tampermonkey script: https://gist.github.com/ItsCinnabar/ebcfe4f6b3ea7d224a8e1ef0783edeb2

Just edit the match url to your site and load it into tampermonkey/greasemonkey/etc

licentiapoetica commented 3 years ago

I found a way to get it working again: change your user agent to curl's. In https://github.com/matrix-org/synapse/blob/5a153772c197a689df6c087e49d7bd8beee5dbdd/synapse/http/client.py#L321, replace the user agent with something like this: self.user_agent = "curl/7.59.0"

Now YouTube previews are working again.

alturiak commented 3 years ago

I found a way how to get it working again, you need to change your user agent to curl https://github.com/matrix-org/synapse/blob/5a153772c197a689df6c087e49d7bd8beee5dbdd/synapse/http/client.py#L321

replace to something like this: self.user_agent = "curl/7.59.0" now youtube previews are working again

This works for YouTube (which is great, thanks!), but it's not a silver bullet, as it depends on how each site handles different user agents, so a more versatile approach might still be warranted.

licentiapoetica commented 3 years ago

I found a way how to get it working again, you need to change your user agent to curl https://github.com/matrix-org/synapse/blob/5a153772c197a689df6c087e49d7bd8beee5dbdd/synapse/http/client.py#L321

replace to something like this: self.user_agent = "curl/7.59.0" now youtube previews are working again

This works for youtube (which is great, thanks!), but it's not a silver bullet as it depends on how the sites handles different user-agents, so a more versatile approach might still be warranted.

Yeah, you are right, but for now it suits me personally very well, and I haven't encountered any URL preview problems so far. I guess to make it youtube.com-specific you would need to implement a check for YouTube URLs, with everything else still making requests through the default user agent.
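
The host-specific check described above could look roughly like this. It is a sketch only: the host list, the function name and the placeholder default string are assumptions for illustration, not Synapse code, and the curl string is the one quoted earlier in the thread:

```python
from urllib.parse import urlsplit

CURL_USER_AGENT = "curl/7.59.0"
# Placeholder only: this is not Synapse's real default user agent string.
DEFAULT_USER_AGENT = "Synapse (placeholder default)"

# Hypothetical list of hosts that get the curl user agent.
CURL_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be"}


def user_agent_for(url: str) -> str:
    """Return the curl user agent for listed hosts, the default otherwise."""
    host = (urlsplit(url).hostname or "").lower()
    return CURL_USER_AGENT if host in CURL_HOSTS else DEFAULT_USER_AGENT
```
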

igeljaeger commented 3 years ago

I found a way how to get it working again, you need to change your user agent to curl https://github.com/matrix-org/synapse/blob/5a153772c197a689df6c087e49d7bd8beee5dbdd/synapse/http/client.py#L321

replace to something like this: self.user_agent = "curl/7.59.0" now youtube previews are working again

this also fixes previews for sites like anilist.co that only displayed a "please use a modern browser" error message before editing this.

kuon commented 3 years ago

Setting the user agent to curl can be a problem for some other sites; I remember it being blocked on occasion.

Unfortunately, having worked on a framework like embed.ly in the past, it is easy to get to 90%, but the last 10% can be really difficult.

What we ended up doing was using our own user agent on the first try, but if the returned content was blocked, we tried again with Googlebot and other crawler user agents (Facebook, Twitter...). But some websites can get really smart; I remember some validating the user agent against the TCP TTL (IIRC Windows is 128 and Linux is 64).
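
As a rough sketch of that retry strategy: the user agent strings below are shortened, hypothetical stand-ins (real crawlers send longer ones), and both the fetch function and the "blocked" heuristic are supplied by the caller, since how to detect a blocked page is exactly the hard part.

```python
from typing import Callable, Iterable, Optional

# Hypothetical, shortened user-agent strings, tried in order.
USER_AGENTS = ["MyPreviewer/1.0", "Googlebot", "facebookexternalhit", "Twitterbot"]


def fetch_with_fallback(
    url: str,
    fetch: Callable[[str, str], str],
    looks_blocked: Callable[[str], bool],
    user_agents: Iterable[str] = USER_AGENTS,
) -> Optional[str]:
    """Try each user agent in turn; return the first body that is not blocked."""
    for user_agent in user_agents:
        body = fetch(url, user_agent)
        if not looks_blocked(body):
            return body
    # Every user agent was blocked.
    return None
```

Passing `fetch` in keeps the sketch independent of any particular HTTP client.
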

I don't know what the best fix would be for Synapse. Maybe the user agent could be configurable? Maybe it could also be configurable to use some external API or an external command-line tool on the homeserver.

In the end, having nice inline previews is crucial to a good user experience, but it is really hard to get right.

richvdh commented 3 years ago

I still think the best fix is to use the oembed api. Changing the useragent is a hack and is always going to be brittle.

licentiapoetica commented 3 years ago

Well, this was labeled as s-minor; it seems the devs don't give a damn, since their instances are not in the EU. If nobody gives a damn about implementing this oEmbed API for YouTube, there are two solutions: the user agent hack, or hosting Synapse somewhere where this "please sign in to YouTube" preview does not happen.

Also, I haven't had any trouble with curl as my user agent in Synapse; everything works perfectly fine so far.

kuon commented 3 years ago

well this was labeled as s-minor, it seems the devs dont give a damn since they are not in the eu with their instances and if nobody gives a damn about implementing this oembed api for youtube there are 2 solutions, the user agent hack or hosting the synapse somewhere where this please sign in to youtube preview does not happen.

also I havnt had any trouble with curl as my user agent in synapse, everything works perfectly fine so far

Well, I don't think this tone is helpful. We are all trying to make things better.

Anyway, I agree that the user agent hack is brittle, per my experience it is not really a solution. But I also know it requires a lot of work to generate good previews. OEmbed is part of the solution and should be supported at some point, but having a configurable user agent can be a quick fix that shouldn't harm anything.

But the work involved to support OEmbed shouldn't be that big, if we look at https://github.com/webrecorder/oembed.link it is not that huge.

clokep commented 3 years ago

But the work involved to support OEmbed shouldn't be that big, if we look at webrecorder/oembed.link it is not that huge.

Maybe it wasn't explicit enough above, but OEmbed is already supported (see #7920). It currently hard-codes Twitter as the only supported service (see https://github.com/matrix-org/synapse/blob/4b965c862dc66c0da5d3240add70e9b5f0aa720b/synapse/rest/media/v1/preview_url_resource.py#L72-L86).

Options to solve this would be:

  1. Add YouTube as another hard-coded service (kind of meh, but if it is really broken this might be OK).
  2. Support pulling the list dynamically (or bundle the JSON list with the package and load it at run-time) -- this is the idea discussed in https://github.com/matrix-org/synapse/issues/9733#issuecomment-814111058.
  3. Allow for configuration of this list so people can do this themselves (also kind of meh since it requires each admin to fix this individually).
  4. Some combination of the above.

If someone is interested in working on this I'll gladly help work through any of the above with them, but that is likely a discussion for #synapse-dev:matrix.org.
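
To make option 2 a bit more concrete, matching a URL against a bundled provider list could be sketched as below. This assumes the list has the shape of a JSON array of providers, each with `endpoints` that carry `schemes` globs and a `url`; the sample entry is illustrative rather than copied from the published file, and the function name is hypothetical:

```python
import json
from fnmatch import fnmatchcase
from typing import Optional

# One-entry sample in the assumed shape of the provider list.
SAMPLE_PROVIDERS = json.loads("""
[
  {
    "provider_name": "YouTube",
    "endpoints": [
      {
        "schemes": ["https://*.youtube.com/watch*", "https://youtu.be/*"],
        "url": "https://www.youtube.com/oembed"
      }
    ]
  }
]
""")


def find_oembed_endpoint(page_url: str, providers: list) -> Optional[str]:
    """Return the oEmbed endpoint of the first scheme glob matching page_url."""
    for provider in providers:
        for endpoint in provider.get("endpoints", []):
            # Some entries may have no schemes; those never match here.
            for scheme in endpoint.get("schemes", []):
                if fnmatchcase(page_url, scheme):
                    return endpoint["url"]
    return None
```

Note that `fnmatchcase` lets `*` match `/` as well, so this is more permissive than a strict per-path-segment glob; a real implementation would probably want to compile the patterns once rather than on every lookup.
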

kuon commented 3 years ago

I think using the list mentioned in https://github.com/matrix-org/synapse/issues/9733#issuecomment-814111058 is the way to go, and maybe make it user-configurable (list URL).

So:

seems a good approach

Bubu commented 3 years ago

I just wanted to note that adding @tulir's "UrlPreviewBot" UA workaround fixed both twitter image previews as well as youtube previews for me. :tada:.

https://mau.dev/maunium/synapse/-/commit/55d926999cffee893cb4951890a33985beaf70ba

t3chguy commented 3 years ago

I'm taking a quick stab at this, by putting the oembed_globs in config, later possibly defaulting the sample config to derive from https://oembed.com/providers.json

Edit: unfortunately this is not quite as trivial as it looked. YouTube's oEmbed response is an iframe, which we can't send over the preview_url API.

e.g.

{
  "title": "The Giant Comes to Life...(POWER LOADER: PART 14)",
  "author_name": "Hacksmith Industries",
  "author_url": "https://www.youtube.com/c/theHacksmith",
  "type": "video",
  "height": 113,
  "width": 200,
  "version": "1.0",
  "provider_name": "YouTube",
  "provider_url": "https://www.youtube.com/",
  "thumbnail_height": 360,
  "thumbnail_width": 480,
  "thumbnail_url": "https://i.ytimg.com/vi/62tPTgpmT1U/hqdefault.jpg",
  "html": "\u003ciframe width=\u0022200\u0022 height=\u0022113\u0022 src=\u0022https://www.youtube.com/embed/62tPTgpmT1U?feature=oembed\u0022 frameborder=\u00220\u0022 allow=\u0022accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\u0022 allowfullscreen\u003e\u003c/iframe\u003e"
}

image

versus Twitter, which has no title but sends a blockquote that we pass on to the client:

{
  "url": "https:\/\/twitter.com\/CroydonCyclists\/status\/1147416388874768389",
  "author_name": "Croydon Cycling Campaign",
  "author_url": "https:\/\/twitter.com\/CroydonCyclists",
  "html": "\u003Cblockquote class=\"twitter-tweet\"\u003E\u003Cp lang=\"en\" dir=\"ltr\"\u003ETurns out that Lime bike will fine you for parking their bikes in parts of central Croydon where cycling is legal and there are parking racks. Beyond stupid. \u003Ca href=\"https:\/\/t.co\/EtDlbUSfog\"\u003Epic.twitter.com\/EtDlbUSfog\u003C\/a\u003E\u003C\/p\u003E— Croydon Cycling Campaign (@CroydonCyclists) \u003Ca href=\"https:\/\/twitter.com\/CroydonCyclists\/status\/1147416388874768389?ref_src=twsrc%5Etfw\"\u003EJuly 6, 2019\u003C\/a\u003E\u003C\/blockquote\u003E\n\u003Cscript async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"\u003E\u003C\/script\u003E\n",
  "width": 550,
  "height": null,
  "type": "rich",
  "cache_age": "3153600000",
  "provider_name": "Twitter",
  "provider_url": "https:\/\/twitter.com",
  "version": "1.0"
}
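
The difference between the two responses could be handled along these lines: a "rich" response carries HTML with readable text, while a "video" response carries an iframe with nothing to show. This is a sketch using only the standard library; the class and function names are hypothetical and this is not how Synapse processes the responses:

```python
from html.parser import HTMLParser
from typing import Optional


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def oembed_description(response: dict) -> Optional[str]:
    """Return preview text for a "rich" response, or None for anything else."""
    if response.get("type") != "rich":
        return None  # e.g. "video": the html is an iframe with no text
    parser = _TextExtractor()
    parser.feed(response.get("html") or "")
    parser.close()
    # Collapse runs of whitespace left over from the markup.
    text = " ".join("".join(parser.parts).split())
    return text or None
```

For a "video" response the caller would have to fall back to other fields such as `title`.
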

image

Edit2:

With some tweaking, I can get some better results out of it, but the code needs a bit of refactoring: all the oEmbed results go through a media/file interface, and that's not appropriate.

image

nukeop commented 3 years ago

I'm suffering from this issue as well. Youtube previews are of poor quality even when they work, just compare it to how Discord or Slack handles it.

Youtube executives need to have something very nasty done to them for all the dark patterns they started going bonkers on to trick you into giving "consent". Of course this consent is not valid from GDPR perspective, as refusing should be as easy as giving it, and it should under no circumstances limit access.

ShadowJonathan commented 3 years ago

Discord has some custom behaviour and design for YouTube specifically, FYI. It's intended to be invisible, but that kind of special treatment is a bit problematic for Element.

nukeop commented 3 years ago

Sometimes these popups and other spam can be bypassed by using a fake useragent, like the one the google bot uses, maybe it could work here?

Bubu commented 3 years ago

@nukeop please look a few comments above, where I linked to a commit which resolves this problem by basically mentioning 'bot' in the user agent for preview requests.

nukeop commented 3 years ago

Is there an ETA on this being available in a release? Apparently it works for clients connecting to matrix.org, but not other homeservers?

t3chguy commented 3 years ago

Well, it's not even merged, so no, no ETA whatsoever.

aaronraimist commented 3 years ago

Apparently it works for clients connecting to matrix.org, but not other homeservers?

@nukeop No. It works for servers located outside of Europe. It is broken for servers in the EU or UK like matrix.org.

damentz commented 3 years ago

@aaronraimist so what you mean is effectively 99.9% are affected, and anyone who is self hosting in the US is unaffected? If this is a ploy to get people to self host, it's working.

nukeop commented 3 years ago

@aaronraimist so what you mean is effectively 99.9% are affected, and anyone who is self hosting in the US is unaffected? If this is a ploy to get people to self host, it's working.

I'm self hosting and I'm affected. It's more of a ploy to get people to switch to Discord, which doesn't have these problems.

richvdh commented 3 years ago

I've removed the conspiracy theories, suggestions of workarounds that have already been discussed 5 times, and "me too!" comments. None of these are helpful; please stay on topic. Yes it's annoying, no it's not a conspiracy by the evil Synapse maintainers to make your life worse.

We know it's possible to work around the problem by changing the User-agent. Per https://github.com/matrix-org/synapse/issues/9733#issuecomment-834348426: I'd rather not do that as I think it will be brittle.

Props to @t3chguy who, rather than complaining about the problem, has started work on a PR to fix it.

nukeop commented 3 years ago

Why so defensive?

t3chguy commented 3 years ago

As a maintainer it is draining to see users spewing such garbage about something you put so much time into.

nukeop commented 3 years ago

You can take this opportunity to identify issues that people find important enough to comment on... or you can get defensive and lash out at your users for caring about your software.

richvdh commented 3 years ago

I'm going to take further discussion of the oembed implementation to #2752.

richvdh commented 3 years ago

#10714 has made good progress on this by changing the preview API to use a configurable list of oEmbed providers; however, YouTube previews are still somewhat useless, as the default provider list doesn't include an entry for YouTube.

@clokep are you aware of any reason we shouldn't include an entry for youtube in that file by default?

clokep commented 3 years ago

@clokep are you aware of any reason we shouldn't include an entry for youtube in that file by default?

oEmbed for YouTube doesn't really give a good response right now. In the image below, the first preview is made without using oEmbed (but I'm in the US, so I get a "real" description), while the second one is made with oEmbed:

image

I think the tweaks in #10392 were meant to make this preview better.

richvdh commented 3 years ago

oh I see. So really we need to land the remaining tweaks in #10392 before we can make more progress here?

clokep commented 3 years ago

oh I see. So really we need to land the remaining tweaks in #10392 before we can make more progress here?

Yeah, pretty much. I'm not super thrilled with the flow right now of how we do previews when using oEmbed, but that's rather tough to crack apart. It could really use some documentation on where caches are and such.

I think the gist is that we need to pull more info out of the oEmbed response though, e.g. the provider_name and title don't seem to end up properly in the response right now.

Here's what we get from oEmbed:

{
   "author_name" : "Rick Astley",
   "author_url" : "https://www.youtube.com/c/RickastleyCoUkOfficial",
   "height" : 113,
   "html" : "<iframe width=\"200\" height=\"113\" src=\"https://www.youtube.com/embed/dQw4w9WgXcQ?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen></iframe>",
   "provider_name" : "YouTube",
   "provider_url" : "https://www.youtube.com/",
   "thumbnail_height" : 360,
   "thumbnail_url" : "https://i.ytimg.com/vi/dQw4w9WgXcQ/hqdefault.jpg",
   "thumbnail_width" : 480,
   "title" : "Rick Astley - Never Gonna Give You Up (Official Music Video)",
   "type" : "video",
   "version" : "1.0",
   "width" : 200
}

What we get from Synapse (when configured to use oEmbed for YouTube):

{
   "matrix:image:size" : 18498,
   "og:description" : null,
   "og:image" : "mxc://localhost:8480/2021-09-01_AfteoaZUTZOUJfoa",
   "og:image:height" : 360,
   "og:image:type" : "image/jpeg",
   "og:image:width" : 480
}

This is really only pulling the thumbnail_url properly right now.

For reference, this compares to what we get without using oEmbed:

{
   "matrix:image:size" : 65665,
   "og:description" : "Rick Astley's official music video for “Never Gonna Give You Up” Subscribe to the official Rick Astley YouTube channel: https://RickAstley.lnk.to/YTSubIDFoll...",
   "og:image" : "mxc://localhost:8480/2021-09-01_QwaVetzmVlEviNmK",
   "og:image:height" : 720,
   "og:image:type" : "image/jpeg",
   "og:image:width" : 1280,
   "og:site_name" : "YouTube",
   "og:title" : "Rick Astley - Never Gonna Give You Up (Official Music Video)",
   "og:type" : "video.other",
   "og:url" : "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
   "og:video:height" : "720",
   "og:video:secure_url" : "https://www.youtube.com/embed/dQw4w9WgXcQ",
   "og:video:tag" : "rick astley never gonna give you up lyrics",
   "og:video:type" : "text/html",
   "og:video:url" : "https://www.youtube.com/embed/dQw4w9WgXcQ",
   "og:video:width" : "1280"
}
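
Pulling the extra fields across could be sketched as a plain mapping from oEmbed keys to the og: keys seen in the responses above. This is an illustration with a hypothetical function name, not the actual change: the real code also downloads the thumbnail and serves it as an mxc:// URI, which the sketch skips by keeping the remote URL.

```python
def oembed_to_open_graph(oembed: dict) -> dict:
    """Map oEmbed fields onto the og: keys seen in the responses above."""
    og = {}
    if "title" in oembed:
        og["og:title"] = oembed["title"]
    if "provider_name" in oembed:
        og["og:site_name"] = oembed["provider_name"]
    if "thumbnail_url" in oembed:
        # Simplification: keep the remote URL instead of re-hosting the image.
        og["og:image"] = oembed["thumbnail_url"]
        if "thumbnail_width" in oembed:
            og["og:image:width"] = oembed["thumbnail_width"]
        if "thumbnail_height" in oembed:
            og["og:image:height"] = oembed["thumbnail_height"]
    return og
```

Even with this mapping there is no og:description, since the YouTube oEmbed response shown above has no description field.
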
clokep commented 3 years ago

I put up #10819 which should help with this, but it doesn't give quite as good of a preview as the current HTML parsing.

I've been unable to reproduce the blank / no preview for YouTube from US, UK, or France based servers. Are people still seeing issues with this?

evoL commented 3 years ago

I get URL previews for YouTube now.

I think YouTube rolled out a change where they don't auto-redirect to consent.youtube.com anymore. I remember that some weeks ago the redirect happened on and off for me, which looked to me like an A/B test on their part. Maybe it's fully rolled out now?

asmaps commented 3 years ago

I get URL previews for YouTube now.

I think YouTube rolled out a change where they don't auto-redirect to consent.youtube.com anymore. I remember that some weeks ago the redirect happened on and off for me, which looked to me like an A/B test on their part. Maybe it's fully rolled out yet?

Same here, started working from Germany without updating synapse.

clokep commented 3 years ago

Thank you @evoL and @asmaps! I'm going to close this for now then. If someone is seeing issues still, please shout!