matrix-org / matrix-viewer

View the history of public and world readable Matrix rooms
https://archive.matrix.org
Apache License 2.0

Think about `rel=canonical` linking #251

Closed bkil closed 1 year ago

bkil commented 1 year ago

Spawned from https://github.com/matrix-org/matrix-public-archive/issues/238#issuecomment-1568963014.

It is not obvious how to apply it here, but the basic idea is this: if multiple pages differ only by a query argument and contain exactly the same set of messages with only minor differences (such as highlighting or preview metadata), they should all link back to a single canonical URL. A search engine crawler is then free to discard any alternative links that fold back to the same canonical URL instead of indexing each of them separately.
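A minimal sketch of that folding, assuming the `at` parameter mentioned later in this thread is the only presentation-only query argument. The helper names and the parameter set are hypothetical and this is not how matrix-viewer is implemented; Python is used only to keep the illustration short.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters that change only presentation (highlighting,
# preview metadata) and not the set of messages shown. Only `at` comes from
# this thread; a real list would need auditing against the actual routes.
PRESENTATION_ONLY_PARAMS = {"at"}


def canonical_url(url: str) -> str:
    """Return `url` without presentation-only query parameters or fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in PRESENTATION_ONLY_PARAMS
    ]
    # Rebuild the URL; the fragment is dropped because it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url: str) -> str:
    """Render the <link> element that would go in the page's <head>."""
    href = escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}" />'
```

Every variant of a page that differs only in `at` then yields the same `href`, which is what lets a crawler treat the variants as one document.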

Valid for <link>, it defines the preferred URL for the current document, which helps search engines reduce duplicate content.

See here for detailed explanation:

For example, StackExchange offers path-based routing for individual answers to a given question, but marks the document up so that each almost identical such document shall refer back to the path of the question as a canonical link:

https://stackoverflow.com/a/482129/796832 -> https://stackoverflow.com/questions/184618/what-is-the-best-comment-in-source-code-you-have-ever-encountered/482129#482129

<link rel="canonical" href="https://stackoverflow.com/questions/184618/what-is-the-best-comment-in-source-code-you-have-ever-encountered" />

MadLittleMods commented 1 year ago

Ahhh, based on your explanation of <link rel="canonical" href="..."> here I think I misunderstood the purpose.

I was thinking that <link rel="canonical" href="..."> pointed to the main document where you would find the permalinked item, and that search engines would treat the current URL as a special-cased individual view of the event.

If search engines typically just use this to deduplicate results and avoid wasteful crawling, that's not necessarily a bad thing. I'm sure a search result would still highlight the relevant text you searched for and use the scroll-to-text-fragment syntax (#:~:text=foo) when you visit the page, but it seems like it may not use our ?at=$abc query parameter to link exactly to the relevant message.

It seems to work out for Reddit and StackExchange, which both do this :shrug:. I think I'm in favor of adding this :fast_forward:

Relevant links:

bkil commented 1 year ago

A bit better explanation: https://en.wikipedia.org/wiki/Canonical_link_element

jonaharagon commented 1 year ago

I don't know whether this should be a separate issue, but I would also like rel=canonical to be used to deduplicate matrix-public-archive instances, as discussed at https://github.com/matrix-org/matrix-spec-proposals/pull/4021#discussion_r1212714926:

If we wanted something specific to the Matrix Public Archive URL format, we could use an event type scoped to the sub-domain like org.matrix.archive.canonical to convey this information.

👍 Maybe this is something that should be implemented specifically for this client in the way you stated, as opposed to in an MSC. The more I think about that, the more sense it makes.

(The use-case for this is the same self-hosted community situation we talked about at https://github.com/matrix-org/matrix-public-archive/issues/234#issuecomment-1568741932)
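A rough sketch of what cross-instance deduplication could look like, assuming a room can somehow advertise a preferred archive base URL (for example through the `org.matrix.archive.canonical` event type floated above) and that every instance uses the same path layout. The function name and the `preferred_base` argument are hypothetical, and the content of such an event is not defined anywhere in this thread.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def canonical_on_preferred_instance(page_url: str, preferred_base: Optional[str]) -> str:
    """Point the canonical URL at the room's preferred archive instance.

    `preferred_base` is a hypothetical value, e.g. read from a room state
    event; how it would be stored and validated is left open here.
    """
    if not preferred_base:
        # No preference advertised: treat the serving instance as canonical.
        return page_url
    page = urlsplit(page_url)
    base = urlsplit(preferred_base)
    # Keep the path and query, swap in the preferred scheme and host.
    return urlunsplit((base.scheme, base.netloc, page.path, page.query, ""))
```

A self-hosted instance would emit its own URL as canonical for its community's rooms, while other instances serving the same rooms would point back to it.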