matrix-org / matrix-viewer

View the history of public and world readable Matrix rooms
https://archive.matrix.org
Apache License 2.0

Use URL hash fragment anchor for message permalink, add `id` attribute of message to jump on it #238

Open bkil opened 1 year ago

bkil commented 1 year ago

Include the Matrix event ID in the URI hash, e.g.:

https://archive.matrix.org/r/securemessagingapps:matrix.org/date/2023/05/30#$5cQZRtG9bsleXZI2x-s6wEDfeZ5B1nC_jEvOwpA-VdI

To make this work, we would also need to set the id attribute of each timeline message to the respective value (instead of the current data-event-id) so the browser will jump to it upon loading. You can use the :target CSS selector to highlight the matching message on the timeline with a different background and add a mark on the side as well.
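A minimal sketch of the idea, assuming a hypothetical `renderMessage` helper and a hypothetical `timeline-message` class name (neither is taken from the matrix-viewer codebase). It emits the event ID as the element `id` and relies on `:target` for highlighting, so no per-event class is needed:

```typescript
// Hypothetical helpers for illustration; not part of matrix-viewer.

// Escape characters that would break out of a double-quoted HTML attribute.
function escapeAttribute(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Render one timeline message whose `id` is the Matrix event ID, so that a
// URL ending in `#<eventId>` makes the browser scroll to it without JavaScript.
function renderMessage(eventId: string, bodyHtml: string): string {
  // An HTML5 id must be non-empty and must not contain whitespace.
  if (eventId.length === 0 || /\s/.test(eventId)) {
    throw new Error("event ID cannot be used as an HTML id");
  }
  return `<li class="timeline-message" id="${escapeAttribute(eventId)}">${bodyHtml}</li>`;
}

// Stylesheet fragment: `:target` matches the element whose id equals the URL
// fragment, so the selector never has to spell out (or escape) the `$` prefix.
const targetHighlightCss = `
.timeline-message:target {
  background-color: #fff8c5;
  border-left: 4px solid #d4a72c;
}
`;
```

The colors are placeholders. The point is that the highlight follows from the URL fragment alone, so every permalink into the same day can share one identical HTML response.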

If the backend for some reason would also need to access the event ID (without JavaScript) to return messages for the given date, consider adding it to both the query and the hash.
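As a rough sketch of such a link, assuming a hypothetical `buildPermalink` helper and using the `at` query parameter name that comes up later in this thread:

```typescript
// Hypothetical helper for illustration; not part of matrix-viewer.
// Puts the event ID in the query (visible to the server) and in the hash
// (used by the browser to scroll to the element with the matching id).
function buildPermalink(dayUrl: string, eventId: string): string {
  const url = new URL(dayUrl);
  // The query value is percent-encoded by URLSearchParams as needed.
  url.searchParams.set("at", eventId);
  // The fragment is never sent to the server; it only drives scrolling.
  url.hash = eventId;
  return url.toString();
}
```

Reading the link back with `new URL(link).searchParams.get("at")` returns the original event ID, whatever encoding the serialized query ends up using.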

There were restrictions in former versions of HTML on the syntax of the ID, but from HTML5, it should be non-empty and can contain basically anything except whitespace:

MadLittleMods commented 1 year ago

@bkil What benefit are you trying to achieve with this? I assume you're after the permalink event scrolling into view even when JavaScript is disabled?

Please note, we're not specifically optimizing for the disabled-JavaScript case, but simpler and more semantic markup is better for search engines, which we do care about. I don't think search engines care about scroll though :thinking:

We do need the ?at=$abc query parameter on the server backend in order to set the continuation position as you paginate backward and forward. The server also has to take the query parameter into account so that the server-side rendered HTML includes the selected event's metadata (URL previews), semantic attributes, styles, etc.

Duplicating the event ID in the hash and ?at=$abc query parameter seems like more hassle and noise than it's worth for the disabled JavaScript scroll benefit.

bkil commented 1 year ago

The way it is generated at present is actually inferior from an SEO standpoint. You now generate hundreds of pages per day (differentiated by the ID in the URI query), all containing the exact same content and somewhat interlinked. The only major differences between them are the invisible SEO metadata and the single hand-crafted class on the highlighted message, which substitutes for :target.

Search engines have heuristics to detect such link farms and either penalize such results or downrank the whole domain for this.

If keeping the continuation token is unavoidable, it may be included as long as it remains the same across links pointing towards the same wall of messages.

bkil commented 1 year ago

For inspiration, this is how IndieWeb generates their online archive (backed by a git repository and a bridge between Slack-IRC-Matrix) with excellent JS & noJS accessibility and optimized for SEO:

MadLittleMods commented 1 year ago

> You now generate hundreds of pages per day (differentiated by the ID in the URI query)

@bkil Ahh, that's a really interesting point (especially in terms of caching)! But this seemed to work out fine for Gitter with the same URL pattern for permalinks.

I don't think the Matrix Public Archive really qualifies as a link farm or spamdexing. Having a permalink for an item is pretty standard. You can even see this with Discourse or StackExchange sites.

As an interesting point of comparison, in the case of StackExchange questions/answers, they do duplicate the answer ID in the URL and the hash (I assume the hash is for scrolling): https://stackoverflow.com/a/482129/796832 -> https://stackoverflow.com/questions/184618/what-is-the-best-comment-in-source-code-you-have-ever-encountered/482129#482129

> If keeping the continuation token is unavoidable, it may be included as long as it remains the same across links pointing towards the same wall of messages

I'm not sure about the distinction you're trying to make here? Can you give an example?

bkil commented 1 year ago

I also know of blog engines from the 90s that generate a similar URL including a message ID in both the hash and the query. Although all such ranking algorithms are proprietary, I'd probably allow for including a tiny bit of context around each referenced message; however, including a whole day's worth of chat on each separate page would definitely not fly with me.

For tree-based or thread-based blog engines, this typically boils down to referring to a thread or subtree at a time, not the whole root every time.

In the search engines I've tried, results that are accessible through content-unique URLs are ranked higher. I.e., answers are not at the top, as they have been downranked by The Algorithm.

Your linked StackOverflow example also includes this crucial piece:

<link rel="canonical" href="https://stackoverflow.com/questions/184618/what-is-the-best-comment-in-source-code-you-have-ever-encountered" />
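For illustration, here is how a comparable tag could be derived for a day page. This is only a sketch under stated assumptions: `canonicalUrl` and `canonicalLinkTag` are hypothetical names, and treating the day URL without `?at=` as the canonical one is my suggestion, not something matrix-viewer does today.

```typescript
// Hypothetical helpers for illustration; not part of matrix-viewer.

// Strip the per-event parts of a permalink so that every permalink into the
// same day of messages maps to one shared canonical URL.
function canonicalUrl(permalink: string): string {
  const url = new URL(permalink);
  url.searchParams.delete("at"); // assumed to be the only per-event parameter
  url.hash = "";
  return url.toString();
}

// Build the <link rel="canonical"> tag for the page head.
function canonicalLinkTag(permalink: string): string {
  const href = canonicalUrl(permalink)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;");
  return `<link rel="canonical" href="${href}" />`;
}
```

With this, two permalinks that differ only in the selected event would declare the same canonical page, which is the property the StackOverflow example has.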

bkil commented 1 year ago

Drawbacks of link differentiation via the query pointing to the same page:

Advantages:

MadLittleMods commented 1 year ago

> Your linked StackOverflow example also includes this crucial piece:
>
> <link rel="canonical" href="https://stackoverflow.com/questions/184618/what-is-the-best-comment-in-source-code-you-have-ever-encountered" />

Please create a new separate issue about adding this (with the SO example) :fast_forward: -> https://github.com/matrix-org/matrix-public-archive/issues/251


> For tree-based or thread-based blog engines, this typically boils down to referring to a thread or subtree at a time, not the whole root every time.

Reddit and Twitter are a good example of this but they are slightly different use cases since they support infinite nested levels of threads. Both include the permalink ID in the URL for reference.

Reddit even has a ?context=3 query parameter to specify the depth of surrounding messages to show. For a Matrix room, the context for a given event is just the surrounding messages (whether that be in the main timeline or thread timeline) which is what we're already showing.
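To make the comparison concrete, a flat-timeline analogue of a context parameter might look like the sketch below. This is not Reddit's implementation and not how matrix-viewer selects messages; `TimelineEvent` and `withContext` are hypothetical names.

```typescript
// Hypothetical types and helper for illustration only.
interface TimelineEvent {
  eventId: string;
  body: string;
}

// Return the target event plus up to `context` events on either side.
// Returns an empty array when the target is not in the timeline.
function withContext(
  timeline: TimelineEvent[],
  targetId: string,
  context: number
): TimelineEvent[] {
  const index = timeline.findIndex((event) => event.eventId === targetId);
  if (index === -1) {
    return [];
  }
  const start = Math.max(0, index - context);
  const end = Math.min(timeline.length, index + context + 1);
  return timeline.slice(start, end);
}
```

The window is clamped at both ends of the timeline, so an event near the start or end of the day simply gets fewer neighbours on that side.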

It's unclear what impact our current level of bulk surrounding messages has on SEO, but it's also something we haven't measured and not something I'm particularly worried about at this stage. Based on that experience with Gitter, I've seen plenty of relevant permalinks appear in Google. I'm leaning towards leaving things as-is.


In terms of the drawbacks you listed for using the ?at=$abc query parameter, we can't really avoid including it in the URL since we want URL previews to work well.

And in terms of following a reply-chain without a page reload (as long as the messages are on the page), this isn't really relevant since we can still accommodate that with the Hydrogen client-side JS.

Caching seems like the most impactful benefit we could get from changing but also not a total deal-breaker in my opinion with how it currently works.