snarfed / bridgy-fed

🌉 A bridge between decentralized social network protocols
https://fed.brid.gy
Creative Commons Zero v1.0 Universal

Hashtags with extended alphabet characters aren't recognized as hashtags, AP=>Bluesky #1131

Open MS-potilas opened 3 months ago

MS-potilas commented 3 months ago

AP Hashtags containing extended alphabet characters, like ä (a with dots) and ö (o with dots), aren't recognized as hashtags. They show as text in Bluesky.

Example: https://mementomori.social/@rolle/112586679114646311 https://bsky.app/profile/rolle.mementomori.social.ap.brid.gy/post/3kuikyelvzdc2

Here #Äänestäminen was not recognized as a hashtag.

snarfed commented 3 months ago

Huh, this turned out to be more interesting than I thought. Mastodon's AS2 JSON for this post removes the umlauts from those characters in the tag objects. It renders them in content and in the UI:

image

...but the AS2 tag has "name" : "#aanestaminen", no umlauts. Full object below.

Interestingly, if you click on the #Äänestäminen hashtag chip in the UI, it goes to the hashtag page, https://mementomori.social/tags/%C3%84%C3%A4nest%C3%A4minen , which has the umlauts. They're only for show, though; evidently they're not in the underlying hashtag index. If you remove them from that URL to get https://mementomori.social/tags/Aanestaminen , it renders the hashtag without them but shows the same results.

{
   "type" : "Note",
   "id" : "https://mementomori.social/users/rolle/statuses/112586679114646311",
   "url" : "https://mementomori.social/@rolle/112586679114646311",
   "attributedTo" : "https://mementomori.social/users/rolle",
   "content" : "<p>Muista käydä äänestämässä! Klo 20 asti aikaa. On tyhmää olla vaikuttamatta, kun siihen demokratiassa on mahdollisuus. Kaikille maailmassa ei tällaista suoda.</p><p><a href=\"https://mementomori.social/tags/Eurovaalit2024\" class=\"mention hashtag\" rel=\"tag\">#<span>Eurovaalit2024</span></a> <a href=\"https://mementomori.social/tags/Eurovaalit\" class=\"mention hashtag\" rel=\"tag\">#<span>Eurovaalit</span></a> <a href=\"https://mementomori.social/tags/%C3%84%C3%A4nest%C3%A4minen\" class=\"mention hashtag\" rel=\"tag\">#<span>Äänestäminen</span></a> <a href=\"https://mementomori.social/tags/Politiikka\" class=\"mention hashtag\" rel=\"tag\">#<span>Politiikka</span></a></p>",
   "tag" : [
      {
         "href" : "https://mementomori.social/tags/eurovaalit2024",
         "name" : "#eurovaalit2024",
         "type" : "Hashtag"
      },
      {
         "href" : "https://mementomori.social/tags/eurovaalit",
         "name" : "#eurovaalit",
         "type" : "Hashtag"
      },
      {
         "href" : "https://mementomori.social/tags/aanestaminen",
         "name" : "#aanestaminen",
         "type" : "Hashtag"
      },
      {
         "href" : "https://mementomori.social/tags/politiikka",
         "name" : "#politiikka",
         "type" : "Hashtag"
      }
   ]
}

snarfed commented 3 months ago

I actually like this; it seems clever and a good UX idea, but it's definitely more difficult to translate. Bluesky uses index-based facets for hashtags and other rich text, but Mastodon's AS2 tags don't have indices, so we have to search for each tag's name in the content. That doesn't work in this case because the name is the normalized text, eg #aanestaminen, which doesn't have the umlauts.
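To make the failure concrete, here is a minimal sketch of name-based facet lookup. It is not Bridgy Fed's actual code: `hashtag_facets` is a hypothetical helper, and the input is assumed to be the plain-text rendering of the post's hashtag line. The facet shape (`byteStart`, `byteEnd`, `app.bsky.richtext.facet#tag`) follows Bluesky's rich text facets.

```python
import re

def hashtag_facets(text, tag_names):
    """Build Bluesky-style tag facets for AS2 tag names found verbatim in text."""
    facets = []
    for name in tag_names:
        # Case-insensitive search; (?!\w) keeps #eurovaalit from matching
        # the start of #Eurovaalit2024.
        match = re.search(re.escape(name) + r"(?!\w)", text, re.IGNORECASE)
        if match is None:
            continue  # the name isn't in the text, so no facet can be built
        start, end = match.span()
        facets.append({
            "index": {
                # Facet indices count UTF-8 bytes, not characters.
                "byteStart": len(text[:start].encode("utf-8")),
                "byteEnd": len(text[:end].encode("utf-8")),
            },
            "features": [{
                "$type": "app.bsky.richtext.facet#tag",
                "tag": text[start + 1:end],  # assumes name starts with '#'
            }],
        })
    return facets

text = "#Eurovaalit2024 #Eurovaalit #Äänestäminen #Politiikka"
names = ["#eurovaalit2024", "#eurovaalit", "#aanestaminen", "#politiikka"]
found = [f["features"][0]["tag"] for f in hashtag_facets(text, names)]
print(found)  # ['Eurovaalit2024', 'Eurovaalit', 'Politiikka']
```

The normalized name #aanestaminen never matches #Äänestäminen, so that hashtag gets no facet and shows up as plain text.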

I could do something Mastodon-specific: parse content as HTML and search for class="hashtag" or rel="tag". I'd still have to map the umlaut text there to the plain Latin text in tag.name, though, and that's a proprietary special case that I'd rather avoid. Or I could ignore tags entirely and only look at the parsed HTML, but that's even more proprietary. Hrm.
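For illustration, a rough sketch of what that Mastodon-specific approach could look like, using only the standard library. `TagLinkParser` and `fold` are hypothetical names. `fold` only approximates the normalization seen in tag.name by stripping combining marks and lowercasing; Mastodon's actual rules aren't described in this thread and may differ.

```python
import unicodedata
from html.parser import HTMLParser

class TagLinkParser(HTMLParser):
    """Collect the visible text of every <a rel="tag"> link."""

    def __init__(self):
        super().__init__()
        self.hashtags = []
        self._in_link = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        rel = dict(attrs).get("rel") or ""
        if tag == "a" and "tag" in rel.split():
            self._in_link = True
            self._buf = []

    def handle_data(self, data):
        if self._in_link:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.hashtags.append("".join(self._buf))
            self._in_link = False

def fold(text):
    """Approximate tag.name: decompose, drop combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

content = (
    '<p><a href="https://mementomori.social/tags/Eurovaalit" '
    'class="mention hashtag" rel="tag">#<span>Eurovaalit</span></a> '
    '<a href="https://mementomori.social/tags/%C3%84%C3%A4nest%C3%A4minen" '
    'class="mention hashtag" rel="tag">#<span>Äänestäminen</span></a></p>'
)
parser = TagLinkParser()
parser.feed(content)
print(parser.hashtags)                     # ['#Eurovaalit', '#Äänestäminen']
print([fold(h) for h in parser.hashtags])  # ['#eurovaalit', '#aanestaminen']
```

The folded strings could then be compared against each tag.name, but that comparison depends on guessing Mastodon's normalization correctly, which is the proprietary part.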

snarfed commented 1 month ago

More details on Mastodon's behavior in https://github.com/mastodon/mastodon/issues/26518 . No response from their team though.

MS-potilas commented 1 month ago

FYI, it looks like this is fixed in Iceshrimp https://bsky.app/profile/AlderForrest.1m2lab.anvil.top.ap.brid.gy/post/3l25re3eiu7c2 as the hashtag #härkis is working. https://1m2lab.anvil.top/

snarfed commented 1 month ago

@MS-potilas nice! Or maybe it always worked in Iceshrimp? Here are the key parts of the AS2 for that post:

  "content": "<p><span>h\u00e4rkisdolmiospagettikastike. Ehdottomasti jatkoon!<br><br></span><a href=\"https://1m2lab.anvil.top/tags/h\u00e4rkis\" rel=\"tag\">#h\u00e4rkis</a></p>",
  "tag": [{
      "type": "Hashtag",
      "href": "https://1m2lab.anvil.top/tags/h%C3%A4rkis",
      "name": "#h\u00e4rkis"
    }]

Unlike Mastodon, Iceshrimp preserves the ä in the tag's name, so Bridgy Fed is able to translate it.

MS-potilas commented 1 month ago

Ah, I thought Iceshrimp was a Mastodon fork, but it's a Misskey fork, so maybe it did work from the beginning.

MS-potilas commented 1 month ago

What if we searched the content with umlauts removed to get the indices? Those indices would also work with the original content with umlauts. That would be simpler than parsing the HTML tags in the content etc. This would of course apply only to Mastodon. Just a thought.

snarfed commented 1 month ago

Sadly, Bluesky facet indices are bytes, not characters/graphemes, so they won't match. Eg in UTF-8, a is one byte and ä is two.
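A small sketch of the mismatch, using a made-up sentence and umlauts stripped by hand. `byte_offset` is a hypothetical helper, and the example assumes stripping keeps the character count unchanged, which real normalization may not guarantee.

```python
def byte_offset(text, char_index):
    """Convert a character index into text to a UTF-8 byte offset."""
    return len(text[:char_index].encode("utf-8"))

original = "Muista #Äänestäminen tänään"
stripped = "Muista #Aanestaminen tanaan"  # umlauts removed by hand for the demo

# Character indices found in the stripped text.
char_start = stripped.find("#Aanestaminen")
char_end = char_start + len("#Aanestaminen")

raw = original.encode("utf-8")

# Character indices misused as byte indices cut the hashtag short.
print(raw[char_start:char_end].decode("utf-8"))  # '#Äänestämi'

# The byte offsets a facet would need for the same span.
print(byte_offset(original, char_start), byte_offset(original, char_end))  # 7 23
```

Each ä or Ä before a position shifts its byte offset by one, so indices found in the stripped text drift further from the real byte offsets as more umlauts appear.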