mastodon / mastodon

Your self-hosted, globally interconnected microblogging community
https://joinmastodon.org
GNU Affero General Public License v3.0

Hashtag search results and trending hashtags are inconsistent #18031

Open tassoman opened 2 years ago

tassoman commented 2 years ago

Steps to reproduce the problem

  1. search for a CamelCaseHashtag using lowercase characters
  2. you get all kinds of camels for the same hashtag 🐫 🐪
  3. then look at this screenshot of trending hashtags I've just taken on a v3.5.1 instance

Screenshot 2022-04-13 at 23-26-10 Mastodon Italia Social

Expected behaviour

trending stats should be aggregated and lowercase

Actual behaviour

trending stats of the same hashtag are divided by camels

Specifications

Browser: Firefox
OS: Windows

tassoman commented 2 years ago

A few years ago, you decided on case-insensitive hashtags (#3761)

Gargron commented 2 years ago

You can see the i is different between the two hashtags.

tassoman commented 2 years ago

Yes, sorry, I've only just noticed. 👀 If you don't want to transliterate, maybe we can ignore this issue... 🤔 Searches already get transliterated.

filippodb commented 2 years ago

You can see the i is different between the two hashtags.

Now we have this popular hashtag with the word "Mercoledì" (Wednesday) that comes up every week, so it's quite a mess because some people use the plain i and others use ì. It would be really good to have ì transliterated to i. Twitter returns tweets with both #mercoledi and #mercoledì:

https://twitter.com/hashtag/mercoledi

There are other vowels that have to be transliterated:

è é => e
ù => u
à => a
ò => o

And those are only for the Italian language.
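The mapping above could be sketched without a hand-written table by using Unicode decomposition. This is only an illustration in Python, not Mastodon's code: `normalize_hashtag` is a hypothetical helper name, and the approach only covers letters that decompose into a base character plus a combining mark (which includes all the Italian vowels listed, and "ö"), so it is not a full ASCII folding.

```python
import unicodedata


def normalize_hashtag(name: str) -> str:
    """Fold a hashtag to a lowercase, accent-free key (hypothetical helper)."""
    # NFKD splits "ì" into "i" followed by U+0300 (combining grave accent).
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a stronger lower() meant for caseless comparison.
    return stripped.casefold()


for tag in ["Mercoledì", "mercoledi", "Björk", "Bjork", "perché", "PERCHÈ"]:
    print(tag, "->", normalize_hashtag(tag))
```

With this kind of key, #mercoledi and #mercoledì would compare equal, as would the CamelCase variants from the original report.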

tassoman commented 2 years ago

Sorry for being an ugly person, I've found something in JavaScript on Stack Overflow: https://stackoverflow.com/a/2128054 Maybe the Django urlify JavaScript solution could work?

I found the data provider for trending tags by reading the related JavaScript action.

So I bet the controller behind the /api/v1/trends/tags REST resource should transliterate its output.

{
  "name": "crushDelMercoledì",
  "url": "https://mastodon.uno/tags/crushDelMercoled%C3%AC",
  "history": [{"day": "1649894400", "accounts": "1", "uses": "1"}]
}
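To make the suggestion concrete, here is a rough Python sketch of merging trend entries whose names fold to the same key. It is not how the Mastodon controller works; `fold` and `merge_trends` are hypothetical names, and the second sample entry is made up for illustration. Only the first entry comes from the response above.

```python
import unicodedata
from collections import defaultdict


def fold(name: str) -> str:
    """Lowercase, accent-free key for a tag name (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def merge_trends(tags):
    """Merge entries whose names fold to the same key, summing per-day counts."""
    merged = {}
    for tag in tags:
        days = merged.setdefault(
            fold(tag["name"]), defaultdict(lambda: {"accounts": 0, "uses": 0})
        )
        for entry in tag["history"]:
            # The API returns the counts as strings, so convert before summing.
            days[entry["day"]]["accounts"] += int(entry["accounts"])
            days[entry["day"]]["uses"] += int(entry["uses"])
    return {key: dict(days) for key, days in merged.items()}


trends = [
    # Copied from the response above.
    {"name": "crushDelMercoledì",
     "history": [{"day": "1649894400", "accounts": "1", "uses": "1"}]},
    # Made-up entry for the unaccented spelling.
    {"name": "crushdelmercoledi",
     "history": [{"day": "1649894400", "accounts": "2", "uses": "3"}]},
]
result = merge_trends(trends)
print(result)
```

One caveat: summing "accounts" can over-count when the same account used both spellings, because the response carries no account identities. A correct total would have to be computed where the per-account data lives.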

tassoman commented 2 years ago

If you search for "crush", you only see the transliterated result (which itself shows wrong account totals, 6 and 8) ... ❓

Screenshot 2022-04-14 at 22-45-08 Mastodon Italia Social

Then, on the tag pages, we have both entries, with different toots:

filippodb commented 2 years ago

The same problem happens with toots about the Icelandic singer Björk, because "ö" is missing from our Italian desktop keyboard, so on Mastodon we have different timelines about the same artist:

https://mastodon.uno/tags/Bjork

https://mastodon.uno/tags/björk

Gargron commented 2 years ago

Search index in Elasticsearch uses ascii normalization, but the database doesn’t. It’s not trivial to update the database schema in this case which is why it hasn’t been done/prioritized, but it would be very nice to do.
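One reason a database change like this is awkward is that existing tags which differ only by accents would collapse onto the same normalized name and have to be merged. A small Python sketch of finding such collisions, assuming nothing about Mastodon's actual schema (`fold`, `find_collisions`, and the sample tag list are all hypothetical):

```python
import unicodedata
from collections import defaultdict


def fold(name: str) -> str:
    """Lowercase, accent-free key for a tag name (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def find_collisions(tag_names):
    """Group tag names by folded key and keep only keys shared by several tags."""
    groups = defaultdict(list)
    for name in tag_names:
        groups[fold(name)].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Illustrative list only; a real check would read the names from the tags table.
existing = ["Bjork", "björk", "mastodon", "mercoledi", "mercoledì"]
collisions = find_collisions(existing)
print(collisions)
```

Each group in the result is a set of tags whose statuses, follows, and trend history would need to be reconciled before a unique normalized name could be enforced.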

tassoman commented 2 years ago

I had the same intuition; besides, I can't do anything in Ruby.

I think this issue can be kept for future reference, in case a major refactoring is planned.

What about using Elasticsearch for hashtags as well?