mastodon / documentation

Mastodon documentation
https://docs.joinmastodon.org
GNU Free Documentation License v1.3

Instructions for Elasticsearch setup for Chinese support are no longer current #1428

Open mhkhung opened 5 months ago

mhkhung commented 5 months ago

Steps to reproduce the problem

Try to follow the setup here: https://docs.joinmastodon.org/admin/elasticsearch/#search-optimization-for-other-languages

The current code does not match the diff.

Expected behaviour

Docs can be followed

Actual behaviour

Diff no longer valid

Detailed description

The diff is no longer current. It's unclear how to update it, or how to fix existing indexes. Also, a code-level patch is very undesirable for administrators: does the patch need to stay in place permanently, or only when the index is created? I do not want to maintain a fork of the code, given all the recent security issues. Can't this be handled upstream in code or config?
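To make the config-driven idea concrete, here is a minimal sketch in plain Ruby. It is not how Mastodon works today: the `ES_TOKENIZER` variable and the `content_analysis` helper are hypothetical names invented for illustration, and the filter list is shortened. The `stconvert` options are the ones from the documented patch.

```ruby
# Hypothetical sketch: choose the tokenizer from the environment instead of
# patching the index definitions. ES_TOKENIZER is an assumed variable name,
# not an existing Mastodon setting.
def content_analysis(env = ENV)
  tokenizer = env.fetch('ES_TOKENIZER', 'standard')

  analysis = {
    analyzer: {
      content: {
        tokenizer: tokenizer,
        # Shortened filter list; the real index definitions use more filters.
        filter: %w(lowercase asciifolding cjk_width),
      },
    },
  }

  # Only add the Traditional-to-Simplified char filter when an IK tokenizer
  # is selected, since it needs the stconvert plugin to be installed.
  if tokenizer.start_with?('ik_')
    analysis[:char_filter] = {
      tsconvert: {
        type: 'stconvert',
        keep_both: false,
        delimiter: '#',
        convert_type: 't2s',
      },
    }
    analysis[:analyzer][:content][:char_filter] = %w(tsconvert)
  end

  analysis
end
```

With something along these lines, an administrator would set one variable and rebuild the indexes rather than carry a patch across upgrades. Whether the maintainers would accept such a setting is an open question.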

Mastodon instance

No response

Mastodon version

main-latest


mogita commented 3 days ago

For what it's worth, here's the patch I ended up with for v4.2.12. I'm no Elasticsearch expert; I just copied everything from the current docs, and it got my server (kind of) working.

BTW, I'm using it in my Mastodon devops setup, link for anyone who's interested.
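Since the patch below is described as only "kind of" working, one way to check it is to ask Elasticsearch how a given analyzer tokenizes a sample string, using the `_analyze` API. The following is a rough sketch, not a tested procedure. The base URL, the index name `statuses` (real deployments may use a prefix) and the helper names are assumptions; only Ruby's standard library is used.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build (but do not send) a POST /<index>/_analyze request.
def build_analyze_request(base_url, index, analyzer, text)
  uri = URI.join(base_url, "/#{index}/_analyze")
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate(analyzer: analyzer, text: text)
  [uri, request]
end

# Send the request and return the token strings Elasticsearch produced.
# Raises KeyError if the response has no "tokens" key, for example when
# the analyzer or a required plugin is missing.
def analyze_tokens(base_url, index, analyzer, text)
  uri, request = build_analyze_request(base_url, index, analyzer, text)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  JSON.parse(response.body).fetch('tokens').map { |token| token['token'] }
end

# Example call (assumed URL and index name):
#   analyze_tokens('http://localhost:9200', 'statuses', 'content', '繁體中文搜尋測試')
```

If the IK tokenizer and the `stconvert` char filter are active on the index, the returned tokens should be segmented words in Simplified Chinese rather than single characters. Existing indexes keep their old analysis settings until they are rebuilt.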

diff --git a/app/chewy/accounts_index.rb b/app/chewy/accounts_index.rb
--- a/app/chewy/accounts_index.rb
+++ b/app/chewy/accounts_index.rb
@@ -23,7 +23,7 @@ class AccountsIndex < Chewy::Index

     analyzer: {
       natural: {
-        tokenizer: 'standard',
+        tokenizer: 'ik_max_word',
         filter: %w(
           lowercase
           asciifolding
@@ -36,7 +36,7 @@ class AccountsIndex < Chewy::Index
       },

       verbatim: {
-        tokenizer: 'standard',
+        tokenizer: 'ik_max_word',
         filter: %w(lowercase asciifolding cjk_width),
       },

diff --git a/app/chewy/statuses_index.rb b/app/chewy/statuses_index.rb
--- a/app/chewy/statuses_index.rb
+++ b/app/chewy/statuses_index.rb
@@ -21,14 +21,23 @@ class StatusesIndex < Chewy::Index
       },
     },

+    char_filter: {
+      tsconvert: {
+        type: 'stconvert',
+        keep_both: false,
+        delimiter: '#',
+        convert_type: 't2s',
+      },
+    },
+
     analyzer: {
       verbatim: {
-        tokenizer: 'uax_url_email',
+        tokenizer: 'ik_max_word',
         filter: %w(lowercase),
       },

       content: {
-        tokenizer: 'standard',
+        tokenizer: 'ik_max_word',
         filter: %w(
           lowercase
           asciifolding
@@ -38,6 +47,7 @@ class StatusesIndex < Chewy::Index
           english_stop
           english_stemmer
         ),
+        char_filter: %w(tsconvert),
       },

       hashtag: {
diff --git a/app/chewy/tags_index.rb b/app/chewy/tags_index.rb
--- a/app/chewy/tags_index.rb
+++ b/app/chewy/tags_index.rb
@@ -4,15 +4,25 @@ class TagsIndex < Chewy::Index
   include DatetimeClampingConcern

   settings index: index_preset(refresh_interval: '30s'), analysis: {
+    char_filter: {
+      tsconvert: {
+        type: 'stconvert',
+        keep_both: false,
+        delimiter: '#',
+        convert_type: 't2s',
+      },
+    },
+
     analyzer: {
       content: {
-        tokenizer: 'keyword',
+        tokenizer: 'ik_max_word',
         filter: %w(
           word_delimiter_graph
           lowercase
           asciifolding
           cjk_width
         ),
+        char_filter: %w(tsconvert),
       },

       edge_ngram: {