nextcloud / fulltextsearch_elasticsearch

🔍 Use Elasticsearch to index the content of your Nextcloud
https://apps.nextcloud.com/apps/fulltextsearch_elasticsearch
GNU Affero General Public License v3.0

Wildcard search not available for content field #379

Open XueSheng-GIT opened 4 months ago

XueSheng-GIT commented 4 months ago

Description: When searching, only the fields title and share_names.user are considered for wildcard search. It is not possible to use wildcard search on the content of files. This makes things hard to find, especially in languages like German, where many words are joined into a single compound (in my example I used the word "Barbarenfreunde" and searched for "Freunde"). In addition, the current wildcard search always adds a fixed leading and trailing * (it only looks for *freunde* in title and share_names). It is not possible to place the available Elasticsearch wildcards * and ? yourself.

Steps to reproduce:

  1. Create a new markdown file (keep the default filename; it should not contain any text from the content below).
  2. Add content

    Aber die Barbaren waren stark behaart und hatten alle einen struppigen Barbarenbart (gar nicht apart), daraufhin schickte Barbara ihre Barbarenfreunde zum Barbarenbartbarbier

  3. Close file and let nextcloud index its content
  4. Open search in Nextcloud webif and enter one of the following terms:

    Freunde
    *Freunde

Search query is shown below (at the bottom of this issue).

Expected behaviour: Search result should show the above created file.

Actual behaviour: Search result does not show the above created file.

System details:

    OS: Ubuntu 22.04 LTS
    Nextcloud: 29.0.3
    Elasticsearch: 8.14.2
    Fulltextsearch: 29.0.0
    Fulltextsearch_Elasticsearch: 29.0.1
    Files_Fulltextsearch: 29.0.0

Search query created by nextcloud:

{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": [
            {
              "match_phrase_prefix": {
                "content": "*freunde"
              }
            },
            {
              "match_phrase_prefix": {
                "title": "*freunde"
              }
            },
            {
              "match_phrase_prefix": {
                "share_names.admin": "*freunde"
              }
            },
            {
              "wildcard": {
                "title": "**freunde*"
              }
            },
            {
              "wildcard": {
                "share_names.admin": "**freunde*"
              }
            },
            {
              "query_string": {
                "fields": [
                  "parts.comments"
                ],
                "query": "*freunde"
              }
            }
          ]
        }
      },
      "filter": [
        {
          "bool": {
            "must": {
              "term": {
                "provider": "files"
              }
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "owner.keyword": "admin"
                }
              },
              {
                "term": {
                  "users.keyword": "admin"
                }
              },
              {
                "term": {
                  "users.keyword": "__all"
                }
              },
              {
                "term": {
                  "groups.keyword": "admin"
                }
              },
              {
                "term": {
                  "groups.keyword": "beta"
                }
              },
              {
                "term": {
                  "groups.keyword": "home"
                }
              },
              {
                "term": {
                  "circles.keyword": "B1RPHEMEhjLcEloE7GzQvqyM3UJltkl"
                }
              },
              {
                "term": {
                  "circles.keyword": "TcR2hjPVaFv4uYkUlCO8p1MzhFlwcf4"
                }
              },
              {
                "term": {
                  "circles.keyword": "cvMawK84jxklszzOCcOb538nTKns2Yf"
                }
              }
            ]
          }
        },
        {
          "bool": {
            "should": []
          }
        },
        {
          "bool": {
            "must": []
          }
        },
        {
          "bool": {
            "must": []
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "content": {},
      "parts.comments": {}
    },
    "pre_tags": [
      ""
    ],
    "post_tags": [
      ""
    ]
  }
}
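
In the query above, content only appears in a match_phrase_prefix clause, while the wildcard clauses cover title and share_names. A rough model of why that misses the file, written as a Python sketch: it assumes the content field is analyzed into lowercase word tokens (similar to the standard analyzer, which may not match the app's real mapping) and uses `fnmatchcase` as a stand-in for Elasticsearch's `*` and `?` wildcards. The helper names are made up for illustration.

```python
import re
from fnmatch import fnmatchcase


def tokenize(text):
    """Assumed analyzer: split into word tokens and lowercase them."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def prefix_match(terms, prefix):
    """Rough model of a prefix-style match: the term must start with the input."""
    return [t for t in terms if t.startswith(prefix)]


def wildcard_match(terms, pattern):
    """Rough model of a wildcard query: the pattern is tested against each term."""
    return [t for t in terms if fnmatchcase(t, pattern)]


content_terms = tokenize(
    "daraufhin schickte Barbara ihre Barbarenfreunde zum Barbarenbartbarbier"
)

print(prefix_match(content_terms, "freunde"))      # []
print(wildcard_match(content_terms, "*freunde*"))  # ['barbarenfreunde']
```

Under these assumptions, "freunde" is never the start of an indexed term, so only a wildcard clause on content can reach the compound word.
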
XueSheng-GIT commented 4 months ago

At first glance, wildcard queries (especially with a leading wildcard) do not seem to be recommended because of potentially slow search performance. The ngram/edge_ngram tokenizer seems to be preferred for this case.
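
To illustrate the difference between the two tokenizers, here is a small Python sketch of what n-grams and edge n-grams of a single token look like. It is only a model of the idea, not Elasticsearch code, and the gram sizes 3 to 7 are picked purely for this example.

```python
def ngrams(token, min_gram, max_gram):
    """All substrings of `token` whose length is between min_gram and max_gram."""
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }


def edge_ngrams(token, min_gram, max_gram):
    """Only the prefixes of `token` whose length is between min_gram and max_gram."""
    return {token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)}


token = "barbarenfreunde"

print("freunde" in ngrams(token, 3, 7))       # True
print("freunde" in edge_ngrams(token, 3, 7))  # False
print("barbar" in edge_ngrams(token, 3, 7))   # True
```

In this model only the plain ngram variant would index "freunde" as a searchable term; edge n-grams cover prefixes such as "barbar" but not the tail of a compound word.
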

To keep things simple, I first looked into how to get the content field added to the search query and how to let users place the wildcards themselves. Changing the tokenizer (which would require re-indexing and an adapted search query) should imho be a long-term goal.

1. Add content field to search query (as wildcard)

Wildcard fields seem to be added here.

Adding the content field to this function seems to do the trick:

diff --git a/lib/Service/SearchService.php b/lib/Service/SearchService.php
index 333dfba..d2f62ec 100644
--- a/lib/Service/SearchService.php
+++ b/lib/Service/SearchService.php
@@ -128,6 +128,7 @@ private function searchQueryShareNames(ISearchRequest $request) {
        $request->addField('share_names.' . $username);

        $request->addWildcardField('title');
+       $request->addWildcardField('content');
        $request->addWildcardField('share_names.' . $username);
    }

2. Respect wildcards entered in search field

Predefined wildcards seem to be added here.

The following change checks the search word for existing wildcards and, if it finds any, does not add further wildcards.

diff --git a/lib/Service/SearchMappingService.php b/lib/Service/SearchMappingService.php
index f24c2abf..b8b42ac0 100644
--- a/lib/Service/SearchMappingService.php
+++ b/lib/Service/SearchMappingService.php
@@ -274,8 +274,13 @@ private function generateQueryContentFields(ISearchRequest $request, QueryConten
        }

        foreach ($request->getWildcardFields() as $field) {
+           $word = $content->getWord();
            if (!$this->fieldIsOutLimit($request, $field)) {
-               $queryFields[] = ['wildcard' => [$field => '*' . $content->getWord() . '*']];
+               if (strpos($word, '*') !== false || strpos($word, '?') !== false) {
+                   $queryFields[] = ['wildcard' => [$field => $word]];
+               } else {
+                   $queryFields[] = ['wildcard' => [$field => '*' . $word . '*']];
+               }
            }
        }
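
The decision made in that patch can be restated as a tiny Python sketch. This is only a model of the PHP logic above for quick experimentation; `wildcard_pattern` and `unpatched_pattern` are names invented here and do not exist in the app.

```python
def wildcard_pattern(word: str) -> str:
    """Patched behaviour: keep user-supplied wildcards, otherwise wrap the word."""
    if "*" in word or "?" in word:
        return word
    return "*" + word + "*"


def unpatched_pattern(word: str) -> str:
    """Original behaviour: always add a leading and a trailing *."""
    return "*" + word + "*"


print(unpatched_pattern("*freunde"))   # **freunde*
print(wildcard_pattern("*freunde"))    # *freunde
print(wildcard_pattern("freunde"))     # *freunde*
print(wildcard_pattern("??ruppigen"))  # ??ruppigen
```

The first line reproduces the `**freunde*` pattern visible in the query from the original post.
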

After applying those changes, files are found as expected (the issue mentioned in the original post is solved). I tried this on three instances I'm running and wasn't able to notice any practical performance impact (of course that's not representative in any way 😉... especially as I didn't mention any details about the size of the indexes involved).

I'm quite sure that wildcard search worked for the content field a couple of years ago (at least I created some personal documentation with wildcard search examples, which stopped working at some point). So it is possible that this function was disabled intentionally. It could also be that it was disabled by accident. If the performance impact is a general concern, wildcard search within the content field could be made an optional setting.

@R0Wi Do you have any insights in this regard? Any suggestion or alternative approach for solving this issue?

R0Wi commented 4 months ago

Hey @XueSheng-GIT, thanks for the comprehensive insights - really impressive :+1: Unfortunately, I don't have much historical knowledge about why the content field was removed from the wildcard search. But I also remember that this was possible in earlier versions, so for advanced users this will definitely be helpful. We might want to keep @ArtificialOwl in the loop; maybe he has some more info for us.

From my point of view you did pretty good research and the technical solution looks good to me. Maybe we could think about making the wildcard search in content configurable via settings, to avoid performance bottlenecks for users/admins who don't want to use this feature? Also, in your initial post you provided the full JSON body created by the app. I'd be interested in what this body looks like now, after applying your adjustments. Maybe you could give us some examples here as well?
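
To make the settings idea a bit more concrete, here is a minimal Python sketch of how such a switch could gate the field list. Everything in it is hypothetical: the flag `content_wildcard_enabled` is not an existing app setting, and the real change would live in the PHP service shown in the patch.

```python
def wildcard_fields(username, content_wildcard_enabled=False):
    """Return the fields used for wildcard clauses.

    `content_wildcard_enabled` is a hypothetical admin setting; with the
    default (False) the field list matches the current, unpatched behaviour.
    """
    fields = ["title"]
    if content_wildcard_enabled:
        fields.append("content")
    fields.append("share_names." + username)
    return fields


def wildcard_clauses(pattern, fields):
    """Build one Elasticsearch-style wildcard clause per field."""
    return [{"wildcard": {field: pattern}} for field in fields]


print(wildcard_fields("admin"))
# ['title', 'share_names.admin']
print(wildcard_clauses("*freunde*", wildcard_fields("admin", True)))
# [{'wildcard': {'title': '*freunde*'}}, {'wildcard': {'content': '*freunde*'}},
#  {'wildcard': {'share_names.admin': '*freunde*'}}]
```

Keeping the default off would leave existing installations unchanged, and admins who accept the possible performance cost could opt in.
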

XueSheng-GIT commented 4 months ago

@R0Wi thanks for your quick reply! Here are some examples of the updated JSON body with the patches applied (https://github.com/nextcloud/fulltextsearch_elasticsearch/issues/379#issuecomment-2227238970). All of these search terms match the example from the original post, and the related file is shown as a result. This is not the case with the default (unpatched) fulltextsearch.

1. Search term: Freunde

JSON body:

```
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": [
            { "match_phrase_prefix": { "content": "freunde" } },
            { "match_phrase_prefix": { "title": "freunde" } },
            { "match_phrase_prefix": { "share_names.admin": "freunde" } },
            { "wildcard": { "title": "*freunde*" } },
            { "wildcard": { "content": "*freunde*" } },
            { "wildcard": { "share_names.admin": "*freunde*" } },
            { "query_string": { "fields": [ "parts.comments" ], "query": "freunde" } }
          ]
        }
      },
      "filter": [
        { "bool": { "must": { "term": { "provider": "files" } } } },
        {
          "bool": {
            "should": [
              { "term": { "owner.keyword": "admin" } },
              { "term": { "users.keyword": "admin" } },
              { "term": { "users.keyword": "__all" } },
              { "term": { "groups.keyword": "admin" } },
              { "term": { "groups.keyword": "beta" } },
              { "term": { "groups.keyword": "home" } },
              { "term": { "circles.keyword": "B1RPHEMEhjLcEloE7GzQvqyM3UJltkl" } },
              { "term": { "circles.keyword": "TcR2hjPVaFv4uYkUlCO8p1MzhFlwcf4" } },
              { "term": { "circles.keyword": "cvMawK84jxklszzOCcOb538nTKns2Yf" } }
            ]
          }
        },
        { "bool": { "should": [] } },
        { "bool": { "must": [] } },
        { "bool": { "must": [] } }
      ]
    }
  },
  "highlight": {
    "fields": { "content": {}, "parts.comments": {} },
    "pre_tags": [ "" ],
    "post_tags": [ "" ]
  }
}
```

2. Search term: *Freunde

JSON body:

```
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": [
            { "match_phrase_prefix": { "content": "*freunde" } },
            { "match_phrase_prefix": { "title": "*freunde" } },
            { "match_phrase_prefix": { "share_names.admin": "*freunde" } },
            { "wildcard": { "title": "*freunde" } },
            { "wildcard": { "content": "*freunde" } },
            { "wildcard": { "share_names.admin": "*freunde" } },
            { "query_string": { "fields": [ "parts.comments" ], "query": "*freunde" } }
          ]
        }
      },
      "filter": [
        { "bool": { "must": { "term": { "provider": "files" } } } },
        {
          "bool": {
            "should": [
              { "term": { "owner.keyword": "admin" } },
              { "term": { "users.keyword": "admin" } },
              { "term": { "users.keyword": "__all" } },
              { "term": { "groups.keyword": "admin" } },
              { "term": { "groups.keyword": "beta" } },
              { "term": { "groups.keyword": "home" } },
              { "term": { "circles.keyword": "B1RPHEMEhjLcEloE7GzQvqyM3UJltkl" } },
              { "term": { "circles.keyword": "TcR2hjPVaFv4uYkUlCO8p1MzhFlwcf4" } },
              { "term": { "circles.keyword": "cvMawK84jxklszzOCcOb538nTKns2Yf" } }
            ]
          }
        },
        { "bool": { "should": [] } },
        { "bool": { "must": [] } },
        { "bool": { "must": [] } }
      ]
    }
  },
  "highlight": {
    "fields": { "content": {}, "parts.comments": {} },
    "pre_tags": [ "" ],
    "post_tags": [ "" ]
  }
}
```

3. Search term: +*ruppigen +Barbar*

JSON body:

```
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "must": [
            {
              "bool": {
                "should": [
                  { "match_phrase_prefix": { "content": "*ruppigen" } },
                  { "match_phrase_prefix": { "title": "*ruppigen" } },
                  { "match_phrase_prefix": { "share_names.admin": "*ruppigen" } },
                  { "wildcard": { "title": "*ruppigen" } },
                  { "wildcard": { "content": "*ruppigen" } },
                  { "wildcard": { "share_names.admin": "*ruppigen" } },
                  { "query_string": { "fields": [ "parts.comments" ], "query": "*ruppigen" } }
                ]
              }
            },
            {
              "bool": {
                "should": [
                  { "match_phrase_prefix": { "content": "barbar*" } },
                  { "match_phrase_prefix": { "title": "barbar*" } },
                  { "match_phrase_prefix": { "share_names.admin": "barbar*" } },
                  { "wildcard": { "title": "barbar*" } },
                  { "wildcard": { "content": "barbar*" } },
                  { "wildcard": { "share_names.admin": "barbar*" } },
                  { "query_string": { "fields": [ "parts.comments" ], "query": "barbar*" } }
                ]
              }
            }
          ]
        }
      },
      "filter": [
        { "bool": { "must": { "term": { "provider": "files" } } } },
        {
          "bool": {
            "should": [
              { "term": { "owner.keyword": "admin" } },
              { "term": { "users.keyword": "admin" } },
              { "term": { "users.keyword": "__all" } },
              { "term": { "groups.keyword": "admin" } },
              { "term": { "groups.keyword": "beta" } },
              { "term": { "groups.keyword": "home" } },
              { "term": { "circles.keyword": "B1RPHEMEhjLcEloE7GzQvqyM3UJltkl" } },
              { "term": { "circles.keyword": "TcR2hjPVaFv4uYkUlCO8p1MzhFlwcf4" } },
              { "term": { "circles.keyword": "cvMawK84jxklszzOCcOb538nTKns2Yf" } }
            ]
          }
        },
        { "bool": { "should": [] } },
        { "bool": { "must": [] } },
        { "bool": { "must": [] } }
      ]
    }
  },
  "highlight": {
    "fields": { "content": {}, "parts.comments": {} },
    "pre_tags": [ "" ],
    "post_tags": [ "" ]
  }
}
```

4. Search term: +"Barbaren waren" +??ruppigen

JSON body:

```
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "must": [
            {
              "bool": {
                "should": [
                  { "match_phrase_prefix": { "content": "barbaren waren" } },
                  { "match_phrase_prefix": { "title": "barbaren waren" } },
                  { "match_phrase_prefix": { "share_names.admin": "barbaren waren" } },
                  { "wildcard": { "title": "*barbaren waren*" } },
                  { "wildcard": { "content": "*barbaren waren*" } },
                  { "wildcard": { "share_names.admin": "*barbaren waren*" } },
                  { "query_string": { "fields": [ "parts.comments" ], "query": "barbaren waren" } }
                ]
              }
            },
            {
              "bool": {
                "should": [
                  { "match_phrase_prefix": { "content": "??ruppigen" } },
                  { "match_phrase_prefix": { "title": "??ruppigen" } },
                  { "match_phrase_prefix": { "share_names.admin": "??ruppigen" } },
                  { "wildcard": { "title": "??ruppigen" } },
                  { "wildcard": { "content": "??ruppigen" } },
                  { "wildcard": { "share_names.admin": "??ruppigen" } },
                  { "query_string": { "fields": [ "parts.comments" ], "query": "??ruppigen" } }
                ]
              }
            }
          ]
        }
      },
      "filter": [
        { "bool": { "must": { "term": { "provider": "files" } } } },
        {
          "bool": {
            "should": [
              { "term": { "owner.keyword": "admin" } },
              { "term": { "users.keyword": "admin" } },
              { "term": { "users.keyword": "__all" } },
              { "term": { "groups.keyword": "admin" } },
              { "term": { "groups.keyword": "beta" } },
              { "term": { "groups.keyword": "home" } },
              { "term": { "circles.keyword": "B1RPHEMEhjLcEloE7GzQvqyM3UJltkl" } },
              { "term": { "circles.keyword": "TcR2hjPVaFv4uYkUlCO8p1MzhFlwcf4" } },
              { "term": { "circles.keyword": "cvMawK84jxklszzOCcOb538nTKns2Yf" } }
            ]
          }
        },
        { "bool": { "should": [] } },
        { "bool": { "must": [] } },
        { "bool": { "must": [] } }
      ]
    }
  },
  "highlight": {
    "fields": { "content": {}, "parts.comments": {} },
    "pre_tags": [ "" ],
    "post_tags": [ "" ]
  }
}
```

XueSheng-GIT commented 2 months ago

@ArtificialOwl Do you have any insight into why the content field is not part of the wildcard search? Any recommendation on how to proceed with this topic? As mentioned by @R0Wi, an additional setting could be an option.