Open XueSheng-GIT opened 4 months ago
At first glance, wildcard queries (especially those with a leading wildcard) are not recommended because of potentially slow search performance. The ngram/edge_ngram tokenizer seems to be preferred for this case.
To keep things simple, I first looked into how to get the content field added to the search query and how to let users define the wildcards themselves. Changing the tokenizer (which would require re-indexing and an adapted search query) should imho be a long-term goal.
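To illustrate why n-gram indexing is the usual alternative to leading wildcards, here is a minimal Python sketch of what the two tokenizer families would emit for a single token. It is not Elasticsearch code: the helper names `ngrams` and `edge_ngrams` and the 3..7 gram range are assumptions chosen for illustration only.

```python
def ngrams(token: str, min_gram: int, max_gram: int) -> set[str]:
    """All substrings of length min_gram..max_gram (the idea behind an ngram tokenizer)."""
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }


def edge_ngrams(token: str, min_gram: int, max_gram: int) -> set[str]:
    """Only prefixes of length min_gram..max_gram (the idea behind an edge_ngram tokenizer)."""
    return {token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)}


token = "barbarenfreunde"
# An infix such as "freunde" becomes its own indexed term with ngram ...
print("freunde" in ngrams(token, 3, 7))
# ... but not with edge_ngram, which only covers the start of the token.
print("freunde" in edge_ngrams(token, 3, 7))
print("barbar" in edge_ngrams(token, 3, 7))
```

With n-grams the lookup becomes an exact term match at query time, at the cost of a larger index and a re-index. Note that edge n-grams alone would not cover the "Freunde" inside "Barbarenfreunde" case from this issue.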
1. Add content field to search query (as wildcard)
Wildcard fields seem to be added here.
Adding the content field to this function seems to do the trick:
diff --git a/lib/Service/SearchService.php b/lib/Service/SearchService.php
index 333dfba..d2f62ec 100644
--- a/lib/Service/SearchService.php
+++ b/lib/Service/SearchService.php
@@ -128,6 +128,7 @@ private function searchQueryShareNames(ISearchRequest $request) {
$request->addField('share_names.' . $username);
$request->addWildcardField('title');
+ $request->addWildcardField('content');
$request->addWildcardField('share_names.' . $username);
}
2. Respect wildcards entered in search field
Predefined wildcards seem to be added here.
The following change checks for existing wildcards and, if any are present, avoids adding additional ones.
diff --git a/lib/Service/SearchMappingService.php b/lib/Service/SearchMappingService.php
index f24c2abf..b8b42ac0 100644
--- a/lib/Service/SearchMappingService.php
+++ b/lib/Service/SearchMappingService.php
@@ -274,8 +274,13 @@ private function generateQueryContentFields(ISearchRequest $request, QueryConten
}
foreach ($request->getWildcardFields() as $field) {
+ $word = $content->getWord();
if (!$this->fieldIsOutLimit($request, $field)) {
- $queryFields[] = ['wildcard' => [$field => '*' . $content->getWord() . '*']];
+ if (strpos($word, '*') !== false || strpos($word, '?') !== false) {
+ $queryFields[] = ['wildcard' => [$field => $word]];
+ } else {
+ $queryFields[] = ['wildcard' => [$field => '*' . $word . '*']];
+ }
}
}
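The decision made by the patch above can be restated outside PHP as a small Python sketch. The function name `wildcard_clause` is hypothetical. Only the branching logic and the shape of the resulting clause are taken from the diff.

```python
def wildcard_clause(field: str, word: str) -> dict:
    """Build one wildcard clause, respecting wildcards the user typed.

    Mirrors the patched PHP branch: if the word already contains '*' or '?',
    it is passed through unchanged; otherwise it is wrapped as '*word*'.
    """
    if "*" in word or "?" in word:
        pattern = word
    else:
        pattern = f"*{word}*"
    return {"wildcard": {field: pattern}}


# Plain word: keeps the previous default behaviour.
print(wildcard_clause("content", "Freunde"))
# User-supplied wildcards: used as entered.
print(wildcard_clause("content", "*Freunde"))
print(wildcard_clause("content", "??ruppigen"))
```

One consequence of this design is that a term such as `Barbar*` no longer gets an implicit leading wildcard, so users who type their own wildcards get exactly the pattern they entered.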
After applying those changes, files are found as expected (the issue mentioned in the original post is solved). I tried this on three instances I'm running and wasn't able to notice any practical performance impact (of course that's not representative in any way 😉... especially as I haven't given any details about the size of the indexes involved).
I'm quite sure that wildcard search was working for the content field a couple of years ago (at least I created some personal documentation with wildcard search examples which stopped working at some point). Thus, it is possible that this function was disabled intentionally. It could also be that it was disabled by accident. If performance impact is a general concern, wildcard search within the content field could be made optional.
@R0Wi Do you have any insights in this regard? Any suggestion or alternative approach for solving this issue?
Hey @XueSheng-GIT, thanks for the comprehensive insights - really impressive :+1: Unfortunately, I don't have much historical knowledge about the content field being removed from the wildcard search. But I also remember that this was possible in earlier versions, so for advanced users this will definitely be helpful. We might want to keep @ArtificialOwl in the loop, maybe he has some more info for us.
From my point of view you did some thorough research and the technical solution looks good to me. Maybe we could think about making the wildcard search in content configurable via settings to avoid any performance bottlenecks for users/admins who don't want to use this feature? Also, in your initial post you provided the full JSON body being created by the app. I'd be interested in what this body looks like now, after applying your adjustments. Maybe you could give us some examples here as well?
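A configurable variant could look roughly like the following Python sketch. It is an assumption-laden illustration, not the app's API: the flag `content_wildcard_enabled` and the function `wildcard_fields` are hypothetical names, and only the three field names come from the patch discussed above.

```python
def wildcard_fields(username: str, content_wildcard_enabled: bool) -> list[str]:
    """Return the fields to query with wildcards.

    'content' is only included when the (hypothetical) admin setting is on,
    so instances that do not opt in keep the current, cheaper behaviour.
    """
    fields = ["title"]
    if content_wildcard_enabled:
        fields.append("content")
    fields.append(f"share_names.{username}")
    return fields


print(wildcard_fields("alice", content_wildcard_enabled=False))
print(wildcard_fields("alice", content_wildcard_enabled=True))
```

Defaulting such a flag to off would keep existing installations unchanged, while defaulting it to on would restore the older behaviour for everyone. Which default fits better is a maintainer decision.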
@R0Wi thanks for your quick reply! Here are some examples of the updated JSON body after applying the patches (https://github.com/nextcloud/fulltextsearch_elasticsearch/issues/379#issuecomment-2227238970). All of these examples match the initially mentioned example, and the related file is presented as a result. This is not the case for the default (unpatched) fulltextsearch.
1. Search term: Freunde
2. Search term: *Freunde
3. Search term: +*ruppigen +Barbar*
OPTION_MUST
4. Search term: +"Barbaren waren" +??ruppigen
OPTION_MUST
@ArtificialOwl Do you have any insight into why the content field is not part of the wildcard search? Any recommendation on how to proceed with this topic? As mentioned by @R0Wi, an additional setting could be an option.
Description When searching, only the fields title and share_names.user are considered for wildcard search. It's not possible to use wildcard search for the content of files. Especially for languages like German, it's hard to find something because many words are joined into a single compound word (in my example I used the word "Barbarenfreunde" and I'm searching for "Freunde"). In addition, the current wildcard search only uses a fixed leading/trailing * (wildcard search only looks for *freunde* in title and share_names). It's not possible to define the available Elasticsearch wildcards * and ? yourself.
Steps to reproduce:
Search query is shown below (at the bottom of this issue).
Expected behaviour Search result should show the above created file.
Actual behaviour Search result does not show the above created file.
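The gap between expected and actual behaviour comes down to term matching: a compound is indexed as one term, and a plain term only matches the whole term. The following Python sketch imitates the two wildcard characters (* for any sequence of characters, ? for exactly one character) with a regular expression. It is a simplified model, not Elasticsearch itself: the helper names are hypothetical, escape handling is ignored, and lowercasing both sides is an assumption standing in for the analyzer.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate '*' and '?' into regex equivalents; everything else is literal."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any sequence, including empty
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def term_matches(pattern: str, term: str) -> bool:
    """True if the pattern matches the entire term (not just a part of it)."""
    return wildcard_to_regex(pattern.lower()).fullmatch(term.lower()) is not None


term = "Barbarenfreunde"
for pattern in ("Freunde", "*Freunde", "*freunde*", "Barbar*", "??rbarenfreunde"):
    print(pattern, term_matches(pattern, term))
```

In this model, "Freunde" without wildcards does not match the compound, while each of the wildcard patterns does, which is the behaviour the issue asks for on the content field.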
System details OS: Ubuntu 22.04 LTS Nextcloud: 29.0.3 Elasticsearch: 8.14.2 Fulltextsearch: 29.0.0 Fulltextsearch_Elasticsearch: 29.0.1 Files_Fulltextsearch: 29.0.0
Search query created by nextcloud: