Open jnweiger opened 1 year ago
This does not happen with the Legacy connector.
This is known.
In order to search substrings efficiently, we have to split the string. We tokenize the string in groups of 2 and 3 chars. For example, Photos
will be tokenized as Ph
, ho
, ot
, to
, os
, Pho
, hot
, oto
, tos
. Prefixes will get additional tokens from another tokenizer configured, so we have extra tokens for Ph
and Pho
. Note that this applies per word in the file name.
The input will be tokenized the same way.
For example: Example.odt
should be tokenized as Ex
, xa
, am
, mp
, pl
, le
, od
, dt
, Exa
, xam
, amp
, mpl
, ple
, odt
. If we apply the same to the fluxcomp
input, one of the tokens we get is mp
, which matches the a token of the Example.odt
. That's why that result appears.
For the result of fluxcom
, I guess it's a bug in the web UI. Double check that the search input is sent correctly to the server. Sometimes it seems it doesn't send the input, or it sends the previous input.
I guess we have to match all the tokens generated from the input in order to provide a match, not just any of them. I'll check alternatives.
This is known. In order to search substrings efficiently, we have to split the string. We tokenize the string in groups of 2 and 3 chars. For example,
Photos
will be tokenized asPh
,ho
,ot
,to
,os
,Pho
,hot
,oto
,tos
. Prefixes will get additional tokens from another tokenizer configured, so we have extra tokens forPh
andPho
. Note that this applies per word in the file name. The input will be tokenized the same way.For example:
Example.odt
should be tokenized asEx
,xa
,am
,mp
,pl
,le
,od
,dt
,Exa
,xam
,amp
,mpl
,ple
,odt
. If we apply the same to thefluxcomp
input, one of the tokens we get ismp
, which matches the a token of theExample.odt
. That's why that result appears.For the result of
fluxcom
, I guess it's a bug in the web UI. Double check that the search input is sent correctly to the server. Sometimes it seems it doesn't send the input, or it sends the previous input.I guess we have to match all the tokens generated from the input in order to provide a match, not just any of them. I'll check alternatives.
Yes, any is basically a boolean OR. That is very different from the semantics of matching the entire word. Matching all tokens (as in boolean AND) comes closer, but still is not identical to the semantics of matching the entire word.
If I understand the tokenization of Photos
correctly, then this will (even with boolean AND) happily match Pho-Bo soup on hot motor in toscana
or anything long enough to contain all the tokens.
This kind of tokenization seems to be a flawed approach.
I'm checking a way to compare against all the tokens, which should give better results, but even with that there will be false positives. It could be fine if the false positives appear at the end of the results though.
Added a new commit in https://github.com/owncloud/search_elastic/pull/319. It should work better now.
There have been some changes in the indexes, you need to either use a fresh elasticsearch instance, or reset and rebuild the indexes (occ search:index:reset
and then occ search:index:rebuild
)
Re-tested with todays commits in #319 and a fresh instance.
Rathsbergstr
finds the 6 pdf skeleton files, that have this word near the end.
Raths
Raths*
*ill*
fluxcompensator
-> no more "random matches" from unrelated files. GOOD. I assume it is intentional, that * needs to be appended for a prefix match? Correct?
boolean logic with OR no longer works.
rath* OR owncloud
rath OR owncloud
rath* AND NOT owncloud
Expected behaviour:
OR
operator should work, if explicitly given.AND
is okay, if no operator is specified. I assume it is intentional, that * needs to be appended for a prefix match? Correct?
I don't expect users to enter the "*" char. We're searching exact words in the content of the files, that's why you don't find "Rathsbergstr" in the content when searching for "Raths". I think using wildcards, in general, has a bad performance, so we aren't using them implicitly.
boolean logic with OR no longer works.
Yes, I've noticed. I don't have a good explanation on this one other than it has to do with the way we're searching with the connector.
It might be caused by the minimum_should_match
value set by the search, but changing it will likely affect the substring searches in the names.
For an usability perspective, I think it's fine because if you're looking for a file, each term should help you to narrow down the list.
We probably have to choose among having substring matches in the names, boolean operators and searches by specific fields, and the last one seems too good to let go. If substring matches are a requirement, we'll probably have to drop the boolean operators (unless someone knows how to fix it)
I've removed the minimum_should_match
from the search, so hopefully it works now.
It's a change in the search query, so there is no need to reindex everything. It should work with the data that is already indexed (since the last change yesterday.)
This should fix the OR / AND logic: https://github.com/owncloud/search_elastic/pull/319/commits/a671d7b18a1d7a8c004dc2e83e380171ea44619a Needs testing, then merge and release.
Seen with search-elastic-230+refactor-connector branch 20230714
occ config:app:set search_elastic connectors_search --value "RelevanceV2"
fluxcompensator
fluxcompen
fluxcom
flux
The number of hits decreases, when a shortened term is searched. BAD fluxcompensator does not occur in any of the 15 matches. BAD
Expected behaviour: