[QA] too many matches for `fluxcompensator`

jnweiger commented 1 year ago

Seen with search-elastic-230+refactor-connector branch 20230714

occ config:app:set search_elastic connectors_search --value "RelevanceV2"

fluxcompensator

fluxcompen

fluxcom

flux

The number of hits decreases, when a shortened term is searched. BAD fluxcompensator does not occur in any of the 15 matches. BAD

Expected behaviour:

number of hits should increase, when a shorter prefix is searched.
no false positives.

jnweiger commented 1 year ago

This does not happen with the Legacy connector.

jvillafanez commented 1 year ago

This is known. In order to search substrings efficiently, we have to split the string. We tokenize the string in groups of 2 and 3 chars. For example, Photos will be tokenized as Ph, ho, ot, to, os, Pho, hot, oto, tos. Prefixes will get additional tokens from another tokenizer configured, so we have extra tokens for Ph and Pho. Note that this applies per word in the file name. The input will be tokenized the same way.

For example: Example.odt should be tokenized as Ex, xa, am, mp, pl, le, od, dt, Exa, xam, amp, mpl, ple, odt. If we apply the same to the fluxcomp input, one of the tokens we get is mp, which matches the a token of the Example.odt. That's why that result appears.

For the result of fluxcom, I guess it's a bug in the web UI. Double check that the search input is sent correctly to the server. Sometimes it seems it doesn't send the input, or it sends the previous input.

I guess we have to match all the tokens generated from the input in order to provide a match, not just any of them. I'll check alternatives.

jnweiger commented 1 year ago

This is known. In order to search substrings efficiently, we have to split the string. We tokenize the string in groups of 2 and 3 chars. For example, Photos will be tokenized as Ph, ho, ot, to, os, Pho, hot, oto, tos. Prefixes will get additional tokens from another tokenizer configured, so we have extra tokens for Ph and Pho. Note that this applies per word in the file name. The input will be tokenized the same way.

For example: Example.odt should be tokenized as Ex, xa, am, mp, pl, le, od, dt, Exa, xam, amp, mpl, ple, odt. If we apply the same to the fluxcomp input, one of the tokens we get is mp, which matches the a token of the Example.odt. That's why that result appears.

For the result of fluxcom, I guess it's a bug in the web UI. Double check that the search input is sent correctly to the server. Sometimes it seems it doesn't send the input, or it sends the previous input.

I guess we have to match all the tokens generated from the input in order to provide a match, not just any of them. I'll check alternatives.

Yes, any is basically a boolean OR. That is very different from the semantics of matching the entire word. Matching all tokens (as in boolean AND) comes closer, but still is not identical to the semantics of matching the entire word.

If I understand the tokenization of Photos correctly, then this will (even with boolean AND) happily match Pho-Bo soup on hot motor in toscana or anything long enough to contain all the tokens.

This kind of tokenization seems to be a flawed approach.

jvillafanez commented 1 year ago

I'm checking a way to compare against all the tokens, which should give better results, but even with that there will be false positives. It could be fine if the false positives appear at the end of the results though.

jvillafanez commented 1 year ago

Added a new commit in https://github.com/owncloud/search_elastic/pull/319. It should work better now. There have been some changes in the indexes, you need to either use a fresh elasticsearch instance, or reset and rebuild the indexes (occ search:index:reset and then occ search:index:rebuild)

jnweiger commented 1 year ago

Re-tested with todays commits in #319 and a fresh instance.

Standard skeleton files, and
adding one text file jabber.txt with content "twas brilling and the slithy toves did gyre and gimble in the wabe, all mimsey were the borogoves and the mome raths outgrabe."

`Rathsbergstr`

finds the 6 pdf skeleton files, that have this word near the end.

`Raths`

finds the text snippet in jabber.txt OK
no longer finds the Rathsbergstr in the PDFs. OKish

`Raths*`

finds both, Rathsbergstr, and mome raths outgrabe. OK

`ill`

finds will, still, illict, an brilling. OK

`fluxcompensator`

no results. OK

-> no more "random matches" from unrelated files. GOOD. I assume it is intentional, that * needs to be appended for a prefix match? Correct?

jnweiger commented 1 year ago

boolean logic with OR no longer works.

`rath* OR owncloud`

finds the six PDF files with ownCloud and Rathsbergstr. - OK
does not find the mome raths in jabber.txt - BAD.

`rath OR owncloud`

finds nothing.
it should still find the six PDFs that contain ownCloud

`rath* AND NOT owncloud`

finds the mome raths in jabber.txt, but no PDFs. So the boolean operators AND, and AND NOT work correctly.

Expected behaviour:

the OR operator should work, if explicitly given.
default AND is okay, if no operator is specified.

jvillafanez commented 1 year ago

I assume it is intentional, that * needs to be appended for a prefix match? Correct?

I don't expect users to enter the "*" char. We're searching exact words in the content of the files, that's why you don't find "Rathsbergstr" in the content when searching for "Raths". I think using wildcards, in general, has a bad performance, so we aren't using them implicitly.

boolean logic with OR no longer works.

Yes, I've noticed. I don't have a good explanation on this one other than it has to do with the way we're searching with the connector. It might be caused by the minimum_should_match value set by the search, but changing it will likely affect the substring searches in the names.

For an usability perspective, I think it's fine because if you're looking for a file, each term should help you to narrow down the list.

We probably have to choose among having substring matches in the names, boolean operators and searches by specific fields, and the last one seems too good to let go. If substring matches are a requirement, we'll probably have to drop the boolean operators (unless someone knows how to fix it)

jvillafanez commented 1 year ago

I've removed the minimum_should_match from the search, so hopefully it works now. It's a change in the search query, so there is no need to reindex everything. It should work with the data that is already indexed (since the last change yesterday.)

jnweiger commented 1 year ago

This should fix the OR / AND logic: https://github.com/owncloud/search_elastic/pull/319/commits/a671d7b18a1d7a8c004dc2e83e380171ea44619a Needs testing, then merge and release.

owncloud / search_elastic