Phrase search removes stopwords and then don't match exact phrase

danielbasso commented 3 weeks ago

Bug Description:

Issue Summary:

When performing a phrase search on a full-text field 'title' that includes stopwords, no results are returned even when an exact match exists in the data.

Detailed Description:

Let's suppose I have a full-text field called 'title' in my index. One of the documents has the title: "Apples and Banana"

When I execute a phrase search using: MATCH('@(title) "Apples and Banana"')

I expect to retrieve the document with that exact title. However, no results are returned.

After some experimentation, I found that searching for MATCH('@(title) "Apples Banana"') (removing the stopword 'and') yields the same result (no matches). Other cases behave similarly. This behavior suggests that Manticore is stripping out stopwords during phrase searches, which prevents exact phrase matches when stopwords are included.

Why This Is a Problem:

Phrase searches are intended to match exact sequences of words in the text, preserving both the order and the content—including stopwords. Removing stopwords before executing a phrase search undermines the purpose of an exact match search.

Expected Behavior:

The phrase search MATCH('@(title) "Apples and Banana"') should return the document with the title "Apples and Banana".

Manticore Search Version:

6.3.6

Operating System Version:

Any

Have you tried the latest development version?

No

Internal Checklist:

^{To be completed by the assignee. Check off tasks that have been completed or are not applicable.}

- [x] Implementation completed - [x] Tests developed - [x] Documentation updated - [x] Documentation reviewed - [x] [Changelog](https://docs.google.com/spreadsheets/d/1mz_3dRWKs86FjRF7EIZUziUDK_2Hvhd97G0pLpxo05s/edit?pli=1&gid=1102439133) updated

tomatolog commented 3 weeks ago

it could be better to provide the complete example to reproduce the issue here locally as we have test_17 that check exactly the same cases phrase matching vs stop words and all works fine there

danielbasso commented 3 weeks ago

@tomatolog no problem

Let's break it down on two examples. First one, without stopwords definition:

EXAMPLE 1: No stopword definition


DROP TABLE IF EXISTS test_table;

CREATE TABLE test_table(
    `id` int,
    `body` text stored indexed 
) 

INSERT INTO `test_table` VALUES
( 1, 'walking shoes' ),
( 2, 'try walking in my shoes' ),
( 3, 'Microsoft. The Office.' ),
( 4, 'Microsoft Office' ),
( 5, 'apple and oranges' );

SELECT * from test_table where MATCH ('@(body) "apple and oranges"');

Now, with stopwords:

EXAMPLE 2: With stopword definition


DROP TABLE IF EXISTS test_table;

CREATE TABLE test_table(
    `id` int,
    `body` text stored indexed 
) stopwords = 'en' morphology = 'lemmatize_en_all, libstemmer_en' html_strip = '1'  stopwords_unstemmed = '1' index_exact_words='1' blend_chars='+,&' rt_mem_limit='512M' 

INSERT INTO `test_table` VALUES
( 1, 'walking shoes' ),
( 2, 'try walking in my shoes' ),
( 3, 'Microsoft. The Office.' ),
( 4, 'Microsoft Office' ),
( 5, 'apple and oranges' );

SELECT * from test_table where MATCH ('@(body) "apple and oranges"');

sanikolaev commented 3 weeks ago

As noted in the documentation :

Stop words are not indexed, but they do affect keyword positions. For example, if "the" is a stop word, and document 1 contains the phrase "in office" while document 2 contains the phrase "in the office," searching for "in office" as an exact phrase will only return the first document, even though "the" is skipped as a stop word in the second document. This behavior can be modified using the stopword_step directive.

So pls consider using the stopword_step setting. Here's an example:

mysql> drop table if exists t; create table t(f text) stopwords='en' stopword_step='0'; insert into t values(1, 'apple and oranges'); select * from t where match('"apple and oranges"'); select * from t where match('"apple oranges"');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(f text) stopwords='en' stopword_step='0'
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t values(1, 'apple and oranges')
--------------

Query OK, 1 row affected (0.01 sec)

--------------
select * from t where match('"apple and oranges"')
--------------

+------+-------------------+
| id   | f                 |
+------+-------------------+
|    1 | apple and oranges |
+------+-------------------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 0ms ---

--------------
select * from t where match('"apple oranges"')
--------------

+------+-------------------+
| id   | f                 |
+------+-------------------+
|    1 | apple and oranges |
+------+-------------------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 0ms ---

tomatolog commented 3 weeks ago

seems stopwords_unstemmed cause the issue as without this setting all works fine

CREATE TABLE test_table(
    `id` int,
    `body` text stored indexed 
) stopwords = 'en' morphology = 'lemmatize_en_all, libstemmer_en' html_strip = '1'  index_exact_words='1' blend_chars='+,&' rt_mem_limit='512M';

mysql> SELECT * from test_table where MATCH ('@(body) "apple and oranges"');
+------+-------------------+
| id   | body              |
+------+-------------------+
|    5 | apple and oranges |
+------+-------------------+
1 row in set (0.01 sec)
--- 1 out of 1 results in 6ms ---

danielbasso commented 3 weeks ago

Hi @sanikolaev

Understood, thanks for the clarification. The stopword_step setting is kinda abstract, but I got your point. But, as I see, if I use this setting, now stopwords will be kinda of 'ignored' in the search, producing side effects like the last example:

--------------
select * from t where match('"apple oranges"')
--------------

+------+-------------------+
| id   | f                 |
+------+-------------------+
|    1 | apple and oranges |
+------+-------------------+

Is there a way to match('"apple and oranges"') return only "apple and oranges" and match('"apple oranges"') return only "apple oranges" nothing less, nothing more?

Edit: I say this because I thought of replacing any stopword by '*' before the phrase search, but decided against it exactly because I didn't want any 'extra' results

danielbasso commented 3 weeks ago

seems stopwords_unstemmed cause the issue as without this setting all works fine

CREATE TABLE test_table(
  `id` int,
  `body` text stored indexed 
) stopwords = 'en' morphology = 'lemmatize_en_all, libstemmer_en' html_strip = '1'  index_exact_words='1' blend_chars='+,&' rt_mem_limit='512M';

mysql> SELECT * from test_table where MATCH ('@(body) "apple and oranges"');
+------+-------------------+
| id   | body              |
+------+-------------------+
|    5 | apple and oranges |
+------+-------------------+
1 row in set (0.01 sec)
--- 1 out of 1 results in 6ms ---

Hi @tomatolog, this scenario could be considered a bug? Complementing what wrote in my last reply to @sanikolaev, the ideal world would be:

1) Use stopwords_unstemmed to remove undesired effects; 2) match('"apple and oranges"') return only "apple and oranges", nothing less, nothing more; 3) match('"apple oranges"') return only "apple oranges" nothing less, nothing more

sanikolaev commented 3 weeks ago

now stopwords will be kinda of 'ignored' in the search

Well, that's the purpose of stopwords, isn't it? If you want and to be treated as a stopword in some cases and as a regular word in others, the only way to do it seems to be manually adjusting your search query. For example, when and is inside a phrase, don't include it; when it's outside a phrase, exclude it.

danielbasso commented 3 weeks ago

@sanikolaev, I think that's exactly what stopwords are for outside the context of exact phrase searches. For an exact search, though, I’d like stopwords to be treated just like any other word—which makes sense why there’s a test specifically for that

sanikolaev commented 3 weeks ago

For an exact search, though, I’d like stopwords to be treated just like any other word

This is not how stopwords work in Manticore. A stopword is removed both when indexing text and when making a search query, for custom behaviour, I'm afraid you have to exclude a stopword before passing the query to Manticore.

However, it seems like a bug to me that "apple and b" cannot find the document containing apple and b when these options are used:

mysql> drop table if exists t; create table t(f text) stopwords='en' morphology='lemmatize_en_all' index_exact_words='1'; insert into t values(1, 'apple and b'); select * from t where match('"apple and b"');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(f text) stopwords='en' morphology='lemmatize_en_all' index_exact_words='1'
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t values(1, 'apple and b')
--------------

Query OK, 1 row affected (0.01 sec)

--------------
select * from t where match('"apple and b"')
--------------

Empty set (0.00 sec)
--- 0 out of 0 results in 0ms ---

We'll look into this.

tomatolog commented 3 weeks ago

Hi @tomatolog, this scenario could be considered a bug?

I not sure about is it a bug or wrong combination of the options such as

morphology = 'lemmatize_en_all, libstemmer_en' stopwords_unstemmed = '1'

not sure why use lemmatize_en_all and libstemmer_en together along with the stopwords_unstemmed .

I need first to debug the case how it got processed from the indexing to query parsing and matching to make sure is it a bug and should be fixed or wrong options combo and the combo should be prohibited.

For now I'd reduce options count and make sure all your cases work fine.

tomatolog commented 2 weeks ago

I've just fixed the issue described here at https://github.com/manticoresoftware/manticoresearch/commit/d59d8d31a6411862cccc7e11d76ea5714f8de4e7

You need to install package from the dev repository after CI pass and reindex your data from scratch as indexer and daemon with the bug index data with wrong hit positions that is why all position related full-text operators failed to match such cases with the wrong indexed data.

danielbasso commented 1 week ago

@tomatolog

I don't know if we will be able to test it right away, but I'll keep you posted. Thanks a lot.

manticoresoftware / manticoresearch