Closed danielbasso closed 2 weeks ago
it could be better to provide the complete example to reproduce the issue here locally as we have test_17 that check exactly the same cases phrase matching vs stop words and all works fine there
@tomatolog no problem
Let's break it down on two examples. First one, without stopwords definition:
DROP TABLE IF EXISTS test_table;
CREATE TABLE test_table(
`id` int,
`body` text stored indexed
)
INSERT INTO `test_table` VALUES
( 1, 'walking shoes' ),
( 2, 'try walking in my shoes' ),
( 3, 'Microsoft. The Office.' ),
( 4, 'Microsoft Office' ),
( 5, 'apple and oranges' );
SELECT * from test_table where MATCH ('@(body) "apple and oranges"');
Now, with stopwords:
DROP TABLE IF EXISTS test_table;
CREATE TABLE test_table(
`id` int,
`body` text stored indexed
) stopwords = 'en' morphology = 'lemmatize_en_all, libstemmer_en' html_strip = '1' stopwords_unstemmed = '1' index_exact_words='1' blend_chars='+,&' rt_mem_limit='512M'
INSERT INTO `test_table` VALUES
( 1, 'walking shoes' ),
( 2, 'try walking in my shoes' ),
( 3, 'Microsoft. The Office.' ),
( 4, 'Microsoft Office' ),
( 5, 'apple and oranges' );
SELECT * from test_table where MATCH ('@(body) "apple and oranges"');
As noted in the documentation :
Stop words are not indexed, but they do affect keyword positions. For example, if "the" is a stop word, and document 1 contains the phrase "in office" while document 2 contains the phrase "in the office," searching for "in office" as an exact phrase will only return the first document, even though "the" is skipped as a stop word in the second document. This behavior can be modified using the stopword_step directive.
So pls consider using the stopword_step
setting. Here's an example:
mysql> drop table if exists t; create table t(f text) stopwords='en' stopword_step='0'; insert into t values(1, 'apple and oranges'); select * from t where match('"apple and oranges"'); select * from t where match('"apple oranges"');
--------------
drop table if exists t
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
create table t(f text) stopwords='en' stopword_step='0'
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
insert into t values(1, 'apple and oranges')
--------------
Query OK, 1 row affected (0.01 sec)
--------------
select * from t where match('"apple and oranges"')
--------------
+------+-------------------+
| id | f |
+------+-------------------+
| 1 | apple and oranges |
+------+-------------------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 0ms ---
--------------
select * from t where match('"apple oranges"')
--------------
+------+-------------------+
| id | f |
+------+-------------------+
| 1 | apple and oranges |
+------+-------------------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 0ms ---
seems stopwords_unstemmed
cause the issue as without this setting all works fine
CREATE TABLE test_table(
`id` int,
`body` text stored indexed
) stopwords = 'en' morphology = 'lemmatize_en_all, libstemmer_en' html_strip = '1' index_exact_words='1' blend_chars='+,&' rt_mem_limit='512M';
mysql> SELECT * from test_table where MATCH ('@(body) "apple and oranges"');
+------+-------------------+
| id | body |
+------+-------------------+
| 5 | apple and oranges |
+------+-------------------+
1 row in set (0.01 sec)
--- 1 out of 1 results in 6ms ---
Hi @sanikolaev
Understood, thanks for the clarification. The stopword_step
setting is kinda abstract, but I got your point. But, as I see, if I use this setting, now stopwords will be kinda of 'ignored' in the search, producing side effects like the last example:
--------------
select * from t where match('"apple oranges"')
--------------
+------+-------------------+
| id | f |
+------+-------------------+
| 1 | apple and oranges |
+------+-------------------+
Is there a way to match('"apple and oranges"')
return only "apple and oranges" and match('"apple oranges"')
return only "apple oranges" nothing less, nothing more?
Edit: I say this because I thought of replacing any stopword by '*' before the phrase search, but decided against it exactly because I didn't want any 'extra' results
seems
stopwords_unstemmed
cause the issue as without this setting all works fineCREATE TABLE test_table( `id` int, `body` text stored indexed ) stopwords = 'en' morphology = 'lemmatize_en_all, libstemmer_en' html_strip = '1' index_exact_words='1' blend_chars='+,&' rt_mem_limit='512M'; mysql> SELECT * from test_table where MATCH ('@(body) "apple and oranges"'); +------+-------------------+ | id | body | +------+-------------------+ | 5 | apple and oranges | +------+-------------------+ 1 row in set (0.01 sec) --- 1 out of 1 results in 6ms ---
Hi @tomatolog, this scenario could be considered a bug? Complementing what wrote in my last reply to @sanikolaev, the ideal world would be:
1) Use stopwords_unstemmed
to remove undesired effects;
2) match('"apple and oranges"') return only "apple and oranges", nothing less, nothing more;
3) match('"apple oranges"') return only "apple oranges" nothing less, nothing more
now stopwords will be kinda of 'ignored' in the search
Well, that's the purpose of stopwords, isn't it? If you want and
to be treated as a stopword in some cases and as a regular word in others, the only way to do it seems to be manually adjusting your search query. For example, when and
is inside a phrase, don't include it; when it's outside a phrase, exclude it.
@sanikolaev, I think that's exactly what stopwords are for outside the context of exact phrase searches. For an exact search, though, I’d like stopwords to be treated just like any other word—which makes sense why there’s a test specifically for that
For an exact search, though, I’d like stopwords to be treated just like any other word
This is not how stopwords work in Manticore. A stopword is removed both when indexing text and when making a search query, for custom behaviour, I'm afraid you have to exclude a stopword before passing the query to Manticore.
However, it seems like a bug to me that "apple and b"
cannot find the document containing apple and b
when these options are used:
mysql> drop table if exists t; create table t(f text) stopwords='en' morphology='lemmatize_en_all' index_exact_words='1'; insert into t values(1, 'apple and b'); select * from t where match('"apple and b"');
--------------
drop table if exists t
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
create table t(f text) stopwords='en' morphology='lemmatize_en_all' index_exact_words='1'
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
insert into t values(1, 'apple and b')
--------------
Query OK, 1 row affected (0.01 sec)
--------------
select * from t where match('"apple and b"')
--------------
Empty set (0.00 sec)
--- 0 out of 0 results in 0ms ---
We'll look into this.
Hi @tomatolog, this scenario could be considered a bug?
I not sure about is it a bug or wrong combination of the options such as
morphology = 'lemmatize_en_all, libstemmer_en' stopwords_unstemmed = '1'
not sure why use lemmatize_en_all
and libstemmer_en
together along with the stopwords_unstemmed
.
I need first to debug the case how it got processed from the indexing to query parsing and matching to make sure is it a bug and should be fixed or wrong options combo and the combo should be prohibited.
For now I'd reduce options count and make sure all your cases work fine.
I've just fixed the issue described here at https://github.com/manticoresoftware/manticoresearch/commit/d59d8d31a6411862cccc7e11d76ea5714f8de4e7
You need to install package from the dev repository after CI pass and reindex your data from scratch as indexer and daemon with the bug index data with wrong hit positions that is why all position related full-text operators failed to match such cases with the wrong indexed data.
@tomatolog
I don't know if we will be able to test it right away, but I'll keep you posted. Thanks a lot.
Bug Description:
Issue Summary:
When performing a phrase search on a full-text field 'title' that includes stopwords, no results are returned even when an exact match exists in the data.
Detailed Description:
Let's suppose I have a full-text field called 'title' in my index. One of the documents has the title: "Apples and Banana"
When I execute a phrase search using:
MATCH('@(title) "Apples and Banana"')
I expect to retrieve the document with that exact title. However, no results are returned.
After some experimentation, I found that searching for
MATCH('@(title) "Apples Banana"')
(removing the stopword 'and') yields the same result (no matches). Other cases behave similarly. This behavior suggests that Manticore is stripping out stopwords during phrase searches, which prevents exact phrase matches when stopwords are included.Why This Is a Problem:
Phrase searches are intended to match exact sequences of words in the text, preserving both the order and the content—including stopwords. Removing stopwords before executing a phrase search undermines the purpose of an exact match search.
Expected Behavior:
The phrase search
MATCH('@(title) "Apples and Banana"')
should return the document with the title "Apples and Banana".Manticore Search Version:
6.3.6
Operating System Version:
Any
Have you tried the latest development version?
No
Internal Checklist:
To be completed by the assignee. Check off tasks that have been completed or are not applicable.