Open parthm opened 5 years ago
Hi, thanks for your feedback!
There is no stop word handling implemented within search_cop currently and no DMBS related code regarding stopwords either. So, if MySQL isn't handling the stopwords properly, you first need to double-check the MySQL docs to fix it (if possible). I'd guess the +
operators prevent MySQL from removing them, but the +
operators are neccessary and can't be removed.
Personally, i don't want the DMBS/Searchengine/etc. to handle synonyms/stopwords/etc, as this often fails short when i18n or multi-word synonyms/stopwords are desired. Thus, i would do synonym/stopword mapping before passing the query to search cop as well. However, search_cop maybe could provide a synoym/stopword mapping engine:
search_scope :search do
attributes :title, :description
stopwords do |phrase|
redis.sismember("stopwords:en", phrase)
end
end
such that phrase
will be removed from the query in case the stopword is member of the redis stopword set for english. However, maybe you could double check the MySQL docs and document you findings first.
I'd guess the
+
operators prevent MySQL from removing them
That's correct. The search query entered is in this example is stranger in a strange land
. The same query done in SQL without the +
preceding the stopwords works fine.
mysql> SELECT artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('+stranger in a +strange' IN BOOLEAN MODE)));
+----------------------------+----------------------------------------------------------------------+
| artifact_name | title |
+----------------------------+----------------------------------------------------------------------+
| Stranger in a Strange Land | You can't copy the driver without indexing the back-end SSL circuit! |
| Stranger in a Strange Land | Digitized incremental challenge. |
+----------------------------+----------------------------------------------------------------------+
2 rows in set (0.01 sec)
However, search_cop maybe could provide a synoym/stopword mapping engine:
Agreed. Some sort of an optional pull in makes sense as different users may have different needs. It would certainly be very helpful if the users could simply set it up.
At the moment I have fixed it as below in the application controller code as shown below.
@posts = Post.search(remove_stopwords(params[:search][:query]))
# Private method much later ....
# SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
STOP_WORDS = %w{ a about an are as at be by com de en for from how i
in is it la of on or that the this to was what when where who will
with und the www }.freeze
def remove_stopwords(query)
words = query.split(/\s+/).select do |w|
(w.length >= 3) && (w.length <= 84) && !STOP_WORDS.include?(w)
end
words.join(' ')
end
Regarding the length checks above, The MySQL manual here has a note for InnoDB stating:
By default, words less than 3 characters in length or greater than 84 characters in length do not appear in an InnoDB full-text search index.
So, if the user were to put in x
, it would become +x
and cause the search to fail even though it's not a stop word. Hence, the length checks.
mysql> SELECT artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('+stranger in a +x +strange' IN BOOLEAN MODE)));
Empty set (0.00 sec)
The stop words for InnoDB are pull using the query mentioned on the page. The page also mentions MyISAM stopwords.
Just another thought from a solution perspective. Is it possible to disable the +
being added via an option? I would think this should work for InnoDB and MyISAM any perhaps PostgreSQL without having to muck around with stopwords, language specific stop words etc.
Without the plus, the query works just fine.
mysql> SELECT artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('stranger in a strange' IN BOOLEAN MODE)));
+----------------------------+----------------------------------------------------------------------+
| artifact_name | title |
+----------------------------+----------------------------------------------------------------------+
| Stranger in a Strange Land | You can't copy the driver without indexing the back-end SSL circuit! |
| Stranger in a Strange Land | Digitized incremental challenge. |
+----------------------------+----------------------------------------------------------------------+
2 rows in set (0.00 sec)
This would make the query more flexible and the user could add +
to specific words. At the moment the user needs to be right about each words, however, most of the time the user may be OK with the highest number of matches rather than all.
It may be OK to give results without foo
if the search query is stranger in a strange foo
.
mysql> SELECT artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('stranger in a strange foo' IN BOOLEAN MODE)));
+----------------------------+----------------------------------------------------------------------+
| artifact_name | title |
+----------------------------+----------------------------------------------------------------------+
| Stranger in a Strange Land | You can't copy the driver without indexing the back-end SSL circuit! |
| Stranger in a Strange Land | Digitized incremental challenge. |
+----------------------------+----------------------------------------------------------------------+
2 rows in set (0.00 sec)
i probably need to evaluate the mysql fulltext syntax/behaviour once again.
If i remember correctly, removing the +
led to strange mysql results sometimes (imo).
That said, adding an option to remove it wouldn't break things for anyone.
However, the option would probably be an all-or-nothing one, such that you can only fully disable all +
operators.
Thank you for considering this and for search_cop 😃 . Just wanted to mention that the above behavior is for MySQL 8.0. I don't have access to MySQL 5.7 or another version to check if that's any different.
Is there a way to ignore stop-words for MySQL full-text search? At the moment if we enter a search string that contains stop words, the stop words also get added to the SQL query with + signand the search result is empty.
Note that the above query returns an empty set.
The below query works fine as the stop words are removed.
The stop gap can be that at the application level we could filter stop words before submitting the query. However, I would think this should be a common issue and it would be great if search_cop could help with this.