mrkamel / search_cop

Search engine like fulltext query support for ActiveRecord
MIT License
825 stars 39 forks source link

Ignoring MySQL stop words for full-text search #43

Open parthm opened 5 years ago

parthm commented 5 years ago

Is there a way to ignore stop-words for MySQL full-text search? At the moment if we enter a search string that contains stop words, the stop words also get added to the SQL query with + signand the search result is empty.

mysql> SELECT  artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('+stranger +in +a +strange' IN BOOLEAN MODE)));
Empty set (0.00 sec)

Note that the above query returns an empty set.

The below query works fine as the stop words are removed.

mysql> SELECT  artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('+stranger +strange' IN BOOLEAN MODE)));
+----------------------------+----------------------------------------------------------------------+
| artifact_name              | title                                                                |
+----------------------------+----------------------------------------------------------------------+
| Stranger in a Strange Land | You can't copy the driver without indexing the back-end SSL circuit! |
| Stranger in a Strange Land | Digitized incremental challenge.                                     |
+----------------------------+----------------------------------------------------------------------+
2 rows in set (0.00 sec)

The stop gap can be that at the application level we could filter stop words before submitting the query. However, I would think this should be a common issue and it would be great if search_cop could help with this.

mrkamel commented 5 years ago

Hi, thanks for your feedback!

There is no stop word handling implemented within search_cop currently and no DMBS related code regarding stopwords either. So, if MySQL isn't handling the stopwords properly, you first need to double-check the MySQL docs to fix it (if possible). I'd guess the + operators prevent MySQL from removing them, but the + operators are neccessary and can't be removed.

Personally, i don't want the DMBS/Searchengine/etc. to handle synonyms/stopwords/etc, as this often fails short when i18n or multi-word synonyms/stopwords are desired. Thus, i would do synonym/stopword mapping before passing the query to search cop as well. However, search_cop maybe could provide a synoym/stopword mapping engine:

search_scope :search do
  attributes :title, :description

 stopwords do |phrase|
    redis.sismember("stopwords:en", phrase)
  end
end

such that phrase will be removed from the query in case the stopword is member of the redis stopword set for english. However, maybe you could double check the MySQL docs and document you findings first.

parthm commented 5 years ago

I'd guess the + operators prevent MySQL from removing them

That's correct. The search query entered is in this example is stranger in a strange land. The same query done in SQL without the + preceding the stopwords works fine.

mysql> SELECT  artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('+stranger in a +strange' IN BOOLEAN MODE)));
+----------------------------+----------------------------------------------------------------------+
| artifact_name              | title                                                                |
+----------------------------+----------------------------------------------------------------------+
| Stranger in a Strange Land | You can't copy the driver without indexing the back-end SSL circuit! |
| Stranger in a Strange Land | Digitized incremental challenge.                                     |
+----------------------------+----------------------------------------------------------------------+
2 rows in set (0.01 sec)

However, search_cop maybe could provide a synoym/stopword mapping engine:

Agreed. Some sort of an optional pull in makes sense as different users may have different needs. It would certainly be very helpful if the users could simply set it up.

At the moment I have fixed it as below in the application controller code as shown below.


    @posts = Post.search(remove_stopwords(params[:search][:query]))

   # Private method much later ....

    # SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
    STOP_WORDS = %w{ a about an are as at be by com de en for from how i
      in is it la of on or that the this to was what when where who will
      with und the www }.freeze
    def remove_stopwords(query)
      words = query.split(/\s+/).select do |w|
        (w.length >= 3) && (w.length <= 84) && !STOP_WORDS.include?(w)
      end
      words.join(' ')
    end

Regarding the length checks above, The MySQL manual here has a note for InnoDB stating:

By default, words less than 3 characters in length or greater than 84 characters in length do not appear in an InnoDB full-text search index.

So, if the user were to put in x, it would become +x and cause the search to fail even though it's not a stop word. Hence, the length checks.

mysql> SELECT  artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('+stranger in a +x +strange' IN BOOLEAN MODE)));
Empty set (0.00 sec)

The stop words for InnoDB are pull using the query mentioned on the page. The page also mentions MyISAM stopwords.

parthm commented 5 years ago

Just another thought from a solution perspective. Is it possible to disable the + being added via an option? I would think this should work for InnoDB and MyISAM any perhaps PostgreSQL without having to muck around with stopwords, language specific stop words etc.

Without the plus, the query works just fine.

mysql> SELECT  artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('stranger in a strange' IN BOOLEAN MODE)));
+----------------------------+----------------------------------------------------------------------+
| artifact_name              | title                                                                |
+----------------------------+----------------------------------------------------------------------+
| Stranger in a Strange Land | You can't copy the driver without indexing the back-end SSL circuit! |
| Stranger in a Strange Land | Digitized incremental challenge.                                     |
+----------------------------+----------------------------------------------------------------------+
2 rows in set (0.00 sec)

This would make the query more flexible and the user could add + to specific words. At the moment the user needs to be right about each words, however, most of the time the user may be OK with the highest number of matches rather than all.

It may be OK to give results without foo if the search query is stranger in a strange foo.

mysql> SELECT  artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('stranger in a strange foo' IN BOOLEAN MODE)));
+----------------------------+----------------------------------------------------------------------+
| artifact_name              | title                                                                |
+----------------------------+----------------------------------------------------------------------+
| Stranger in a Strange Land | You can't copy the driver without indexing the back-end SSL circuit! |
| Stranger in a Strange Land | Digitized incremental challenge.                                     |
+----------------------------+----------------------------------------------------------------------+
2 rows in set (0.00 sec)
mrkamel commented 5 years ago

i probably need to evaluate the mysql fulltext syntax/behaviour once again. If i remember correctly, removing the + led to strange mysql results sometimes (imo). That said, adding an option to remove it wouldn't break things for anyone. However, the option would probably be an all-or-nothing one, such that you can only fully disable all + operators.

parthm commented 5 years ago

Thank you for considering this and for search_cop 😃 . Just wanted to mention that the above behavior is for MySQL 8.0. I don't have access to MySQL 5.7 or another version to check if that's any different.