manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0

Ability to specify stopword list as part of create table command #2046

Open kroky opened 7 months ago

kroky commented 7 months ago

Is your feature request related to a problem? Please describe. Scripts executed by a web server sometimes have a hard time finding a safe directory in which to write the stopwords list file so that it is also readable by the manticore process. We have this problem in a stock Virtualmin/Tiki setup, where anything under /home/user is readable only by that user and the system tmp directory is sticky, which causes problems when deleting the stopwords files. Furthermore, because we frequently rebuild our indexes, it is inconvenient to keep track of the stopwords file and make sure it is deleted after the corresponding index is deleted.

Describe the solution you'd like. Can we specify the stopword list as part of the table creation command (in both SQL and HTTP JSON)? It could be a string of space-delimited words, a CSV line, a JSON-encoded string, or whatever you decide. Being able to specify the list at table creation time and forget about any corresponding files afterwards would be quite a relief.

Describe alternatives you've considered. Currently, we create the stopwords list file on demand, specify it at table creation, and try to remove it when the index is removed. However, we run into the readability and sticky-flag delete problems described above.
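For readers who have not hit this, the file-based workaround described above looks roughly like the following. This is only a sketch: the helper names, the table schema, and the directory are hypothetical, and the functions return SQL strings instead of talking to a server. It relies on CREATE TABLE accepting a file path in the `stopwords` option.

```python
import os
import tempfile


def write_stopwords_file(words, directory):
    """Write one stopword per line into a new file inside `directory`."""
    # mkstemp creates the file with mode 0600, which only the owner can read.
    fd, path = tempfile.mkstemp(prefix="stopwords_", suffix=".txt", dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write("\n".join(words) + "\n")
    # Loosen permissions so a different system user (the search daemon) can read it.
    os.chmod(path, 0o644)
    return path


def create_table_sql(table, stopwords_path):
    """Build a CREATE TABLE statement that points at the stopwords file."""
    # Hypothetical schema; only the stopwords option matters here.
    return f"CREATE TABLE {table}(title text) stopwords='{stopwords_path}'"


def drop_table_and_cleanup(table, stopwords_path):
    """Return the DROP statement and remove the now-orphaned stopwords file."""
    sql = f"DROP TABLE IF EXISTS {table}"
    # This is the step that fails in a sticky directory owned by someone else.
    if os.path.exists(stopwords_path):
        os.remove(stopwords_path)
    return sql
```

Every table rebuild has to repeat all three steps, and both the chmod and the final remove depend on directory permissions the script may not control.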

Additional context. It is totally fine to impose a maximum length on the stopword list, so that longer lists require a file while shorter lists can be passed as part of the command.

sanikolaev commented 6 months ago

Thanks for the issue @kroky.

The related task is https://github.com/manticoresoftware/manticoresearch/issues/2083 where a problem with backing up via mysqldump/restoring external files was revealed.

One solution is to make it possible to do:

create table 
  ...
  stopwords='a; the; smth'
  wordforms='running > run; ran > run'
  exceptions='AT&T > AT&T; MS Windows => ms windows'

Internally, it can still use external files created/updated automatically on create table / alter table. It's just important that SHOW CREATE TABLE uses the above format. It should also simplify things like:

What we need to ensure in this case is proper escaping, since exceptions, wordforms, and stopwords can include ; and other characters that may be sensitive in this context (a configuration file or an SQL command). Overly long values are another thing to think about.

kroky commented 6 months ago

Yes, that's a great idea @sanikolaev! Using CSV with ; as the separator and the normal CSV escaping rules should probably work here. I assume queries are already UTF-8 encoded, since specifying special characters in these lists will be common.
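To make the CSV idea concrete, here is a minimal Python sketch of how a client could serialize and parse such a list. It is not a proposal for the server-side parser. The function names are hypothetical, and the backslash escaping in `sql_quote` is an assumption about how the value would be embedded in a single-quoted SQL literal.

```python
import csv
import io


def encode_list(items):
    """Join items with ';', quoting only the fields that need it (CSV rules)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=";", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(items)
    # The writer appends a line terminator; drop it to get a single-line value.
    return buf.getvalue().rstrip("\r\n")


def decode_list(value):
    """Split a ';'-separated value back into items, honoring CSV quoting."""
    reader = csv.reader(
        io.StringIO(value), delimiter=";", quotechar='"', skipinitialspace=True
    )
    return next(reader)


def sql_quote(value):
    """Wrap a value in single quotes (assumed escaping: backslash for \\ and ')."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"
```

With this scheme an entry containing the separator round-trips as `"semi;colon"`, an embedded double quote is doubled, and a plain value such as `a; the; smth` needs no quoting at all.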

sanikolaev commented 6 months ago

There shouldn't be a problem with 2-, 3-, and 4-byte characters. Example:

``` mysql> drop table if exists t; create table t(f text); insert into t values(0, 'A ç 汉 🚀'); select * from t; -------------- drop table if exists t -------------- Query OK, 0 rows affected (0.01 sec) -------------- create table t(f text) -------------- Query OK, 0 rows affected (0.00 sec) -------------- insert into t values(0, 'A ç 汉 🚀') -------------- Query OK, 1 row affected (0.00 sec) -------------- select * from t -------------- +---------------------+---------------+ | id | f | +---------------------+---------------+ | 1515858028807585797 | A ç 汉 🚀 | +---------------------+---------------+ 1 row in set (0.00 sec) --- 1 out of 1 results in 0ms --- ➜ ~ curl -s 0:9308/sql\?mode=raw -d "query=insert%20into%20t%20values%280%2C%20%27A%20%C3%A7%20%E6%B1%89%20%F0%9F%9A%80%27%29%3B"|jq . [ { "total": 1, "error": "", "warning": "" } ] ➜ ~ curl -s 0:9308/sql\?mode=raw -d "query=select%20%2A%20from%20t" [{ "columns":[{"id":{"type":"long long"}},{"f":{"type":"string"}}], "data":[ {"id":1515858028807585798,"f":"A ç 汉 🚀"}, {"id":1515858028807585797,"f":"A ç 汉 🚀"} ], "total":2, ```

Discussed with the team. We're inclined to implement the following behavior changes:

We'll have another round of discussion to confirm the above.

sanikolaev commented 6 months ago

Discussed more. Updated spec:

sanikolaev commented 6 months ago

Blocked by https://github.com/manticoresoftware/manticoresearch/issues/2146

sanikolaev commented 5 months ago

Blocked by https://github.com/manticoresoftware/manticoresearch/issues/2146

Unblocked