Open kroky opened 7 months ago
Thanks for the issue @kroky.
The related task is https://github.com/manticoresoftware/manticoresearch/issues/2083 where a problem with backing up via mysqldump/restoring external files was revealed.
One solution is to make it possible to do:
create table
...
stopwords='a; the; smth'
wordforms='running > run; ran > run'
exceptions='AT&T > AT&T; MS Windows => ms windows'
Internally, it can still use external files created/updated automatically on create table
/ alter table
. It's just important that SHOW CREATE TABLE
uses the above format. It should also simplify things like:
What we need to ensure in this case is proper escaping, since exceptions, wordforms, and stopwords can include ;
and other characters that may be sensitive in this context (in the context of a configuration file or an SQL command). Too long values is another thing to think about.
Yes, that's a great idea @sanikolaev ! Using csv with ;
as separator and normal csv escaping procedures should probably work here. I assume queries should already be utf8 encoded as specifying special characters in these lists will be common.
There shouldn't be a problem with 2/3/4 byte characters. Example:
Discussed with the team. We're inclined to implement the following behavior changes:
exceptions
, wordforms
, stopwords
, and hitless_words
should be able to accept not only paths to files to take values from, but the values directly.exceptions/wordforms/stopwords/hitless_words = '<values>'
<values>
:
;
;
>
or =>
can be used to separate elements of a pair of values, i.e., from => to
or from > to
>
and =>
=
, >
, and ;
are to be escaped with \
to be used as non-special characters, e.g., AT\=&T\>\; => AT&T; smthelse > smth
meaning AT=T>; => AT&T
AND smthelse => smth
CREATE TABLE
). We won't support it in plain mode (in a configuration file).SHOW CREATE TABLE
should return not paths, but the values.create table
functionality can be implemented in Buddy, but it doesn't seem to be a big deal to do it right in the daemon.We'll have another round of discussion to confirm the above.
Discussed more. Updated spec:
exceptions
, wordforms
, stopwords
, and hitless_words
should be able to accept not only paths to files to take values from, but the values directly.exceptions/wordforms/stopwords/hitless_words = '<values>'
<values>
:
;
;
>
or =>
can be used to separate elements of a pair of values, i.e., from => to
or from > to
>
and =>
=
, >
, and ;
, ~
are to be escaped with \
to be used as non-special characters (rel. issue https://github.com/manticoresoftware/manticoresearch/issues/2146), e.g., AT\=&T\>\; => AT&T; smthelse > smth
meaning AT=T>; => AT&T
AND smthelse => smth
CREATE TABLE
). We won't support it in plain mode (in a configuration file).SHOW CREATE TABLE
should return not paths, but the values.create table
functionality can be implemented in Buddy, but it doesn't seem to be a big deal to do it right in the daemon.Blocked by https://github.com/manticoresoftware/manticoresearch/issues/2146
Unblocked
Is your feature request related to a problem? Please describe. Sometimes web server executed scripts have hard time finding a safe directory to write the stopwords list file that is also readable by the manticore process. We have this problem in stock Virtualmin/Tiki setup where anything under /home/user is only user-readable and system tmp directory is sticky, which introduces delete problems with the stopwords files. Furthermore, when we frequently rebuild our indexes, it is inconvenient to keep track of the stopwords file and make sure to delete it after the corresponding index is deleted.
Describe the solution you'd like Can we specify the stopword list as part of the table creation command (both sql or http json)? It could be a string of space-delimited words, a csv line, a json-encoded string or whatever you decide. Having the ability to specify the list at table creation time and forgetting about any corresponding files after that will be quite a relief.
Describe alternatives you've considered Currently, we are creating the stopwords list file on demand, specifying that on table creation and trying to remove it when index is removed. However, we run into readability and sticky-flag delete problems.
Additional context It is totally fine to impose a max length limit to the stopword list, so longer lists require a file while shorter lists might be passed as part of the command.