manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com

Is there a way to apply morphology to the data field in CALL PQ? #183

Closed dubadam closed 5 years ago

dubadam commented 5 years ago

Manticore Search version:
2.8.1
OS version:
Debian 9
Build version:
manticore_2.8.1-190306-3684198c-release-stemmer.stretch_amd64-bin.deb

Describe the problem

Description of the issue:

It seems that when CALL PQ is made, no morphology rules are applied to the queried text (the one in the data field). This means that simply iterating over all of the queries against the same text - but indexed with morphology - results in quite a different number of matches compared to CALL PQ.

Sample steps to reproduce this behaviour are provided.

Steps to reproduce:

Initialize some PQ index:

index pq_keywords
{
    type = percolate
    path = ...
    min_infix_len = 3
    rt_field = title
    rt_field = body
    rt_attr_uint = kwdid
    morphology = lemmatize_ru, lemmatize_en # are they of any use here?
    min_stemming_len = 3
    min_word_len = 3
    blend_chars = +, &, U+23
    index_exact_words = 1
}

insert into pq_keywords(query) values('Москва');

CALL PQ ('pq_keywords', 'поищем чего-нить в москвЕ', 0 AS docs, 0 AS docs_json, 0 AS verbose); - no matching queries

CALL PQ ('pq_keywords', 'поищем чего-нить в москвА', 0 AS docs, 0 AS docs_json, 0 AS verbose); - matching query exists

I assumed that both texts would result in the same query match... This severely limits the usability of percolate queries (for me).
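For contrast, a sketch of the baseline behavior described above: the same keyword run as a regular full-text query against an index built with the same morphology settings. The index name docs and its contents are hypothetical.

    -- with morphology = lemmatize_ru the query term is normalized to its
    -- lemma, so documents containing inflected forms such as 'москве' or
    -- 'москвой' should match too
    SELECT id FROM docs WHERE MATCH('Москва');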

tomatolog commented 5 years ago

I see no charset_table definition in your index; that might cause incorrect tokenization. Could you add a charset_table to your index?

dubadam commented 5 years ago

Added charset_table = non_cjk, restarted the daemon, truncated the RT index and refilled it.

didn't help

tomatolog commented 5 years ago

After you create an index, its configuration is stored in the index header; that is why neither a restart nor a truncate changes anything. You have to issue an ALTER RECONFIGURE statement, or stop the daemon, delete the index files, then start the daemon again.
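A minimal sketch of the first option, assuming the ALTER RTINDEX ... RECONFIGURE form documented for RT indexes in that era also applies to the percolate index here:

    -- re-read tokenization settings (charset_table, morphology, etc.)
    -- from the config file into the existing index header
    ALTER RTINDEX pq_keywords RECONFIGURE;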

dubadam commented 5 years ago

that DID help, thank you!

dubadam commented 5 years ago

...And I have to reopen this

While behavior improved in these cases:

CALL PQ ('pq_keywords', 'москвЕ', 0 AS docs, 0 AS docs_json, 0 AS verbose); - match

CALL PQ ('pq_keywords', 'москвА', 0 AS docs, 0 AS docs_json, 0 AS verbose); - match

a slightly longer word form still leads to no match:

CALL PQ ('pq_keywords', 'под москвОЙ', 0 AS docs, 0 AS docs_json, 0 AS verbose); - no match

tomatolog commented 5 years ago

Could you check with CALL KEYWORDS that москвОЙ gets stemmed to the same word as your query Москва? If these get stemmed to different terms, you have to use lemmatize_ru_all instead of lemmatize_ru to get all forms of your queries and documents.

dubadam commented 5 years ago

I can't run CALL KEYWORDS over a PQ index (it says 'not implemented'). I did it with another index - it performs as expected.

dubadam commented 5 years ago

lemmatize_ru / lemmatize_ru_all make no difference

githubmanticore commented 5 years ago

➤ Sergey Nikolaev commented:

"I did it with another index - performs as expected"

Do you mean they give the same lemma?

dubadam commented 5 years ago

CALL KEYWORDS ('москва москве москвой', 'verybigindex');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | москва    | москва     |
| 2    | москве    | москва     |
| 3    | москвой   | москва     |
+------+-----------+------------+

verybigindex config as follows:

index verybigindex
{
    type = plain
    source = src...
    path = /home/sphinxsearch/data/idx...
    docinfo = extern
    dict = keywords
    mlock = 0
    morphology = lemmatize_ru, lemmatize_en
    min_stemming_len = 3
    min_word_len = 3
    min_infix_len = 3
    blend_chars = +, &, U+23
    index_exact_words = 1
}

tomatolog commented 5 years ago

I created an index with this config:

index pq
{
    type = percolate
    path            = data/pq
    rt_field = title

    dict            = keywords
    morphology      = lemmatize_ru, lemmatize_en
    charset_table   = english, _, 0..9, russian
    min_stemming_len = 3
    min_word_len = 3
    min_infix_len = 3
    blend_chars = +, &, U+23
    index_exact_words = 1   
}

inserted a query into it: mysql -h0 -P 9306 -vv < insert.sql

--------------
insert into pq ( query ) values('Москва')
--------------

Query OK, 1 row affected

then checked matching, and all works fine as expected: mysql -h0 -P 9306 -vv < queries.sql

--------------
CALL PQ ('pq', 'москвой', 1 AS docs, 0 AS docs_json, 1 AS verbose)
--------------

id  documents
1   1
1 row in set

--------------
CALL PQ ('pq', 'москвОЙ', 1 AS docs, 0 AS docs_json, 1 AS verbose)
--------------

id  documents
1   1
1 row in set

--------------
CALL PQ ('pq', 'под москвОЙ', 1 AS docs, 0 AS docs_json, 1 AS verbose)
--------------

id  documents
1   1
1 row in set

--------------
CALL PQ ('pq', 'москвЕ', 1 AS docs, 0 AS docs_json, 1 AS verbose)
--------------

id  documents
1   1
1 row in set

--------------
CALL PQ ('pq', 'москвА', 1 AS docs, 0 AS docs_json, 1 AS verbose)
--------------

id  documents
1   1
1 row in set

Could you provide a complete example that reproduces this issue?

tomatolog commented 5 years ago

Hello, are you able to make a reproducible case? Could you provide it?

dubadam commented 5 years ago

Hi!

Long story short, I can confirm that your config works. (I had to refetch the dict files from the manticore site rather than keep the ones from the original sphinxsearch, despite them looking the same; stopped all daemons and made sure they were all dead; deleted all traces of old PQ indices; rebuilt a very basic sphinx.conf.)

After some research I figured out that the showstopper was the charset_table option. With charset_table = non_cjk it results in the aforementioned behavior, and with charset_table = english, _, 0..9, russian it works as expected. But I'll take a closer look at it after I feed it around 360K rows.
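One hedged way to see whether the charset_table setting interferes with lemmatization is to compare CALL KEYWORDS output across two otherwise identical regular indexes, one per setting (CALL KEYWORDS is not implemented for PQ indexes, as noted above; the index names are hypothetical):

    CALL KEYWORDS ('москвой', 'idx_non_cjk');  -- charset_table = non_cjk
    CALL KEYWORDS ('москвой', 'idx_explicit'); -- charset_table = english, _, 0..9, russian
    -- if the 'normalized' column differs (e.g. only one maps to 'москва'),
    -- the charset_table is breaking tokenization before the lemmatizer runs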

dubadam commented 5 years ago

Confirming that now everything works as expected.

Query time is very long (around 10 sec per document; the index seems to be completely in RAM but very close to scratching the 4G limit). I will try to work with batches (if they fit the query size - some documents are big).
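For reference, a sketch of the batching mentioned above, assuming the multi-document form of CALL PQ, which percolates several documents per round trip (the document strings are placeholders):

    CALL PQ ('pq_keywords', ('первый документ ...', 'второй документ ...'), 1 AS docs, 0 AS docs_json);
    -- with 1 AS docs the result shows which document numbers in the batch
    -- matched each stored query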

manticoresearch commented 5 years ago

@dubadam with what query and how many documents does it take 10 sec?

tomatolog commented 5 years ago

I'm going to close this ticket. You might create another ticket to investigate why PQ document matching takes 10 sec. However, please post the query stream: you said it's about 360K rows, so it's not clear whether some particular query causes the slow processing or it's just the amount of queries matched.

Besides trying batches, you might use the dist_threads searchd option to speed up processing.
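A minimal sketch of that option, assuming the searchd-section syntax of Manticore/Sphinx 2.x (the value 4 is only an example):

    searchd
    {
        # number of threads used to process a single request in parallel
        dist_threads = 4
    }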