jaiminpan / pg_jieba

PostgreSQL full-text search extension for Chinese
BSD 3-Clause "New" or "Revised" License

pg_jieba gives unexpected results #24

Closed · donnekgit closed this issue 6 years ago

donnekgit commented 6 years ago

I have installed pg_jieba on Ubuntu 16.04, PostgreSQL 9.5.11, GCC 5.4.0, cmake 3.5.1. However, it does not give the expected results.

qiezi=# select to_tsvector('jiebacfg', '一个普通随和一点的人。');
-[ RECORD 1 ]------------------------------------
to_tsvector | '一个':1 '一点':4 '普通':2 '随和':3

The output is missing the final characters (的人。). On the webapp, entering the same sentence gives:

["一个", "普通", "随和", "一点", "的人", "。"]

where those characters are included.

Using the sample query on the front page:

qiezi=# select * from to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所, 后在日本京都大学深造');
-[ RECORD 1 ]---------------------------------------------------------------------------------
to_tsvector | '中国科学院':5 '小明':1 '日本京都大学':10 '毕业':3 '深造':11 '硕士':2 '计算所':6

the words at positions 4, 7, 8, and 9 are missing compared with what the front page says the output should be:

'中国科学院':5 '于':4 '后':8 '在':9 '小明':1 '日本京都大学':10 '毕业':3 '深造':11 '硕士':2 '计算所':6 ',':7

and the webapp also includes all the characters:

["小明", "硕士", "毕业", "于", "中国科学院", "计算所", ",", " ", "后", "在", "日本京都大学", "深造"]

Is there some setting that needs to be adjusted so that the installed version of pg_jieba works as expected?

jaiminpan commented 6 years ago

Hi, the difference in results is probably caused by a different config in jieba. The missing words such as "后" and "在" are Chinese particles; they are essentially meaningless, just like the English word "the". The default config of most Chinese analyzers will filter out these particles. There is another Chinese analyzer project for Postgres called "pg_scws" in my GitHub; you can also try that one.
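
If you want to see which tokens are produced and which ones the dictionary drops, the built-in ts_debug function should show it. A sketch, assuming the 'jiebacfg' configuration from your example:

-- Inspect how each token is handled; an empty 'lexemes' array means the
-- token was recognised but discarded as a stop word.
select alias, token, dictionaries, lexemes
from ts_debug('jiebacfg', '一个普通随和一点的人。');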

donnekgit commented 6 years ago

Hi

But I don't understand why the code downloaded and compiled from the repo gives different results from the example on the repo's front page.

I should say that I want to use pg_jieba as a tokeniser, not for text search, so I want to retain all words rather than omit stopwords (particles, etc.).

Anyway, I downloaded and installed pg_scws (thanks for the suggestion!), and it works for text search: if "她令人紧张不安。" is in the surface field of record 3 of the utterances table, I get:

qiezi=# select to_tsvector('scwscfg', surface) from utterances where id=3;
        to_tsvector         
----------------------------
 '不安':3 '令人':1 '紧张':2

But unfortunately this omits 她, which makes the output useless as a tokeniser. Is there any way I can tell it not to omit stopwords?

jaiminpan commented 6 years ago

You can use the function 'ts_parse' to see the tokens. As far as I know, the Postgres "text search" module is for text search rather than tokenisation; pg_scws and pg_jieba are both built as text search parsers.
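
For example (a sketch, assuming pg_jieba registers its parser under the name 'jieba'; pg_scws may use a different parser name):

-- Raw tokens from the parser itself, before any dictionary filtering.
select tokid, token from ts_parse('jieba', '一个普通随和一点的人。');

-- The token types this parser can emit.
select * from ts_token_type('jieba');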

jaiminpan commented 6 years ago

Also, you can try making the stop-word dictionary file empty to keep the stop words.
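
Alternatively, staying in SQL, you could try a configuration that maps the parser's token types to the built-in 'simple' dictionary, which keeps every word by default. This is only a sketch: the parser name 'jieba' is assumed, the token-type aliases below are placeholders to be replaced with the ones ts_token_type('jieba') reports, and it only helps if the stop words are dropped by the dictionary mapping rather than inside the parser itself.

-- Hypothetical configuration that keeps all tokens; 'n', 'v', 'x' are
-- placeholder aliases, substitute the real ones from ts_token_type('jieba').
create text search configuration jiebacfg_keepall (parser = jieba);
alter text search configuration jiebacfg_keepall
    add mapping for n, v, x with simple;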

donnekgit commented 6 years ago

Except that ts_parse doesn't work with 'scwscfg', and I can't find the stopfile! :-)
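
(Presumably ts_parse wants a parser name rather than a configuration name; a sketch of looking up which parser a configuration uses in the system catalogs:)

-- Find the parser behind the 'scwscfg' configuration.
select p.prsname
from pg_ts_config c
join pg_ts_parser p on p.oid = c.cfgparser
where c.cfgname = 'scwscfg';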

I was going to parse the output to get the location of each word, but in fact you've put me on the right track anyway. I've now got SCWS installed as a PHP module, which means I can use it directly, and zhparser installs SCWS as a PostgreSQL extension if I want to use that. Thanks for your help!