Closed donnekgit closed 6 years ago
Hi, The different result may be caused by a different config in jieba. The missing words like "后" and "在" are Chinese particles; they are meaningless, just like the English word 'the'. The default config of most Chinese analyzers will filter out these particles. There is another Chinese analyzer project for Postgres called "pg_scws" on my GitHub; you can also try that one.
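As a sketch of checking this (the configuration name 'jiebacfg' and the example sentence are taken from pg_jieba's README; the actual output depends on which dictionary and stop-word files your build picked up):

```sql
-- Compare what the installed configuration produces for the README's example.
-- 'jiebacfg' is the configuration pg_jieba's README registers; adjust if yours differs.
SELECT to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');
-- If particles such as 后 and 在 are absent from the result, the stop-word
-- list is filtering them, which would explain the differing output.
```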
Hi
But I don't understand why the code downloaded and compiled from the repo gives different results from the example on the repo's front page?
I should say that I want to use pg_jieba as a tokeniser, not for text search, so I want to retain all words rather than omit stopwords (particles, etc.).
Anyway, I downloaded and installed pg_scws (thanks for the suggestion!), and it works for text search. If "她令人紧张不安。" is in the surface field of record 3 of the utterances table, I get:
qiezi=# select to_tsvector('scwscfg', surface) from utterances where id=3;
to_tsvector
----------------------------
'不安':3 '令人':1 '紧张':2
But unfortunately this omits 她, which means the output is useless as a tokeniser. Is there any way I can tell it not to omit stopwords?
You can use the function 'ts_parse' to see the tokens. As far as I know, the Postgres "text search" module is for text search, not tokenisation; pg_scws and pg_jieba must be text search parsers.
Moreover, you can try making the stop-word dictionary file empty to keep the stop words.
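A minimal sketch of inspecting raw tokens with the built-in ts_parse function (the parser name 'jieba' is an assumption here; run \dFp in psql to list the parsers actually installed):

```sql
-- List the raw tokens a parser produces, before any dictionary or
-- stop-word filtering is applied. tokid identifies the token type.
-- The parser name 'jieba' is a guess; check \dFp for the real name.
SELECT tokid, token FROM ts_parse('jieba', '她令人紧张不安。');
```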
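Assuming the stop-word list ships as a plain-text file under PostgreSQL's tsearch_data directory (the file name stop_words.utf8 and the location are both assumptions; list the directory to confirm what your install actually uses), emptying it might look like:

```shell
# Locate PostgreSQL's shared directory, where tsearch data usually lives.
SHAREDIR=$(pg_config --sharedir)
# stop_words.utf8 is a guessed file name; list the directory to confirm.
ls "$SHAREDIR/tsearch_data/"
# Back up the stop-word file, then truncate it so no words are filtered.
cp "$SHAREDIR/tsearch_data/stop_words.utf8" "$SHAREDIR/tsearch_data/stop_words.utf8.bak"
: > "$SHAREDIR/tsearch_data/stop_words.utf8"
```

You may need to restart the session (or the server) for the dictionary to be reloaded.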
Except that ts_parse doesn't work with 'scwscfg', and I can't find the stopfile! :-)
I was going to parse the output to get the location of each word, but in fact you've put me on the right track anyway. I've now got SCWS installed as a PHP module, which means I can use it directly, and zhparser installs SCWS as a PostgreSQL extension if I want to use that. Thanks for your help!
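For what it's worth, pulling (word, position) pairs out of a to_tsvector() string needs only a few lines of stdlib Python (tsvector_positions is a hypothetical helper written for this thread, not part of any extension):

```python
import re

def tsvector_positions(tsv: str):
    """Parse a to_tsvector() output string like "'不安':3 '令人':1 '紧张':2"
    into (lexeme, position) pairs, ordered by position. A lexeme may carry
    several comma-separated positions, so each one becomes its own pair."""
    pairs = []
    for lexeme, positions in re.findall(r"'([^']+)':([\d,]+)", tsv):
        for pos in positions.split(','):
            pairs.append((int(pos), lexeme))
    return [(lex, pos) for pos, lex in sorted(pairs)]

print(tsvector_positions("'不安':3 '令人':1 '紧张':2"))
# → [('令人', 1), ('紧张', 2), ('不安', 3)]
```

Note that the positions are word indices within the text, not character offsets, so words removed as stopwords simply never appear.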
I have installed pg_jieba on Ubuntu 16.04, PostgreSQL 9.5.11, GCC 5.4.0, cmake 3.5.1. However, it does not give the expected results.
The output is missing the final characters 的人。 On the webapp, entering the same sentence gives:
where those characters are included.
Using the sample query on the front page:
the words at 4, 7, 8, and 9 are missing compared with what the front page says should be the output:
and the webapp also includes all the characters:
Is there some setting that needs to be adjusted so that the installed version of pg_jieba works as expected?
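One way to see which dictionaries the installed configuration is actually applying is PostgreSQL's built-in ts_debug function (the configuration name 'jiebacfg' and the sentence fragment are taken from the README; substitute your own):

```sql
-- Show, token by token, which dictionary handled each token and which
-- lexemes survived. A row whose lexemes array is empty was discarded as
-- a stop word, which would account for the missing characters.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('jiebacfg', '小明硕士毕业于中国科学院计算所');
```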