similarJ / sphinx-for-chinese

Automatically exported from code.google.com/p/sphinx-for-chinese

Distributed search: Chinese search results are missing! #17

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
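Following the example in the book "Sphinx Search Beginner's Guide", I tried
distributed search. I split a table of more than 1.8 million records into two
halves and built one index for each half (items and items-2), using unigram
(single-character) tokenization for Chinese. I started two searchd instances
on the local machine (ports 9312 and 9313) and tested with a Java client program:
java test -p 9312  -i master -e  <search term>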

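Searching for English words works fine: the results are correct (compared
with my earlier non-distributed search).
When searching for Chinese words, however, results are missing. For example,
searching for "病" gives:
     '病' found 915 times in 906 documents
whereas the earlier non-distributed search gave:
     '病' found 7837 times in 7825 documents

Distributed search works for English words but returns incomplete results
for Chinese words. What is the cause?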

The two conf files are as follows:
# dis-1.conf file
source items
{
  type          = mysql
  sql_host      = localhost
  sql_user      = root
  sql_pass      = test
  sql_db        = data_monitor

  sql_query_pre = SELECT @total := count(sql_id) FROM sql_log_table

  sql_query_pre = SET @sql = CONCAT('SELECT * FROM sql_log_table limit 0,', CEIL(@total/2))

  sql_query_pre = PREPARE stmt FROM @sql

  sql_query        = EXECUTE stmt

}

index items
{
  source          = items
  path            = d:/data/items-distributed

  morphology      = none  
  min_word_len    = 1  
  charset_type    = utf-8  
  min_prefix_len  = 0  
  html_strip      = 1  
  charset_table   = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F  
  ngram_len       = 1  
  ngram_chars     = U+3000..U+2FA1F  

}

indexer
{
  mem_limit        = 128M
}

index master
{
    type        = distributed
    charset_type = utf-8

    # Local index to be searched
    local  = items
    # agent (index) to be searched
    agent = localhost:9313:items-2
}

searchd
{
  listen = 9312 
  log = d:/log/searchd-distributed.log
  query_log  = d:/log/query-distributed.log
  max_children  = 30
  max_matches     = 10000000 

  seamless_rotate = 1
  preopen_indexes = 1
  unlink_old = 1
  compat_sphinxql_magics = 0

  pid_file  = d:/log/searchd-distributed.pid
  binlog_path = 
}

# dis-2.conf file
source items
{
  type          = mysql
   # we will use remote host (first server)
  sql_host        = localhost
  sql_user        = root
  sql_pass      = test
  sql_db        = data_monitor
  sql_query_pre   = SET NAMES utf8

   sql_query_pre   = SELECT @total := count(sql_id) FROM sql_log_table

  sql_query_pre  = SET @sql = CONCAT('SELECT * FROM sql_log_table limit ', CEIL(@total/2), ',', CEIL(@total/2))

  # Prepare the sql statement
   sql_query_pre    = PREPARE stmt FROM @sql

   # Execute the prepared statement. This will return rows
   sql_query        = EXECUTE stmt

  # Once documents are fetched, drop the prepared statement
  sql_query_post          = DROP PREPARE stmt

}

index items-2
{
  source          = items
  path            = D:/data/items-2-distributed

  morphology      = none  
  min_word_len    = 1  
  charset_type    = utf-8  
  min_prefix_len  = 0  
  html_strip      = 1  
  charset_table   = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F  
  ngram_len       = 1  
  ngram_chars     = U+3000..U+2FA1F  
}

indexer
{
  mem_limit        = 128M
}

searchd
{
  listen = 9313 
  log = d:/log/searchd-distributed-2.log
  query_log  = d:/log/query-distributed-2.log
  max_children  = 30
  max_matches     = 10000000 
  seamless_rotate = 1
  preopen_indexes = 1
  unlink_old = 1
  compat_sphinxql_magics = 0 
  pid_file  = d:/log/searchd-distributed-2.pid
  binlog_path = 
}
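
The two source blocks above divide sql_log_table with LIMIT offset,count, where both values come from CEIL(@total/2). Below is a minimal Python sketch of that arithmetic. It assumes both queries see the rows in the same order, which the configs do not guarantee because neither query has an ORDER BY. The function names shard_limits and rows_covered are illustrative only and do not come from Sphinx or the book.

```python
import math


def shard_limits(total_rows: int) -> list[tuple[int, int]]:
    """Return (offset, count) pairs mirroring the LIMIT clauses in dis-1.conf and dis-2.conf."""
    half = math.ceil(total_rows / 2)  # same value as CEIL(@total/2)
    return [(0, half), (half, half)]


def rows_covered(total_rows: int) -> list[range]:
    """Row positions each shard would fetch from a table holding total_rows rows."""
    return [
        range(offset, min(offset + count, total_rows))
        for offset, count in shard_limits(total_rows)
    ]


if __name__ == "__main__":
    for total in (10, 11):
        sizes = [len(r) for r in rows_covered(total)]
        print(total, shard_limits(total), sizes)
```

Under that ordering assumption no row is skipped or fetched twice; for an odd row count the second shard simply receives one row fewer than the first.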

Original issue reported on code.google.com by homer2...@126.com on 8 Jan 2013 at 11:51

GoogleCodeExporter commented 8 years ago
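I checked dis-1.conf and found that I had forgotten to add this line
to source items: sql_query_pre = SET NAMES utf8
whereas source items in dis-2.conf does contain it.
No wonder Chinese searches returned only part of the results.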
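
One plausible way a missing SET NAMES utf8 loses Chinese matches is sketched below in Python. It rests on an assumption the thread does not confirm: that the connection falls back to a single-byte character set such as latin1, so characters it cannot represent reach the indexer as "?". Such a character lies outside the ngram_chars range and is not listed in charset_table, so it would never be indexed, while ASCII text passes through unchanged. The helper names are illustrative.

```python
# ngram_chars = U+3000..U+2FA1F, copied from the index definitions in both conf files
NGRAM_RANGE = (0x3000, 0x2FA1F)


def in_ngram_chars(ch: str) -> bool:
    """True if a single character falls inside the configured ngram_chars range."""
    low, high = NGRAM_RANGE
    return low <= ord(ch) <= high


def via_latin1_connection(text: str) -> str:
    """Simulate the ASSUMED lossy conversion: unrepresentable characters become '?'."""
    return text.encode("latin-1", errors="replace").decode("latin-1")


if __name__ == "__main__":
    original = "病"
    received = via_latin1_connection(original)
    print(repr(original), in_ngram_chars(original))
    print(repr(received), in_ngram_chars(received))
    print(repr(via_latin1_connection("abc")))
```

This would be consistent with the symptom reported in the issue, where English queries were unaffected and only Chinese queries lost results from the first index.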

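I rebuilt the index with indexer -c dis-1.conf --all and then started:
searchd -c dis-1.conf
searchd -c dis-2.conf

Running the Java program:
java test -p 9312  -i master -e  病
gives a result consistent with the earlier test against the index of the whole table:
  '病' found 7837 times in 7825 documents

I also tested with the SphinxSE interface and a MySQL client (such as SQLyog).
I created a table in MySQL: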
CREATE TABLE `sx_dis` (
  `id` bigint(20) unsigned NOT NULL,
  `weight` int(11) NOT NULL,
  `query` varchar(3072) NOT NULL,
  KEY `query` (`query`(1024))
) ENGINE=SPHINX DEFAULT CHARSET=utf8 CONNECTION='sphinx://127.0.0.1:9312/master'

The CONNECTION string points to port 9312 on the local machine and to the main index, master.

In the MySQL client I then ran:
SELECT SQL_NO_CACHE * FROM sx_dis WHERE QUERY='病;mode=extended'; 
SHOW ENGINE Sphinx STATUS;
which shows:
SPHINX  stats    total: 1000, total found: 7825, time: 14, words: 1
SPHINX  words    病:7825:7837 

This matches the result of the Java test program.
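
The comparison above can also be done mechanically. The following Python sketch parses the two output formats quoted in this thread and checks that they agree. Reading the SphinxSE words value as word:docs:hits is inferred from the numbers shown here (7825 documents, 7837 hits); the function names are illustrative.

```python
import re


def parse_java_line(line: str) -> tuple[str, int, int]:
    """Parse "'word' found H times in D documents" into (word, docs, hits)."""
    match = re.fullmatch(r"\s*'(.+)' found (\d+) times in (\d+) documents\s*", line)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    return match.group(1), int(match.group(3)), int(match.group(2))


def parse_se_words(value: str) -> list[tuple[str, int, int]]:
    """Parse the SphinxSE 'words' value, read as space-separated word:docs:hits entries."""
    entries = []
    for entry in value.split():
        word, docs, hits = entry.rsplit(":", 2)
        entries.append((word, int(docs), int(hits)))
    return entries


if __name__ == "__main__":
    java_stats = parse_java_line("  '病' found 7837 times in 7825 documents")
    se_stats = parse_se_words("病:7825:7837 ")
    print(java_stats, se_stats, java_stats in se_stats)
```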

Original comment by homer2...@126.com on 8 Jan 2013 at 1:00