hightman / scws

A simple, free, open-source Chinese word segmentation system, and a top choice for word segmentation in PHP!
http://www.xunsearch.com/scws/
Other
1.66k stars 348 forks source link

Hello hightman, why do I get an error when adding custom words from PHP? #56

Open nottellyou opened 6 years ago

nottellyou commented 6 years ago

    $so = scws_new();
    $so->set_charset('utf8');
    // set_dict and set_rule are not called here; the extension automatically tries the dictionary and rule files under the path configured in the ini
    //$dictPath = ini_get('scws.default.fpath').'/dict.utf8.xdb';
    //$so->set_dict($dictPath); // set the dictionary

    //$so->set_dict('/usr/local/scws/etc/dict.utf8.xdb');
    $so->add_dict('/usr/local/scws/etc/dict.user.txt'); // line 699 in the original file
    //$so->set_rule('/usr/local/scws/etc/rules.utf8.ini');

    $so->set_duality(true); // whether to automatically combine stray single characters into two-character words
    $so->set_ignore(true); // whether to drop special punctuation and similar symbols from the results
    $so->set_multi(1); // bitwise OR of 1 | 2 | 4 | 8: short words | bigrams | main single characters | all single characters

    $so->send_text("我是一个中国人,我会C++语言,我也有很多T恤衣服,我的衣服比我还重老司机遇上新能源遇上新能源这个分词怎么分");
    echo '<pre>';
    //$tmp = $so->get_result();
    //$tmp = $so->get_tops(6, '~V');
    $tmp = $so->get_tops(7);
    foreach ($tmp as $v)
    {
        print_r($v);
    }
    $so->close();

The error is always reported at line 699, $so->add_dict('/usr/local/scws/etc/dict.user.txt');. I want to add some custom words, for example 老司机.

What is going wrong here?

Thanks.

nottellyou commented 6 years ago

Got it: passing the SCWS_XDICT_TXT parameter as well makes it work.
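For anyone landing here later, the fix can be sketched roughly as follows. This is a minimal sketch, not the author's code: it assumes the scws PHP extension is installed, reuses the paths quoted earlier in this thread, and the helper name make_segmenter is hypothetical. Loading the main dictionary with set_dict before add_dict is also an assumption on my part, made so that the text dictionary supplements the main one instead of standing alone.

```php
<?php
// Hypothetical helper (not part of SCWS): builds a segmenter that uses a
// main XDB dictionary plus a plain-text user dictionary.
function make_segmenter(string $mainDict, string $userDict)
{
    $so = scws_new();
    $so->set_charset('utf8');
    // Assumption: load the main dictionary explicitly before adding the user one.
    $so->set_dict($mainDict);
    // The second argument tells SCWS that the file is plain text, not XDB.
    $so->add_dict($userDict, SCWS_XDICT_TXT);
    return $so;
}

// Only run the demo when the extension is actually available.
if (extension_loaded('scws')) {
    $so = make_segmenter(
        '/usr/local/scws/etc/dict.utf8.xdb',
        '/usr/local/scws/etc/dict.user.txt'
    );
    $so->send_text('老司机遇上新能源');
    while ($words = $so->get_result()) {
        print_r($words);
    }
    $so->close();
}
```

Without the second argument, add_dict treats the file as an XDB dictionary, which would explain the error on a .txt file.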

nottellyou commented 6 years ago

One more question: how can I remove modal particles and other words that will never be useful?

    Array ( [word] => 收入 [times] => 4 [weight] => 19.559999465942 [attr] => n )
    Array ( [word] => 可以 [times] => 4 [weight] => 18.680000305176 [attr] => v )
    Array ( [word] => 返利 [times] => 2 [weight] => 16.979999542236 [attr] => v )
    Array ( [word] => 不仅 [times] => 3 [weight] => 14.849999427795 [attr] => c )
    Array ( [word] => 也许 [times] => 3 [weight] => 14.819999694824 [attr] => d )
    Array ( [word] => 他们 [times] => 3 [weight] => 14.760000228882 [attr] => r )
    Array ( [word] => 拥有 [times] => 3 [weight] => 14.700000762939 [attr] => v )
    Array ( [word] => 优惠 [times] => 3 [weight] => 14.549999237061 [attr] => vn )
    Array ( [word] => 如果 [times] => 3 [weight] => 14.460000991821 [attr] => c )
    Array ( [word] => 财富 [times] => 3 [weight] => 14.400000572205 [attr] => n )
    Array ( [word] => 消费 [times] => 3 [weight] => 14.130000114441 [attr] => vn )
    Array ( [word] => 自己 [times] => 3 [weight] => 13.650000572205 [attr] => r )

How do I exclude words such as 如果, 自己, 不仅, 也许, and 他们 from this article's segmentation results?

hightman commented 6 years ago

Filter them out yourself by part of speech.
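One way to read this advice is to post-filter the get_tops() result in plain PHP. The sketch below is only an illustration: filter_by_attr is a hypothetical helper, and the sample rows are copied from the output above with the weights shortened.

```php
<?php
/**
 * Hypothetical helper: drop entries whose part-of-speech tag ('attr')
 * is listed in $excluded. Tags are compared exactly, so 'v' does not
 * remove 'vn'.
 */
function filter_by_attr(array $tops, array $excluded): array
{
    return array_values(array_filter(
        $tops,
        function (array $item) use ($excluded): bool {
            return !in_array($item['attr'], $excluded, true);
        }
    ));
}

// Sample rows taken from the output above (weights shortened).
$tops = [
    ['word' => '收入', 'times' => 4, 'weight' => 19.56, 'attr' => 'n'],
    ['word' => '不仅', 'times' => 3, 'weight' => 14.85, 'attr' => 'c'],
    ['word' => '也许', 'times' => 3, 'weight' => 14.82, 'attr' => 'd'],
    ['word' => '他们', 'times' => 3, 'weight' => 14.76, 'attr' => 'r'],
    ['word' => '财富', 'times' => 3, 'weight' => 14.40, 'attr' => 'n'],
];

// Remove conjunctions (c), adverbs (d) and pronouns (r).
$kept = filter_by_attr($tops, ['c', 'd', 'r']);
echo implode(',', array_column($kept, 'word')), "\n"; // 收入,财富
```

Filtering after the fact keeps the exclusion list in ordinary PHP code, where it is easy to extend with a stop-word list as well.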

Best Regards

hightman/海鳗


WeChat/Weibo: hightman GitHub: https://github.com/hightman

In reply to nottellyou's message of 18 October 2018, 10:58 AM.


nottellyou commented 6 years ago

I added a part-of-speech filter, $tmp = $so->get_tops(100, '~v,~d,~y,~e,~r,~a');, but it has no effect:

    Array ( [word] => 不仅 [times] => 3 [weight] => 14.849999427795 [attr] => c )
    Array ( [word] => 也许 [times] => 3 [weight] => 14.819999694824 [attr] => d )
    Array ( [word] => 他们 [times] => 3 [weight] => 14.760000228882 [attr] => r )
    Array ( [word] => 如果 [times] => 3 [weight] => 14.460000991821 [attr] => c )
    Array ( [word] => 财富 [times] => 3 [weight] => 14.400000572205 [attr] => n )
    Array ( [word] => 消费 [times] => 3 [weight] => 14.130000114441 [attr] => vn )
    Array ( [word] => 自己 [times] => 3 [weight] => 13.650000572205 [attr] => r )

Could someone point out which setting is wrong?

hightman commented 6 years ago

Use ~v,d,y,e,r,a; do not put a ~ in front of every tag.
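In code, the corrected filter string can be sketched like this. The helper exclude_attrs is hypothetical and only makes the "single leading ~" rule explicit; the get_tops call runs only when the scws extension is loaded.

```php
<?php
// Hypothetical helper: one leading "~" negates the whole comma-separated list.
function exclude_attrs(array $attrs): string
{
    return '~' . implode(',', $attrs);
}

$filter = exclude_attrs(['v', 'd', 'y', 'e', 'r', 'a']);
echo $filter, "\n"; // ~v,d,y,e,r,a

// Only call into SCWS when the extension is actually available.
if (extension_loaded('scws')) {
    $so = scws_new();
    $so->set_charset('utf8');
    $so->send_text('我是一个中国人,我会C++语言,我也有很多T恤衣服');
    print_r($so->get_tops(100, $filter));
    $so->close();
}
```

Note that 如果 and 不仅 are tagged c in the output above, so they would presumably still come back unless c is added to the list as well.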


In reply to nottellyou's message of 18 October 2018, 12:09 PM.


nottellyou commented 6 years ago

Are the SCWS part-of-speech tags the same as the ones listed here: https://blog.csdn.net/leiting_imecas/article/details/68484811?utm_source=blogxgwz1