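I already understand Paoding's various segmentation strategies, but I'd like to learn about one more; hoping Mr. Wang Zhi can help take a look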

mountainmoon / paoding

Automatically exported from code.google.com/p/paoding


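I already understand Paoding's various segmentation strategies, but I'd like to learn about one more; hoping Mr. Wang Zhi can help take a look #59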

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

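Steps to reproduce:
1. Segment “提供计算机设计方案” ("provide a computer design plan") using the longest-match segmentation mode.
2. The result is “提供/计算机/机设/设计方案/”.

Question: since I chose longest-match segmentation, I don't think overlapping (crossing) segmentation is needed. How can I turn the overlapping segmentation off? Can it be done in the configuration file? If the code has to be changed, is TokenCollector the thing to modify? Thanks.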

Original issue reported on code.google.com by yuweimin...@gmail.com on 18 Mar 2010 at 2:15
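
For readers unfamiliar with the behaviour being requested, here is a minimal sketch of plain forward maximum matching, which by construction never emits overlapping tokens. It is not Paoding code: the class `ForwardMaxMatch`, the tiny dictionary, and the `maxWordLen` parameter are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Paoding.
class ForwardMaxMatch {
    /** Greedy forward maximum matching: at each position take the longest dictionary word. */
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLen);
            // Shrink the window until it is a dictionary word or a single character.
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            out.add(text.substring(pos, end));
            pos = end; // resume after the emitted word, so tokens never overlap
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed toy dictionary; a real dictionary would be far larger.
        Set<String> dict = Set.of("提供", "计算机", "机设", "设计", "方案", "设计方案");
        // prints 提供/计算机/设计方案
        System.out.println(String.join("/", segment("提供计算机设计方案", dict, 4)));
    }
}
```

Characters that start no dictionary word fall through as single-character tokens. That is a simplification chosen for this sketch, not a statement about how Paoding treats unknown text.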

GoogleCodeExporter commented 9 years ago

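Thanks for the feedback first!

With the current segmentation strategies you do get the "overlapping" tokens you describe. Paoding does not ship an implementation without them by default, so for now they cannot be removed through configuration.

However, you can use Paoding's own implementation of strategy selection as a reference (TokenCollector, if I remember correctly) and write a new TokenCollector as an extension!

reno, what's your opinion?

By the way, replies to this issue won't necessarily come from one particular person, so if you don't address the question to a specific responder, more people may be able to help you.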

Original comment by qieqie.wang on 18 Mar 2010 at 2:30

GoogleCodeExporter commented 9 years ago

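Actually, even with longest-match segmentation some overlapping tokens need to be kept. For example,
红楼梦新探序言 => 红楼/红楼梦/新探/序言/红楼梦新探序言. Here “红楼” is involved in three overlaps. If overlaps were not kept, longest match would produce only "红楼梦新探序言", and a user searching for “红楼梦” would not find this result.
Overlapping segmentation is arguably one of the features that distinguishes Paoding from other segmenters, and it works in favor of Lucene search.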

Original comment by reno....@gmail.com on 18 Mar 2010 at 2:59
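
The recall argument above can be illustrated with a toy lookup. This sketch assumes a query term matches only if that exact term was indexed, which is how an exact term query behaves. The token lists are copied from the comment, not produced by running Paoding, and `OverlapRecallDemo` is a hypothetical name.

```java
import java.util.List;

// Hypothetical demo class; it models an index as a plain list of terms.
class OverlapRecallDemo {
    /** An exact term query matches only if that very term was indexed. */
    static boolean matches(List<String> indexedTerms, String queryTerm) {
        return indexedTerms.contains(queryTerm);
    }

    public static void main(String[] args) {
        // Tokens as listed in the comment above (overlaps kept).
        List<String> overlapping = List.of("红楼", "红楼梦", "新探", "序言", "红楼梦新探序言");
        // Tokens if only the single longest dictionary entry were emitted.
        List<String> longestOnly = List.of("红楼梦新探序言");

        System.out.println(matches(overlapping, "红楼梦")); // true
        System.out.println(matches(longestOnly, "红楼梦")); // false
    }
}
```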

GoogleCodeExporter commented 9 years ago

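OK, sorry about that. This is my first time filing an issue, so I didn't know the rules; I'll keep it in mind from now on. Our current project is somewhat unusual: it is not purely indexing and retrieval, but depends on term frequency, tokens, and the distance between terms, so I would prefer no overlaps.
I hope qieqie can give me some pointers on how to modify TokenCollector.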

Original comment by yuweimin...@gmail.com on 18 Mar 2010 at 9:19

GoogleCodeExporter commented 9 years ago

What I want is to add a new TokenCollector as an extension; I don't want to break Paoding's existing segmentation rules.

Original comment by yuweimin...@gmail.com on 18 Mar 2010 at 9:22

GoogleCodeExporter commented 9 years ago

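Paoding does not yet let you plug in a user-defined TokenCollector simply by editing the configuration. If you want to write your own extension,
you can use the implementation of net.paoding.analysis.analyzer.impl.MostWordsTokenCollector as a reference.
Alternatively, a workaround is to use a segmenter that better fits your requirements; for example, the smart-cn analyzer that ships with Lucene would meet your needs.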

Original comment by reno....@gmail.com on 18 Mar 2010 at 9:58

GoogleCodeExporter commented 9 years ago

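Thanks, reno. I'm worried that smart-cn's segmentation quality is not as good as Paoding's. Paoding has a dictionary that can be extended, so I'll still go with extending Paoding.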

Original comment by yuweimin...@gmail.com on 18 Mar 2010 at 10:44

GoogleCodeExporter commented 9 years ago

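Did earlier versions have a non-overlapping segmentation mode? If so, it would be enough to port the old TokenCollector over. I tried this once, taking the longest-match segmentation from the 2.0.4 Alpha release, and it worked without any problems. But writing or modifying it myself would probably take a lot of effort, so I hope some kind soul can help me solve this.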

Original comment by yuweimin...@gmail.com on 18 Mar 2010 at 11:17

GoogleCodeExporter commented 9 years ago

For “红楼梦新探序言”, longest match should not leave out the word “红楼梦”: “红楼梦” is longer than “红楼”, so the token produced should be 红楼梦 rather than 红楼.

Original comment by yuweimin...@gmail.com on 21 Mar 2010 at 3:31

GoogleCodeExporter commented 9 years ago

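Suppose I segment the sentence “红楼梦我爱你” and the result is “红楼/红楼梦/梦我/我爱/爱你/我爱你”.
The result clearly contains overlapping tokens.
Since I'm using maximum-match segmentation, I think the result should be “红楼梦/我爱你”, because “红楼梦” and “我爱你” are both the longest tokens. I think this reflects the meaning of the sentence better.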

Original comment by yuweimin...@gmail.com on 21 Mar 2010 at 3:40
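
One way to get from the overlapping candidate list to the expected “红楼梦/我爱你” is to post-filter the candidates: prefer longer words and drop anything that overlaps a word already kept. The sketch below is self-contained and does not use Paoding's TokenCollector interface. The `Candidate` type, its half-open `[begin, end)` offsets, and the `select` method are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-filter for illustration; not Paoding's TokenCollector.
class LongestNonOverlapping {
    /** A candidate word with its half-open character range [begin, end) in the text. */
    static final class Candidate {
        final String word;
        final int begin;
        final int end;

        Candidate(String word, int begin, int end) {
            this.word = word;
            this.begin = begin;
            this.end = end;
        }

        int length() {
            return end - begin;
        }
    }

    /** Keeps the longest candidates first, dropping any that overlap one already kept. */
    static List<String> select(List<Candidate> candidates) {
        List<Candidate> byLength = new ArrayList<>(candidates);
        // Longest first; ties broken by earlier start position.
        byLength.sort((a, b) -> a.length() != b.length()
                ? Integer.compare(b.length(), a.length())
                : Integer.compare(a.begin, b.begin));

        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : byLength) {
            boolean overlaps = false;
            for (Candidate k : kept) {
                if (c.begin < k.end && k.begin < c.end) { // ranges intersect
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps) {
                kept.add(c);
            }
        }

        kept.sort((a, b) -> Integer.compare(a.begin, b.begin)); // restore text order
        List<String> words = new ArrayList<>();
        for (Candidate k : kept) {
            words.add(k.word);
        }
        return words;
    }

    public static void main(String[] args) {
        // Candidates from the comment above, with offsets into 红楼梦我爱你.
        List<Candidate> candidates = List.of(
                new Candidate("红楼", 0, 2), new Candidate("红楼梦", 0, 3),
                new Candidate("梦我", 2, 4), new Candidate("我爱", 3, 5),
                new Candidate("爱你", 4, 6), new Candidate("我爱你", 3, 6));
        System.out.println(String.join("/", select(candidates))); // prints 红楼梦/我爱你
    }
}
```

Preferring the globally longest words is a design choice of this sketch. It is not identical to scanning left to right, and the two can disagree on ambiguous input.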

GoogleCodeExporter commented 9 years ago

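Haha, after a few days of digging I finally got rid of the overlapping segmentation. Paoding turns out to be far more extensible than I had imagined.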

Original comment by yuweimin...@gmail.com on 22 Mar 2010 at 6:54

GoogleCodeExporter commented 9 years ago

Could you explain how you solved this? My own project also needs longest match, and I'd like to remove the overlapping segmentation.

Original comment by yunbia...@gmail.com on 25 Apr 2010 at 12:23

GoogleCodeExporter commented 9 years ago

A question: how can I do maximum matching while avoiding overlapping segmentation?

Original comment by Baizhang...@gmail.com on 20 May 2013 at 3:42