mountainmoon / paoding

Automatically exported from code.google.com/p/paoding

Tokenization problem with Paoding adapted for Lucene 3.0.2 #70

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I checked out the Paoding source code and made some modifications. Part of the modified code is posted here:
http://blog.csdn.net/foamflower/archive/2010/07/09/5723361.aspx
Test code:
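import java.io.StringReader;

import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class TestAnalyzer {
    protected static PaodingAnalyzer analyzer = new PaodingAnalyzer();

    // Tokenizes the input and returns the terms, each followed by a space.
    protected static String dissect(String input) {
        // Use a fresh builder per call so results do not accumulate across calls.
        StringBuilder sb = new StringBuilder();
        try {
            TokenStream ts = analyzer.tokenStream("", new StringReader(input));
            // addAttribute returns the stream's attribute instance, so fetch it once.
            TermAttribute ta = ts.addAttribute(TermAttribute.class);

            while (ts.incrementToken()) {
                sb.append(ta.term());
                sb.append(" ");
            }
            return sb.toString();
        } catch (Exception e) {
            e.printStackTrace();
            return "error";
        }
    }

    public static void main(String[] args) {
        String content = TestAnalyzer.dissect("关于印发《广东电网公司广州供电局“十一五”科技发展计划》的通知");

        System.out.println(content);
    }
}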

Tokenization result:
"关于 印发 广东 电网 公司 广州 供电 供电局 十一五 25 科技 发展 计划 通知"

Why does an extra "25" appear in the output?

Original issue reported on code.google.com by stt...@163.com on 10 Aug 2010 at 6:14

GoogleCodeExporter commented 9 years ago
Thanks. If that is really what happens, then it is a bug.

Original comment by qieqie.wang on 10 Aug 2010 at 6:33