liangsi03 / ik-analyzer

Automatically exported from code.google.com/p/ik-analyzer

How do I get the segmentation results? #33

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
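For example, suppose a piece of text is segmented into these terms: 用户|套餐|咨询|2M|陈佳|...
How can I retrieve these terms, along with how often each one occurs?

Sorry, I'm a beginner. Thanks in advance for your guidance.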

Original issue reported on code.google.com by lydialmr...@gmail.com on 25 Oct 2011 at 9:15

GoogleCodeExporter commented 8 years ago
// Assumes `analyzer` (e.g. an IKAnalyzer instance) and `logger` are static fields of the enclosing class.
public static void wordSegmentation(String source) throws IOException {
        TokenStream tokenStream = analyzer.tokenStream("tag", new StringReader(
                source));
        // addAttribute returns the existing attribute, or registers it if it is missing
        OffsetAttribute offsetAttribute = tokenStream
                .addAttribute(OffsetAttribute.class);
        CharTermAttribute charTermAttribute = tokenStream
                .addAttribute(CharTermAttribute.class);
        // KeywordAttribute keywordAttribute =
        // tokenStream.addAttribute(KeywordAttribute.class);
        tokenStream.reset(); // newer Lucene versions require this before incrementToken()
        while (tokenStream.incrementToken()) {
            int startOffset = offsetAttribute.startOffset();
            int endOffset = offsetAttribute.endOffset();
            String term = charTermAttribute.toString();
            logger.info(term);
            // logger.info(keywordAttribute.isKeyword());
            // logger.info("valid str " + source.replaceAll(term, "*"));
            logger.info(startOffset + " : " + endOffset);
        }
        tokenStream.end();
        tokenStream.close();

        logger.info("filtered abc:" + source);
    }

Original comment by loujan...@gmail.com on 16 Mar 2012 at 2:32

GoogleCodeExporter commented 8 years ago
Hi, I suggest using the lukeall-3.5.0.jar tool to inspect the indexing results.

Original comment by kingcs2008@gmail.com on 17 Jul 2012 at 8:16

GoogleCodeExporter commented 8 years ago
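/**
     * Prints the segmentation result.
     * @param content the text to segment
     * @throws IOException if reading from the token stream fails
     */
    public void testAnalyzer(String content) throws IOException {
        analyzer = new IKAnalyzer();
        TokenStream tokenStream = analyzer.tokenStream("txt", new StringReader(content));
        // Fetch the term attribute once; the stream updates it in place for each token.
        CharTermAttribute charTermAttribute = tokenStream
                .addAttribute(CharTermAttribute.class);
        tokenStream.reset(); // newer Lucene versions require this before incrementToken()
        while (tokenStream.incrementToken()) {
            System.out.print(charTermAttribute.toString() + " | ");
        }
        tokenStream.end();
        tokenStream.close();
    }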

testIKAnalyzer.testAnalyzer("北京政采科技有限公司");

北京 | 政采 | 科技 | 有限公司 | 有限 | 公司 | 

Original comment by kingcs2008@gmail.com on 23 Jul 2012 at 4:59
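
The original question also asked about term frequency, which the snippets above do not cover. A minimal sketch, assuming the terms have already been collected into a list (for example by adding each charTermAttribute.toString() value inside the loop): the class name TermFrequency, the count helper, and the sample terms are hypothetical and not part of IK Analyzer or Lucene.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TermFrequency {
    // Counts how often each term occurs, keeping the order in which terms were first seen.
    static Map<String, Integer> count(List<String> terms) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String term : terms) {
            // merge inserts 1 for a new term, otherwise adds 1 to the existing count
            freq.merge(term, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // Hypothetical placeholder terms; in practice these would come from the token stream.
        List<String> terms = Arrays.asList("user", "plan", "inquiry", "2M", "user");
        count(terms).forEach((term, n) -> System.out.println(term + " : " + n));
    }
}
```

Collecting the terms first and counting afterwards keeps the counting independent of the Lucene version in use. A LinkedHashMap is one choice here; a TreeMap would sort the terms instead.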

GoogleCodeExporter commented 8 years ago

Original comment by linliang...@gmail.com on 23 Oct 2012 at 9:34