qinwf / jiebaR

Chinese text segmentation with R (documentation updated 🎉: https://qinwenfeng.com/jiebaR/)

Keyword extraction with type = 'keywords': setting bylines = TRUE has no effect #44

Open zheguzai100 opened 8 years ago

zheguzai100 commented 8 years ago

```r
library(jiebaR)
cutter = worker(type = 'keywords',
                user = 'D:/R/soft/library/jiebaRD/dict/usrdic_20161102.utf8',
                stop_word = 'D:/R/soft/library/jiebaRD/dict/stop_words.utf8',
                bylines = TRUE)

# Output of the keyword extraction over the whole corpus:
#  563.482  518.433  208.951  199.566  190.731
#    "360"   "手机" "数据线"   "差评"   "客服"
```

This gives keywords for the document as a whole. How should it be set up to extract keywords per line? Also, with type = 'mix', how do I filter out stop words? Below is the filtering code I tried, but it seems to have no effect. Please help me fix it, thanks.

```r
removewords <- function(target_words, stop_words) {
  # Keep only the words that do not appear in the stop-word list.
  target_words[!(target_words %in% stop_words)]
}
```

```r
stopwd = readLines('D:/R/soft/library/jiebaRD/dict/stop_words.utf8', encoding = 'UTF-8')
class(stopwd)
#> [1] "character"
content3 = sapply(content2, FUN = removewords, stopwd)
```
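On a tiny made-up example the filter itself behaves as expected, so the mismatch is presumably in the real data (for instance, readLines(..., encoding = 'UTF-8') keeps a "\ufeff" BOM prefix on the first stop word if the file starts with one):

```r
# Toy check of removewords() with hypothetical words:
stop_demo  <- c("差评", "客服")
words_demo <- c("360", "手机", "差评")
removewords(words_demo, stop_demo)
#> [1] "360"  "手机"
```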

qinwf commented 8 years ago
```r
> cc = worker("keywords")
> keywords(c("这是一段文本", "是吗"), cc)
Error in keywords(code, jiebar) : Argument 'code' must be an string.
```

Right now keywords only accepts a single piece of text. The plan for the next release is to make simhash and keywords support bylines; the update will land in the next few days.
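Until that release, a workaround is to loop over the lines yourself, since keywords() accepts one string at a time. A minimal sketch, where content2 stands in for your character vector of lines:

```r
library(jiebaR)
kw = worker("keywords")

# One keywords() call per line; the result is a list with one
# named score vector per input line.
content2 = c("这是一段文本", "是吗")
keywords_by_line = lapply(content2, function(x) keywords(x, kw))
```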

For type = 'mix', just set worker(stop_word = 'some.file') with the stop_word path pointing to a non-default stop-word file. Segmentation does not load the default stop-word path jiebaR::STOPPATH; if stop_word is set to anything other than that default path, the segmentation step will load the custom stop words.
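A sketch of that setup (the path below is a hypothetical user copy; note that the stop_word path in the question above looks like the installed jiebaRD default, i.e. jiebaR::STOPPATH itself, which would explain why it was skipped):

```r
library(jiebaR)

# Copy the bundled list (or supply your own file) to a NON-default path,
# because a stop_word equal to jiebaR::STOPPATH is not loaded.
file.copy(jiebaR::STOPPATH, 'D:/R/my_stop_words.utf8')  # hypothetical path

mix_cutter = worker(type = 'mix', stop_word = 'D:/R/my_stop_words.utf8')
segment("这是一段文本", mix_cutter)
```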