hankcs / HanLP

Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification
https://hanlp.hankcs.com/en/
Apache License 2.0

English text classification is very slow. What can I do? #1237

Closed pangwh closed 4 years ago

pangwh commented 5 years ago

Notes

Please confirm the following:

Version

The current latest version is: The version I am using is:

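My question

I downloaded a text classification corpus that I found through Google: about 8,000 articles in more than ten categories. I have trained a model on it, but classification is extremely slow, taking roughly two minutes to return a result, whereas the Chinese model takes about two seconds. Why is the gap between English and Chinese so large, and where is the problem? From a quick look, most of the time is spent loading the model: NaiveBayesModel model = (NaiveBayesModel) IOUtil.readObjectFrom(MODEL_PATH); Why does this happen? I would appreciate help analyzing it.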

My corpus looks roughly like this:

From: admiral@jhunix.hcf.jhu.edu (Steve C Liu)
Subject: Re: Bring on the O's
Organization: Homewood Academic Computing, Johns Hopkins University, Baltimore, Md, USA
Lines: 39
Distribution: world
Expires: 5/9/95
NNTP-Posting-Host: jhunix.hcf.jhu.edu
Summary: Root, root, root for the Orioles...

I heard that Eli is selling the team to a group in Cinninati. This would help so that the O's could make some real free agent signings in the offseason. Training Camp reports that everything is pretty positive right now. The backup catcher postion will be a showdown between Tackett and Parent although I would prefer Parent. #1 Draft Pick Jeff Hammonds may be coming up faster in the O's hierarchy of the minors faster than expected. Mike Flanagan is trying for another comeback. Big Ben is being defended by coaches saying that while the homers given up were an awful lot, most came in the beginning of the season and he really improved the second half. This may be Ben's year. I feel that while this may not be Mussina's Cy Young year, he will be able to pitch the entire season without periods of fatigue like last year around August. I really hope Baines can provide the RF support the O's need. Orsulak was decent but I had hoped that Chito Martinez could learn defense better and play like he did in '91. The O's right now don't have many left-handed hitters. Anderson proving last year was no fluke and Cal's return to his averages would be big plusses in a drive for the pennant. The rotation should be Sutcliffe, Mussina, McDonald, Rhodes, ?????. Olson is an interesting case. Will he strike out the side or load the bases and then get

Reproducing the problem

Steps

  1. First……
  2. Then……
  3. Next……

Trigger code

NaiveBayesModel model = (NaiveBayesModel) IOUtil.readObjectFrom(MODEL_PATH);
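To check whether loading or classifying dominates the two minutes, each step can be timed separately. The following is a minimal sketch using only the Java standard library; `StepTimer` and `Timed` are hypothetical helper names introduced here for illustration and are not part of HanLP.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not part of HanLP): measures how long one step takes.
final class StepTimer {

    // Holds the step's result together with its elapsed wall-clock time.
    static final class Timed<T> {
        public final T value;
        public final long millis;

        Timed(T value, long millis) {
            this.value = value;
            this.millis = millis;
        }
    }

    private StepTimer() {}

    // Runs the step once, prints the elapsed time, and returns both.
    static <T> Timed<T> time(String label, Callable<T> step) throws Exception {
        long start = System.nanoTime();
        T value = step.call();
        long millis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println(label + " took " + millis + " ms");
        return new Timed<>(value, millis);
    }
}
```

In the setting of this issue, one would wrap the `IOUtil.readObjectFrom(MODEL_PATH)` call in one `StepTimer.time(...)` invocation and the classification call in another, then compare the two printed durations.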

Expected output

Expected output

Actual output

Actual output

Other information

hankcs commented 5 years ago

There are many features, so the model is large and the I/O is slow.
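If the cost really is the one-time read of a large model file, a common mitigation is to deserialize it once per process and reuse the object for every later classification. The sketch below assumes a model saved with plain Java serialization and uses only the standard library; `CachedModelLoader` is a hypothetical helper, not a HanLP class, and it may not match what `IOUtil.readObjectFrom` does internally.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper (not part of HanLP): loads each serialized object
// at most once per path and hands back the cached instance afterwards.
final class CachedModelLoader {

    private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

    private CachedModelLoader() {}

    // Returns the cached object for this path, reading the file on first use only.
    static Object load(String path) {
        return CACHE.computeIfAbsent(path, CachedModelLoader::readObject);
    }

    // Reads through a 64 KiB buffer so a large file is not fetched in tiny chunks.
    private static Object readObject(String path) {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(Paths.get(path)), 1 << 16))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Cannot load model from " + path, e);
        }
    }
}
```

This does not make the first load faster. It only avoids paying the same cost again on each request, which matters when the model is loaded inside the classification call path.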

hankcs commented 4 years ago

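Thank you for supporting HanLP 1.x. I have always regretted not having time to reply to every issue, and I hope your problem has been solved. Alternatively, you may find the answer in the book 《自然语言处理入门》 (Introduction to Natural Language Processing).

Time flies, and HanLP 1.x thanks you for your company along the way. On December 31, 2019 (Eastern Standard Time), I released the last HanLP 1.x version of the past decade, codenamed "The Last Samurai". From now on, the 1.x branch will receive stability maintenance but will not be the focus of future development.

On the occasion of New Year 2020, I am pleased to announce the release of HanLP 2.0. Its vision is cutting-edge NLP technology for the next decade. To that end, HanLP 2.0 implements state-of-the-art deep learning models in TensorFlow 2.0, supports downstream NLP tasks through a carefully designed framework, and achieves state-of-the-art accuracy on large-scale corpora. As the first alpha release, HanLP 2.0.0a0 supports tokenization, part-of-speech tagging, named entity recognition, dependency parsing, semantic dependency parsing, and text classification. These features are not limited to Chinese; they are designed for all human languages. HanLP 2.0 provides many pretrained models that end users can deploy with just two lines of code, so putting deep learning into production is no longer difficult. For more details, watch the HanLP 2.0 introduction video or join the discussion on the forum.

Looking ahead, HanLP 2.0 will carry forward the efficient, pragmatic style inherited from the 1.x era while pushing toward frontier research, serving as an amphibious warship for both industry and academia. I look forward to your continued advice. Thank you.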