hiDaDeng / cntext

A text analysis package supporting word counting, readability, document similarity, sentiment analysis, and other text analysis methods. Chinese text sentiment analysis.
MIT License
271 stars 28 forks

Error when using the DUTIR dictionary #5

Open AirFin opened 2 years ago

AirFin commented 2 years ago

Running this code:

import cntext as ct

text = '我今天得奖了,很高兴,我要将快乐分享大家。'

ct.sentiment(text=text,
             diction=ct.load_pkl_dict('DUTIR.pkl')['DUTIR'],
             lang='chinese')

Error:

Traceback (most recent call last):
  File "d:\PythonProject\test\test_cntext.py", line 5, in <module>
    ct.sentiment(text=text,
  File "D:\Miniconda3\envs\py38\lib\site-packages\cntext\stats.py", line 159, in sentiment
    jieba.add_word(w)
  File "D:\Miniconda3\envs\py38\lib\site-packages\jieba\__init__.py", line 426, in add_word
    word = strdecode(word)
  File "D:\Miniconda3\envs\py38\lib\site-packages\jieba\_compat.py", line 79, in strdecode
    sentence = sentence.decode('utf-8')
AttributeError: 'int' object has no attribute 'decode'

If I use a different dictionary instead of DUTIR, it runs fine. For example:

import cntext as ct

text = '我今天得奖了,很高兴,我要将快乐分享大家。'

ct.sentiment(text=text,
             diction=ct.load_pkl_dict('HOWNET.pkl')['HOWNET'],
             lang='chinese')

Result:

{'deny_num': 0,
 'ish_num': 0,
 'more_num': 0,
 'neg_num': 0,
 'pos_num': 3,
 'very_num': 1,
 'stopword_num': 8,
 'word_num': 14,
 'sentence_num': 1}
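
The traceback points at jieba.add_word(w) receiving an int, which suggests one of the word lists inside the DUTIR dictionary contains a non-string entry. A quick check along these lines (a sketch, not part of the original report) can locate such entries:

import cntext as ct

# Scan every emotion category of the DUTIR dictionary for entries that
# are not strings; jieba.add_word() expects str, so a non-str entry
# would raise the AttributeError shown in the traceback above.
dutir = ct.load_pkl_dict('DUTIR.pkl')['DUTIR']
for category, words in dutir.items():
    bad = [w for w in words if not isinstance(w, str)]
    if bad:
        print(category, bad)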
hiDaDeng commented 2 years ago

Your text data may contain fields that are purely numeric or missing values.
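
If the text comes from a dataframe column, coercing it to str and filling missing values first rules that class of error out. A sketch, assuming pandas, a hypothetical input file data.csv, and a hypothetical column named text:

import pandas as pd
import cntext as ct

hownet = ct.load_pkl_dict('HOWNET.pkl')['HOWNET']

df = pd.read_csv('data.csv')  # hypothetical input file
# Fill missing values and force everything to str, so ct.sentiment
# never receives numbers or NaN.
df['text'] = df['text'].fillna('').astype(str)
df['senti'] = df['text'].apply(
    lambda t: ct.sentiment(text=t, diction=hownet, lang='chinese'))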


hiDaDeng commented 2 years ago

Sorry, I didn't read the issue carefully. If switching from DUTIR to HOWNET works, then it should be a dictionary problem.

For a dictionary problem, first make sure the imported dictionary follows the unified dict format. The cntext repository includes a small example of the standard dictionary format.
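
A minimal sketch of that layout, inferred from the DUTIR/HOWNET output shown in this thread (the category names and words below are made up for illustration): each key is a category, each value is a list of words, and every word must be a str because the words are fed to jieba.add_word.

import cntext as ct

# Hypothetical custom dictionary following the same layout as the
# built-in pkl dictionaries: category name -> list of words (all str).
my_diction = {
    'pos': ['高兴', '快乐', '得奖'],
    'neg': ['难过', '失望'],
}

text = '我今天得奖了,很高兴,我要将快乐分享大家。'
print(ct.sentiment(text=text, diction=my_diction, lang='chinese'))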


AirFin commented 2 years ago


Hello, and thanks for your reply. I modified things according to your suggestion, but the error persists.

Python version I'm using: 3.8.5

My complete code is as follows:

import cntext as ct 
d:\Miniconda3\envs\py38\lib\site-packages\numpy\_distributor_init.py:30: UserWarning: loaded more than 1 DLL from .libs:
d:\Miniconda3\envs\py38\lib\site-packages\numpy\.libs\libopenblas.4SP5SUA7CBGXUEOC35YP2ASOICYYEQZZ.gfortran-win_amd64.dll
d:\Miniconda3\envs\py38\lib\site-packages\numpy\.libs\libopenblas.EL2C6PLE4ZYW3ECEVIV3OXXGRN2NRFM2.gfortran-win_amd64.dll
  warnings.warn("loaded more than 1 DLL from .libs:"
d:\Miniconda3\envs\py38\lib\site-packages\gensim\similarities\__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. `pip install python-Levenshtein`) to suppress this warning.
  warnings.warn(msg)
print(ct.__version__)
# Load the pkl dictionary file
ct.load_pkl_dict('DUTIR.pkl')
1.7.4
{'DUTIR': {'乐': ['急若流星',
   '最后一根稻草',
   '慌乱',
   '张皇',
   '心如悬旌',
   '鞋里长草-慌了脚',
   '紧急',
   '五色无主',
   '脚忙手乱',
   '仓卒应战',
   '缓不济急',
   '忡忡',
   '风声鹤唳',
   '心慌意乱',
   '心虚',
   '体力不支',
   '窘急',
   '惊慌失措',
   '惊慌',
   '发急',
   '心急火燎',
   '芒刺在背',
   '着慌',
   '心切',
   '手忙脚乱',
...
   '恰巧',
   '意出望外',
   '怨不得']},
 'Desc': '大连理工大学情感本体库,细粒度情感词典。含七大类情绪,依次是哀, 好, 惊, 惧, 乐, 怒, 恶',
 'Referer': '徐琳宏,林鸿飞,潘宇,等.情感词汇本体的构造[J]. 情报学报, 2008, 27(2): 180-185.'}
text = '我今天得奖了,很高兴,我要将快乐分享大家。'

ct.sentiment(text=text,
             diction=ct.load_pkl_dict('DUTIR.pkl')['DUTIR'],
             lang='chinese')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_9260/3488132061.py in <module>
      1 text = '我今天得奖了,很高兴,我要将快乐分享大家。'
      2 
----> 3 ct.sentiment(text=text,
      4              diction=ct.load_pkl_dict('DUTIR.pkl')['DUTIR'],
      5              lang='chinese')

d:\Miniconda3\envs\py38\lib\site-packages\cntext\stats.py in sentiment(text, diction, lang)
    157             senti_category_words = diction[senti_category]
    158             for w in senti_category_words:
--> 159                 jieba.add_word(w)
    160 
    161         sentence_num = len(cn_seg_sent(text))

d:\Miniconda3\envs\py38\lib\site-packages\jieba\__init__.py in add_word(self, word, freq, tag)
    424         """
    425         self.check_initialized()
--> 426         word = strdecode(word)
    427         freq = int(freq) if freq is not None else self.suggest_freq(word, False)
    428         self.FREQ[word] = freq

d:\Miniconda3\envs\py38\lib\site-packages\jieba\_compat.py in strdecode(sentence)
     77     if not isinstance(sentence, text_type):
...
---> 79             sentence = sentence.decode('utf-8')
     80         except UnicodeDecodeError:
     81             sentence = sentence.decode('gbk', 'ignore')

AttributeError: 'int' object has no attribute 'decode'
AirFin commented 2 years ago


I also created a new Python 3.7.9 environment and ran the same code, but got the same error.
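
Until the packaged dictionary itself is fixed, one possible workaround (a sketch, not an official fix) is to cast every DUTIR entry to str before passing the dictionary to ct.sentiment:

import cntext as ct

dutir = ct.load_pkl_dict('DUTIR.pkl')['DUTIR']
# Coerce every entry to str so jieba.add_word no longer receives ints.
dutir_clean = {cat: [str(w) for w in words] for cat, words in dutir.items()}

text = '我今天得奖了,很高兴,我要将快乐分享大家。'
print(ct.sentiment(text=text, diction=dutir_clean, lang='chinese'))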

hiDaDeng commented 2 years ago

Update to 1.7.5:

pip3 install cntext==1.7.6
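
After reinstalling, a quick check that the new version is actually picked up:

import cntext as ct
print(ct.__version__)  # should report the upgraded version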

AirFin commented 2 years ago

Update to 1.7.5:

pip3 install cntext==1.7.6

That fixed it. Thank you very much!