hyunwoongko / kss

KSS: Korean String processing Suite
BSD 3-Clause "New" or "Revised" License

AssertionError raised when using split_sentences #49

Closed leech2193 closed 1 year ago

leech2193 commented 2 years ago

Hello.

I get an AssertionError when I call split_sentences. The error carries no message, so I can't tell what is going wrong.

import kss

text = '딥 러닝 자연어 처리가 재미있기는 합니다. 그런데 문제는 영어보다 한국어로 할 때 너무 어렵습니다. 이제 해보면 알걸요?'
print('한국어 문장 토큰화 :', kss.split_sentences('테스트 해봅시다.'))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [24], in <cell line: 4>()
      1 import kss
      3 text = '딥 러닝 자연어 처리가 재미있기는 합니다. 그런데 문제는 영어보다 한국어로 할 때 너무 어렵습니다. 이제 해보면 알걸요?'
----> 4 print('한국어 문장 토큰화 :', kss.split_sentences('테스트 해봅시다.'))

File ~/miniforge3/envs/tens2/lib/python3.9/site-packages/kss/kss.py:179, in split_sentences(text, use_heuristic, use_quotes_brackets_processing, max_recover_step, max_recover_length, backend, num_workers, disable_gc, disable_mp_post_process)
    167     results += pool.map(
    168         partial(
    169             _split_sentences,
   (...)
    176         mp_input_texts,
    177     )
    178 else:
--> 179     results += [
    180         _split_sentences(
    181             text=t,
    182             use_heuristic=use_heuristic,
    183             use_quotes_brackets_processing=use_quotes_brackets_processing,
    184             max_recover_step=max_recover_step,
    185             max_recover_length=max_recover_length,
    186             backend=backend,
    187         )
    188         for t in mp_input_texts
    189     ]
    191 mp_output_final = []
    192 mp_temp.clear()

File ~/miniforge3/envs/tens2/lib/python3.9/site-packages/kss/kss.py:180, in <listcomp>(.0)
    167     results += pool.map(
    168         partial(
    169             _split_sentences,
   (...)
    176         mp_input_texts,
    177     )
    178 else:
    179     results += [
--> 180         _split_sentences(
    181             text=t,
    182             use_heuristic=use_heuristic,
    183             use_quotes_brackets_processing=use_quotes_brackets_processing,
    184             max_recover_step=max_recover_step,
    185             max_recover_length=max_recover_length,
    186             backend=backend,
    187         )
    188         for t in mp_input_texts
    189     ]
    191 mp_output_final = []
    192 mp_temp.clear()

File ~/miniforge3/envs/tens2/lib/python3.9/site-packages/kss/kss.py:292, in _split_sentences(text, use_heuristic, use_quotes_brackets_processing, max_recover_step, max_recover_length, backend, recover_step)
    289         text = text.replace(s, f"\u200b{s}\u200b")
    291 if use_morpheme:
--> 292     eojeols = _morph.pos(text=text, backend=backend)
    293 else:
    294     eojeols = [Eojeol(t, "EF+ETN") for t in text]

File ~/miniforge3/envs/tens2/lib/python3.9/site-packages/kss/morph.py:71, in MorphExtractor.pos(self, text, backend)
     61         except ImportError:
     62             raise ImportError(
     63                 "\n"
     64                 "You must install `python-mecab-kor` if you want to use `mecab` backend.\n"
     65                 "Please install using `pip install python-mecab-kor`.\n"
     66                 "Refer https://github.com/hyuwoongko/python-mecab-kor for more details.\n"
     67             )
     69     return [
     70         Eojeol(eojeol, pos[1])
---> 71         for pos in self.mecab.pos(text)
     72         for eojeol in pos[0]
     73     ]
     74 else:
     75     raise AttributeError(
     76         "Wrong backend ! currently, we only support `pynori`, `mecab`, `none` backend."
     77     )

File ~/miniforge3/envs/tens2/lib/python3.9/site-packages/kss/morph.py:102, in MecabTokenizer.pos(self, text, preserve_whitespace)
     99 text_ptr = 0
    100 results = list()
--> 102 for unit in self.mecab.pos(text):
    103     token = unit[0]
    104     pos = unit[1]

File ~/miniforge3/envs/tens2/lib/python3.9/site-packages/mecab/mecab.py:67, in MeCab.pos(self, sentence)
     65 def pos(self, sentence):
     66     return [
---> 67         (surface, feature.pos) for surface, feature in self.parse(sentence)
     68     ]

File ~/miniforge3/envs/tens2/lib/python3.9/site-packages/mecab/mecab.py:60, in MeCab.parse(self, sentence)
     57 if not self.tagger.parse(lattice):
     58     raise MeCabError(self.tagger.what())
---> 60 return [
     61     (node.surface, _extract_feature(node))
     62     for node in lattice
     63 ]

File ~/miniforge3/envs/tens2/lib/python3.9/site-packages/mecab/mecab.py:61, in <listcomp>(.0)
     57 if not self.tagger.parse(lattice):
     58     raise MeCabError(self.tagger.what())
     60 return [
---> 61     (node.surface, _extract_feature(node))
     62     for node in lattice
     63 ]

File ~/miniforge3/envs/tens2/lib/python3.9/site-packages/mecab/mecab.py:33, in _extract_feature(node)
     25 def _extract_feature(node):
     26     # Reference:
     27     # - http://taku910.github.io/mecab/learn.html
   (...)
     30     
     31     # feature = <pos>,<semantic>,<has_jongseong>,<reading>,<type>,<start_pos>,<end_pos>,<expression>
     32     values = node.feature.split(',')
---> 33     assert len(values) == 8
     35     values = [value if value != '*' else None for value in values]
     36     feature = dict(zip(Feature._fields, values))

AssertionError: 
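
The frame that fails is the bare `assert len(values) == 8` in `_extract_feature`, which is why the `AssertionError:` has no message. As a rough sketch of what that check expects (the helper `count_fields` and both sample feature strings are hypothetical illustrations, not values captured from this environment), any feature string whose comma-separated field count differs from 8 would trip it:

```python
# Field layout copied from the comment in mecab.py shown in the traceback:
# <pos>,<semantic>,<has_jongseong>,<reading>,<type>,<start_pos>,<end_pos>,<expression>
EXPECTED_FIELDS = (
    "pos", "semantic", "has_jongseong", "reading",
    "type", "start_pos", "end_pos", "expression",
)


def count_fields(feature: str) -> int:
    """Hypothetical helper: count fields the same way _extract_feature splits them."""
    return len(feature.split(","))


# Hypothetical feature string with the 8 fields the assertion expects.
ok_feature = "NNG,*,T,테스트,*,*,*,*"
# Hypothetical feature string with only 7 fields, e.g. from a differently laid out dictionary.
bad_feature = "NNG,*,*,*,*,*,*"

for feature in (ok_feature, bad_feature):
    n = count_fields(feature)
    verdict = "passes" if n == len(EXPECTED_FIELDS) else "would raise AssertionError"
    print(f"{n} fields: {verdict}")
```

Whether a mismatched dictionary layout is what actually happened here is an assumption; the traceback only shows that the field count was not 8.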

hyunwoongko commented 2 years ago

How did you install mecab? On my side it works fine, as shown below.

image

Judging from the path ~/miniforge3/envs/tens2/lib/python3.9/site-packages/mecab/mecab.py, it looks like you installed a package called mecab and are using that. kss uses python-mecab-kor. Could you install the python-mecab-kor package and try again?
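
One way to check this before reinstalling is to look at where the `mecab` module is imported from and whether a distribution named `python-mecab-kor` is present. The following is a minimal sketch using only the standard library; the function name `describe_mecab_install` is made up for illustration, and it assumes the distribution name matches the `pip install python-mecab-kor` hint printed in the traceback. Note that the import name `mecab` alone may not identify which distribution provided it.

```python
import importlib.util
from importlib import metadata


def describe_mecab_install() -> dict:
    """Hypothetical helper: report the origin of the `mecab` module and
    the installed version of the `python-mecab-kor` distribution, if any."""
    # find_spec locates a top-level module without importing it.
    spec = importlib.util.find_spec("mecab")
    module_path = spec.origin if spec is not None else None

    try:
        kor_version = metadata.version("python-mecab-kor")
    except metadata.PackageNotFoundError:
        kor_version = None

    return {
        "mecab_module_path": module_path,
        "python_mecab_kor_version": kor_version,
    }


info = describe_mecab_install()
print("mecab module path       :", info["mecab_module_path"])
print("python-mecab-kor version:", info["python_mecab_kor_version"])
```

If the version comes back as `None` while a module path is still reported, the `mecab` module was presumably installed by some other distribution, which would be consistent with the suggestion above.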

hyunwoongko commented 1 year ago

Closing due to inactivity. Please reopen if needed.