ikegami-yukino / pymlask

Emotion analyzer for Japanese text

Fix the count of emotional words #10

Closed · brunotoshio closed 5 years ago

brunotoshio commented 5 years ago

Problem: The method _find_emotion counted emotional words by searching for each dictionary word in the entire text. As a result, an emotional word was counted not only for itself but also for every dictionary word contained within it. Example:

from mlask import MLAsk

ma = MLAsk()
ma.analyze('嫌いではない')     # Incorrect => {'iya': ['嫌'], 'yorokobi': ['嫌い*CVS'], 'suki': ['嫌い*CVS']}
ma.analyze('嫌ではない')      # Correct => {'yorokobi': ['嫌*CVS'], 'suki': ['嫌*CVS']}

In this example, the text 嫌いではない contains only the emotional word 嫌い; however, both 嫌い and 嫌 are counted, because 嫌 is also contained within the word 嫌い.
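The over-counting can be reproduced with a minimal sketch (not pymlask's actual code): a plain substring check matches both dictionary entries against the raw text, even though only one of them is a real token.

```python
# Minimal sketch: naive substring search over-counts.
# The shorter entry '嫌' matches inside the text even though
# the actual token there is '嫌い'.
text = '嫌いではない'
entries = ['嫌い', '嫌']
matches = [e for e in entries if e in text]
print(matches)  # → ['嫌い', '嫌'] — both entries match the raw text
```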

Resolution: The method _find_emotion should match each emotional word from the dictionary against each lemma found in the text, i.e., compare word by word instead of searching for the word anywhere inside the text. The expected result would be:

from mlask import MLAsk

ma = MLAsk()
ma.analyze('嫌いではない')     # Correct => {'yorokobi': ['嫌い*CVS'], 'suki': ['嫌い*CVS']}
ma.analyze('嫌ではない')      # Correct => {'yorokobi': ['嫌*CVS'], 'suki': ['嫌*CVS']}
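The proposed fix can be sketched the same way (again, a simplification rather than the PR's actual code): compare each dictionary entry against the list of lemmas produced by the tokenizer, so only exact token matches count.

```python
# Minimal sketch: match entries against individual lemmas,
# not against the raw text.
lemmas = ['嫌い', 'で', 'は', 'ない']   # tokenized form of 嫌いではない
entries = ['嫌い', '嫌']
matches = [e for e in entries if e in lemmas]
print(matches)  # → ['嫌い'] — only the exact lemma matches
```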
coveralls commented 5 years ago

Coverage Status

Coverage increased (+0.7%) to 94.545% when pulling ab99e1e771c90a7abbc2d83d47a8a45ce07ee454 on brunotoshio:fix-find-emotion into 597840ddb993c96d6634517cfa22183abe4f335a on ikegami-yukino:master.

ikegami-yukino commented 5 years ago

It looks good to me, but some entries in the emotive expression dictionary consist of two or more words.

E.g. https://github.com/ikegami-yukino/pymlask/blob/a0dcdd352667d69704e06b2df26b75d51f19d7fe/mlask/emotions/yorokobi_uncoded.txt#L244

$ mecab
気持ちがよい
気持ち 名詞,一般,*,*,*,*,気持ち,キモチ,キモチ
が   助詞,格助詞,一般,*,*,*,が,ガ,ガ
よい  形容詞,自立,*,*,形容詞・アウオ段,基本形,よい,ヨイ,ヨイ
EOS

current version:

>>> import mlask
>>> mlask.MLAsk().analyze('気持ちがよい')
{'text': '気持ちがよい',
 'emotion': defaultdict(list, {'yorokobi': ['気持ちがよい']}),
 'orientation': 'POSITIVE',
 'activation': 'ACTIVE',
 'emoticon': None,
 'intension': 0,
 'intensifier': {},
 'representative': ('yorokobi', ['気持ちがよい'])}

Your PR:

>>> import mlask
>>> mlask.MLAsk().analyze('気持ちがよい')
{'text': '気持ちがよい', 'emotion': None}
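The regression reduces to a one-line check (a hypothetical simplification, not the PR's code): a multi-word dictionary entry can never equal any single lemma, so per-lemma matching silently drops it.

```python
# Minimal sketch: per-lemma matching misses multi-word entries.
lemmas = ['気持ち', 'が', 'よい']   # MeCab tokenization of 気持ちがよい
entry = '気持ちがよい'              # dictionary entry spanning three lemmas
print(entry in lemmas)  # → False, so the entry is never counted
```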
brunotoshio commented 5 years ago

I see, I hadn't noticed those expressions. So the dictionary contains both single words and multi-word expressions, which means it is not possible to match each entry against a single word. The idea, then, would be to match against the text, but matches must coincide with the start and end of words. A possible solution is to build a set of all substrings of the text, using each word as a unit, and compare every emotional word/expression against that set. For the text 気持ちがよい, each entry would be compared with the set {'気持ち', '気持ちが', '気持ちがよい', 'が', 'がよい', 'よい'}. We can add a constraint to improve performance, such as limiting the number of words used to build each substring. What do you think?
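The substring-set idea above can be sketched as follows (a hypothetical helper, `contiguous_spans`, not pymlask's implementation): join every contiguous run of lemmas, capped at a maximum span length.

```python
def contiguous_spans(words, max_len=7):
    """Return the set of all contiguous word-span substrings,
    where each span uses at most max_len words."""
    spans = set()
    for i in range(len(words)):
        # j is the exclusive end index; span length is capped at max_len
        for j in range(i + 1, min(i + 1 + max_len, len(words) + 1)):
            spans.add(''.join(words[i:j]))
    return spans

print(contiguous_spans(['気持ち', 'が', 'よい']))
# → {'気持ち', '気持ちが', '気持ちがよい', 'が', 'がよい', 'よい'}
```

Dictionary entries can then be tested for membership in this set, so matches always begin and end on word boundaries.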

ikegami-yukino commented 5 years ago

> We can put some constraint to increase performance like limiting the number of words to build each substring. What do you think?

OK. A limit of 7 words is best, according to the following code:

import glob

import MeCab

# Reuse one tagger; '-Owakati' outputs space-separated surface forms.
tagger = MeCab.Tagger('-Owakati')

def count_words(entry):
    """Return the number of words MeCab finds in a dictionary entry."""
    return len(tagger.parse(entry).rstrip().split(' '))

def compute_entry_length():
    """Yield the word count of every entry in the emotion dictionaries."""
    for path in glob.glob('mlask/emotions/*.txt'):
        with open(path) as f:
            for entry in f:
                yield count_words(entry.rstrip())

print(max(compute_entry_length()))
brunotoshio commented 5 years ago

I changed the constraint to 7 words.

ikegami-yukino commented 5 years ago

Thank you 👍 I merged it.