ikegami-yukino / pymlask

Emotion analyzer for Japanese text

Counting different emotions #9

Closed brunotoshio closed 5 years ago

brunotoshio commented 5 years ago

Hi,

I'm not sure if this is a bug, but in some cases emotion words are being matched even though they do not appear in the text. For example:

from mlask import MLAsk

ma = MLAsk()
ma.analyze('嫌いではない')     # => {'iya': ['嫌'], 'yorokobi': ['嫌い*CVS'], 'suki': ['嫌い*CVS']}
ma.analyze('嫌ではない')      # => {'yorokobi': ['嫌*CVS'], 'suki': ['嫌*CVS']}

In the first case, the text only contains the emotion word '嫌い'; however, because '嫌' shares the same stem, it is counted as well.

In the method _find_emotion (line 233), each emotion word is compared against the entire text. Comparing against each word (as tokenized by MeCab) might prevent this behavior.
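To make the difference concrete, here is a minimal, self-contained sketch (not pymlask's actual code); the token list is a hypothetical stand-in for MeCab output:

```python
# Why substring matching over-counts, and how token-level matching avoids it.
text = '嫌いではない'
tokens = ['嫌い', 'で', 'は', 'ない']  # roughly what MeCab would produce

emotion_words = ['嫌', '嫌い']

# Substring matching: '嫌' is found inside '嫌い', so both words match.
substring_hits = [w for w in emotion_words if w in text]

# Token-level matching: only the exact token '嫌い' matches.
token_hits = [w for w in emotion_words if w in tokens]

print(substring_hits)  # ['嫌', '嫌い']
print(token_hits)      # ['嫌い']
```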

ikegami-yukino commented 5 years ago

@brunotoshio Thank you for reporting this. Using MeCab for word tokenization is a good idea. However, there are many negation expressions for Contextual Valence Shifters (CVS), which are matched by multi-word regular expressions over the raw text. Pattern matching at different chunk levels (single words and multi-word sequences) would be difficult and complex. For this reason, I currently recommend using the representative value of the result.
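To illustrate the CVS idea described above (a hedged sketch, not pymlask's actual patterns or dictionary): a negation expression following an emotion word shifts its valence, and because such patterns span multiple words, they are matched with regular expressions over the raw text rather than over single tokens.

```python
import re

# Hypothetical negation pattern: an optional connective followed by ない.
NEGATION = re.compile(r'(では|じゃ|く|それほど)?ない')

def is_negated(text: str, emotion_word: str) -> bool:
    """Return True if the emotion word is immediately followed by a negation."""
    idx = text.find(emotion_word)
    if idx == -1:
        return False
    tail = text[idx + len(emotion_word):]
    return NEGATION.match(tail) is not None

print(is_negated('嫌いではない', '嫌い'))  # True: valence is shifted
print(is_negated('嫌いだ', '嫌い'))        # False
```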

If you have an idea for resolving this problem, please send a pull request.

brunotoshio commented 5 years ago

Hi @ikegami-yukino, I created PR #10 to solve this issue. I hope it helps! By the way, I really like the approach to handling contextual valence shifters; however, I think it should be evaluated per sentence rather than over the entire text. How about splitting the text into smaller sentences using the period as a separator, then evaluating each sentence independently? It won't be 100% perfect, but I think it would improve the current implementation a bit.
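A minimal sketch of the sentence-splitting idea, using the Japanese full stop '。' as a separator (real pre-processing would also need to handle '！', '？', quotes, etc.); each resulting sentence could then be passed to analyze() independently:

```python
def split_sentences(text: str) -> list:
    """Split on the Japanese full stop, keeping the delimiter on each sentence."""
    return [s + '。' for s in text.split('。') if s]

text = '嫌いではない。でも好きでもない。'
print(split_sentences(text))  # ['嫌いではない。', 'でも好きでもない。']
```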

ikegami-yukino commented 5 years ago

How about splitting the text into smaller sentences by using the 'period' as a separator, then evaluating each sentence independently?

I think it is out of scope for pymlask 🤔 This package should only provide information about the emotion of a given text. Pre-processing is better entrusted to another package (e.g. tiny_tokenizer).

brunotoshio commented 5 years ago

Yes, you're right. I thought a CVS would be applied to the entire text, but it only applies to the emotion word in question, so no problem 😆