metalcorebear / NRCLex

An affect generator based on TextBlob and the NRC affect lexicon. Note that lexicon license is for research purposes only.
MIT License

Whole new features #18

Open stormbeforesunsetbee opened 1 year ago

stormbeforesunsetbee commented 1 year ago

What's new:

Example

Text: she's always arguing!

New:

{'fear': 0.0,
 'anger': 0.3333333333333333,
 'anticipation': 0.0,
 'trust': 0.3333333333333333,
 'surprise': 0.0,
 'positive': 0.0,
 'negative': 0.3333333333333333,
 'sadness': 0.0,
 'disgust': 0.0,
 'joy': 0.0}

Previous:

{'fear': 0.0,
 'anger': 0.0,
 'anticip': 0.0,
 'trust': 0.0,
 'surprise': 0.0,
 'positive': 0.0,
 'negative': 0.0,
 'sadness': 0.0,
 'disgust': 0.0,
 'joy': 0.0}

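To see all of the possible expansions:

# use the old version

from nltk.corpus import wordnet  # requires the NLTK WordNet corpus


def expand_synonyms(lex):
    # Collect WordNet synonyms that are not already in the lexicon;
    # each synonym inherits the entry of the word it was derived from.
    lex_ = {}

    for i in lex:
        for j in wordnet.synsets(i):
            for k in j.lemmas():
                word = k.name().replace('_', ' ')

                if word not in lex:
                    lex_[word] = lex[i]

    return lex_

# `nrc` is assumed to be an existing NRCLex object
expand_synonyms(nrc.lexicon)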
honestus commented 2 weeks ago

Hello, I've just found a potential bug in this new feature, in the method "words_and_phrases": it should continue (i.e. skip to the next iteration of the loop) whenever it matches a phrase made of multiple words. As it is right now, it updates the index of the current word but keeps looking for two-word phrases and single words within the same iteration, and it automatically adds the single token if there's no match in the lexicon.

I'll write an example to be as clear as possible. Suppose we have:

lexicon = ['this is word', 'word']
curr_tokens = ['this', 'is', 'word', 'this', 'is', 'word']

When the "words_and_phrases" method generates phrases, it matches the first "this is word" (i.e. curr_span=3) and correctly increases the current index by 3, but it then keeps looking for matches with curr_span<3 (i.e. curr_span=2 and curr_span=1) instead of resetting the span (by exiting the current iteration).
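To make the walk-through concrete, here is a small standalone sketch. It is a hypothetical reconstruction based only on the description above, not code copied from NRCLex; the function name `match_without_continue` and its parameters are made up for illustration.

```python
def match_without_continue(tokens, lexicon):
    """Hypothetical reconstruction of the reported behavior (not NRCLex code)."""
    result = []
    i = 0
    n_words = len(tokens)

    while i < n_words:
        # Try a three-word phrase first.
        if i + 2 < n_words:
            phrase = ' '.join(tokens[i:i + 3])
            if phrase in lexicon:
                result.append(phrase)
                i += 3  # the index advances, but execution falls through

        # Reached even after a three-word match, with i already advanced.
        if i + 1 < n_words:
            phrase = ' '.join(tokens[i:i + 2])
            if phrase in lexicon:
                result.append(phrase)
                i += 2

        # Always reached: the token at i is appended without a lexicon check.
        if i < n_words:
            result.append(tokens[i])
            i += 1

    return result


lexicon = {'this is word', 'word'}
curr_tokens = ['this', 'is', 'word', 'this', 'is', 'word']
print(match_without_continue(curr_tokens, lexicon))
```

Under these assumptions the call prints ['this is word', 'this', 'is', 'word']: the second occurrence of the phrase is split into single tokens because its first word is consumed in the same iteration that matched the first phrase.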

The code below should solve the issue:

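# i, n_words and words_and_phrases are assumed to be initialized earlier in the method
while i < n_words:
    # Try the longest phrase (three words) first
    if i + 2 < n_words:
        phrase = f'{self.words[i]} {self.words[i+1]} {self.words[i+2]}'

        if phrase in self.__lexicon__:
            words_and_phrases.append(phrase)
            i += 3
            continue  # restart matching from the new index
    # Then try a two-word phrase
    if i + 1 < n_words:
        phrase = f'{self.words[i]} {self.words[i+1]}'

        if phrase in self.__lexicon__:
            words_and_phrases.append(phrase)
            i += 2
            continue  # restart matching from the new index
    # No phrase matched: keep the single token
    if i < n_words:
        words_and_phrases.append(self.words[i])
        i += 1

self.words_and_phrases = words_and_phrases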
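For anyone who wants to try the idea outside the class, here is a minimal standalone sketch of the same longest-match-first loop. The function name `match_longest_first` and the `max_span` parameter are illustrative assumptions, not part of NRCLex; the `break` plays the role of `continue` in the snippet above.

```python
def match_longest_first(tokens, lexicon, max_span=3):
    """Greedy longest-match grouping of tokens into lexicon phrases.

    Illustrative helper, not part of NRCLex.
    """
    result = []
    i = 0
    n_words = len(tokens)

    while i < n_words:
        # Try spans from the longest allowed down to two words.
        for span in range(min(max_span, n_words - i), 1, -1):
            phrase = ' '.join(tokens[i:i + span])
            if phrase in lexicon:
                result.append(phrase)
                i += span
                break  # stop trying shorter spans; restart from the new index
        else:
            # No multi-word phrase matched: keep the single token.
            result.append(tokens[i])
            i += 1

    return result


lexicon = {'this is word', 'word'}
curr_tokens = ['this', 'is', 'word', 'this', 'is', 'word']
print(match_longest_first(curr_tokens, lexicon))
```

With the example from the comment above, this prints ['this is word', 'this is word'], so both occurrences of the phrase are kept whole.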

Regards!