Open stormbeforesunsetbee opened 1 year ago
Hello, I've just found a potential mistake in this new feature, in the method "words_and_phrases": it should continue (i.e. exit the current cycle) whenever it matches a phrase made of multiple words. As it is right now, it updates the index of the current word but keeps searching for phrases of 2 or 1 words, and automatically adds the single token if there's no match in the lexicon.
Here is an example to make this as clear as possible. Let's suppose we have:
lexicon = ['this is word', 'word']
curr_tokens = ['this', 'is', 'word', 'this', 'is', 'word']
When the "words_and_phrases" method generates phrases, it matches the first "this is word" (i.e. curr_span=3) and correctly increases the current index by 3. However, it then keeps looking for matches with curr_span<3 (i.e. curr_span=2 and curr_span=1) instead of resetting curr_span (by exiting the current cycle).
This code below should solve the issue:
    # Assumes i, n_words and words_and_phrases are initialised earlier in the method.
    while i < n_words:
        # Try a three-word phrase first.
        if i + 2 < n_words:
            phrase = f'{self.words[i]} {self.words[i+1]} {self.words[i+2]}'
            if phrase in self.__lexicon__:
                words_and_phrases.append(phrase)
                i += 3
                continue  # restart the search from the longest span
        # Then try a two-word phrase.
        if i + 1 < n_words:
            phrase = f'{self.words[i]} {self.words[i+1]}'
            if phrase in self.__lexicon__:
                words_and_phrases.append(phrase)
                i += 2
                continue
        # No phrase matched: keep the single token.
        words_and_phrases.append(self.words[i])
        i += 1
    self.words_and_phrases = words_and_phrases
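To check the idea outside the class, here is a minimal standalone sketch of the same greedy longest-match logic. The function name `match_words_and_phrases` and the `max_span` parameter are hypothetical and not part of the library; the sketch only assumes that the lexicon is a collection of space-separated phrases, as in the example above.

```python
def match_words_and_phrases(words, lexicon, max_span=3):
    """Greedy longest-match: prefer the longest lexicon phrase at each position.

    Hypothetical helper for illustration only; not the library's actual API.
    """
    lexicon = set(lexicon)
    result = []
    i = 0
    n_words = len(words)
    while i < n_words:
        # Try spans from longest to shortest multi-word phrase (3, then 2).
        for span in range(max_span, 1, -1):
            if i + span <= n_words:
                phrase = " ".join(words[i:i + span])
                if phrase in lexicon:
                    result.append(phrase)
                    i += span
                    break  # restart from the longest span at the new index
        else:
            # No multi-word phrase matched here: keep the single token.
            result.append(words[i])
            i += 1
    return result


lexicon = ['this is word', 'word']
curr_tokens = ['this', 'is', 'word', 'this', 'is', 'word']
print(match_words_and_phrases(curr_tokens, lexicon))
# → ['this is word', 'this is word']
```

The `break` plays the same role as the `continue` statements in the fix above: once a phrase matches, the search at the new index starts again from the longest span.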
Regards!