tomaarsen / TwitchMarkovChain

Twitch Bot for generating messages based on what it learned from chat
MIT License

bot breaking up words with single quotation marks in them #29

Closed: tradingaddict closed this 2 years ago

tradingaddict commented 2 years ago

Trying to generate sentences with `!g` and words such as "I'm", "haven't", "can't", etc. doesn't work. I believe they are saved properly in the database, and the bot uses them normally when it picks them itself, but it breaks if you try to put them in a `!g` command.

Typing "!g i've" makes the bot say "I haven't learned what to do with "i 've" yet." Typing "!g can't" makes it say "I haven't learned what to do with "ca n't" yet."

It does have data for the actual words in the database, but it looks like something goes wrong with splitting the words and putting them back together when generating.

tomaarsen commented 2 years ago

Hello!

Since cf14dda39f56fd61e550a688c22cb0354c3ad7c5 I indeed split punctuation off from other tokens. The goal here was to turn e.g. `hello,` into two tokens rather than just one, so the bot learns how to extend from `hello`. Similarly, from `I've` it learns `I` followed by `'ve`, so that it can extend `I`.
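
For reference, this is how NLTK's treebank-style tokenizer performs that splitting (a small illustration; the exact tokenizer call in the bot may differ):

```python
>>> from nltk.tokenize import TreebankWordTokenizer
>>> TreebankWordTokenizer().tokenize("hello, world")
['hello', ',', 'world']
>>> TreebankWordTokenizer().tokenize("I've seen that")
['I', "'ve", 'seen', 'that']
>>> TreebankWordTokenizer().tokenize("can't")
['ca', "n't"]
```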

The issue you're seeing here is that it actually does not know how to extend `["I", "'ve"]`, because of a bug that converted `I've` into `I 've` when generating, but kept the word as `I've` when learning. I've fixed that now.
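
Concretely (a simplified, hypothetical sketch of the mismatch, not the bot's actual data structures): the learned transitions were keyed on the unsplit word, while generation looked up the split form, so the lookup could never succeed:

```python
# Learning kept the contraction intact, so this key was stored:
transitions = {("I've",): ["seen"]}

# Generation tokenized the user's "!g" input first, producing a key
# that was never stored, hence the "I haven't learned..." reply:
print(transitions.get(("I", "'ve")))  # None
print(transitions.get(("I've",)))     # ['seen']
```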

However, another small issue still exists: my "detokenizer", i.e. the function to turn `ca n't` back into `can't`, isn't always perfect:

```python
>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> TreebankWordDetokenizer().detokenize(["Do", "n't"])
"Do n't"
>>> TreebankWordDetokenizer().detokenize(["Do", "n't", "want"])
"Don't want"
>>> TreebankWordDetokenizer().detokenize(["I", "'ve"])
"I 've"
>>> TreebankWordDetokenizer().detokenize(["I", "'ve", "seen"])
"I've seen"
```

This uses NLTK's TreebankWordDetokenizer: with only two tokens the contraction is not merged back together, but with a trailing token it is. I can't really explain this behaviour at the moment, but it will require a fix in NLTK.
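
In the meantime, a possible stopgap (just a sketch based on the behaviour above, not the project's actual fix) would be to append a sentinel token before detokenizing, so the contraction always has a trailing token, and strip the sentinel afterwards:

```python
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Hypothetical workaround: the merge only seems to happen when the contraction
# is followed by another token, so add a sentinel and remove it afterwards.
SENTINEL = "\x00"  # assumed never to appear in real chat tokens

def detokenize(tokens: list[str]) -> str:
    text = TreebankWordDetokenizer().detokenize(tokens + [SENTINEL])
    return text.removesuffix(SENTINEL).rstrip()

print(detokenize(["Do", "n't"]))  # Don't
print(detokenize(["I", "'ve"]))  # I've
```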


Thanks for reporting this!

tomaarsen commented 2 years ago

This should be fixed now in the latest release. Feel free to upgrade. Your old database won't immediately update, but new entries should be correct now.

tomaarsen commented 2 years ago

I've opened an issue on NLTK for the detokenizing quirk I mentioned: https://github.com/nltk/nltk/issues/3069.