aguschin opened 2 years ago
It may be a good idea to collect all possible words from the dataset and filter words by them. The default check, when the dataset is not defined, could filter words by NLTK's words corpus (`nltk.corpus.words`) instead of `wordnet.synsets`.
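A minimal sketch of that dataset-based filter (all names here are hypothetical illustrations, not the game's actual API):

```python
import re


def build_vocabulary(texts):
    """Collect every lowercase word token that appears in the dataset texts."""
    vocab = set()
    for text in texts:
        vocab.update(re.findall(r"[a-z]+", text.lower()))
    return vocab


def clean_wordlist(wordlist, vocabulary):
    """Keep only words that actually occur in the dataset vocabulary."""
    return [w for w in wordlist if w in vocabulary]


# Toy dataset standing in for the real game corpus
dataset = ["Rewriting the OpenGL toolkits", "upgrading for compatibility"]
vocab = build_vocabulary(dataset)
print(clean_wordlist(["opengl", "upgrading", "hattter"], vocab))
# -> ['opengl', 'upgrading']
```

Invented words like "hattter" are rejected automatically, since they cannot occur in the dataset.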
P.S. A similar problem is discussed here.
P.P.S. Pulling in yet another third-party library (such as pyenchant) may be inconvenient, so NLTK's corpus is a good choice.
Good point, @naidenovaleksei!
One question: are there any benefits to using pyenchant for this simple task? If not, I think we can use nltk when the dataset is not defined, since we already use it for other text processing tasks here.
Found this in the logs. Could it be connected with wordnet not recognizing some words?
```
2021-07-03 13:16:10,361 - the_hat_game.loggers - INFO - EXPLAINING PLAYER (Make Hat Game Again) to HOST: my wordlist is ['opengl', 'compatibility', 'rearchitecting', 'upgrading', 'backporting', 'reimplemented', 'gtk', 'toolkits', 'rewriting', 'directx']
2021-07-03 13:16:10,361 - the_hat_game.loggers - INFO - HOST TO EXPLAINING PLAYER (Make Hat Game Again): cleaning your word list. Now the list is ['compatibility', 'upgrading', 'rewriting']
2021-07-03 13:16:14,925 - the_hat_game.loggers - INFO - HOST to EXPLAINING PLAYER (LAZY ILON): the word is "usenet"
2021-07-03 13:16:15,207 - the_hat_game.loggers - INFO - EXPLAINING PLAYER (LAZY ILON) to HOST: my wordlist is ['newsgroups', 'nntp', 'crossposted', 'pcboard', 'bbses', 'crossposting', 'newsgroup', 'cypherpunks', 'crosspost', 'funet']
2021-07-03 13:16:15,207 - the_hat_game.loggers - INFO - HOST TO EXPLAINING PLAYER (LAZY ILON): cleaning your word list. Now the list is ['bbses']
```
Looks like it is related. You can see below that neither the Wordnet dictionary nor the Synset dictionary contains all English words; `nltk.corpus.words` has the same problem.
```python
from nltk.corpus import wordnet
from nltk.corpus import words as nltk_words

wordlist = ['compatibility', 'rewriting', 'upgrading', 'backporting']
print("word" + " " * 4, "synsets", "wordnet", "nltk_words", sep="\t")
for word in wordlist:
    word_in_wordnet = word in wordnet.words()
    word_in_synsets = len(wordnet.synsets(word)) > 0
    word_in_nltk_words = word in nltk_words.words()
    print(word, word_in_synsets, word_in_wordnet, word_in_nltk_words, sep="\t")
```
> word synsets wordnet nltk_words
> compatibility True True True
> rewriting True True False
> upgrading True False False
> backporting False False False
So I think this will fix it.
Stop using `wordnet.synsets` to filter words in guessing. This is a workaround to remove non-existent words (otherwise, for example, you could explain "hatter" with "hattter"). One option is to use some huge English dictionary. Other suggestions are welcome.
The code line where this happens: https://gitlab.com/production-ml/the-hat-game/-/blob/master/the_hat_game/game.py#L59