production-ml / the-hat-game

Learn word embedding and service deployment by playing the Hat Game
MIT License

Check if the word exists in English language #1

Open aguschin opened 2 years ago

aguschin commented 2 years ago

Stop using wordnet.synsets to filter words in guessing. It is currently a workaround to remove non-existent words (otherwise, for example, you can explain "hatter" with "hattter").

One option is to use a large English dictionary. Other suggestions are welcome.

The code line where this happens: https://gitlab.com/production-ml/the-hat-game/-/blob/master/the_hat_game/game.py#L59

naidenovaleksei commented 2 years ago

It may be a good idea to collect all possible words from the dataset and filter words against them. If the dataset is not defined, the default check could filter words by NLTK's words corpus (nltk.corpus.words instead of wordnet.synsets).

P.S. A similar problem is discussed here.

P.P.S. Adding another third-party library (such as pyenchant) may not be convenient, so NLTK's corpus is a good choice.
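
A minimal sketch of the dataset-first filter with an NLTK fallback, as proposed in this comment. The names `filter_known_words` and `load_default_vocabulary` are hypothetical and are not taken from the repository; the sketch also assumes the dataset vocabulary is passed in as a set of lowercase words.

```python
from typing import Iterable, List, Optional, Set


def load_default_vocabulary() -> Set[str]:
    """Fallback vocabulary: NLTK's words corpus, lowercased.

    Assumes the corpus was downloaded once with nltk.download("words").
    """
    # Imported lazily so NLTK is only needed when no dataset is given.
    from nltk.corpus import words as nltk_words

    return {word.lower() for word in nltk_words.words()}


def filter_known_words(
    candidates: Iterable[str],
    dataset_vocabulary: Optional[Set[str]] = None,
) -> List[str]:
    """Keep only candidates found in the vocabulary, preserving their order."""
    if dataset_vocabulary is not None:
        vocabulary = dataset_vocabulary
    else:
        vocabulary = load_default_vocabulary()
    # The vocabulary is assumed to be lowercase, so compare lowercase forms.
    return [word for word in candidates if word.lower() in vocabulary]


# Hypothetical dataset vocabulary for illustration only.
example_vocabulary = {"hat", "hatter", "opengl"}
print(filter_known_words(["hattter", "hatter", "OpenGL"], example_vocabulary))
# -> ['hatter', 'OpenGL']
```

Using a set keeps each lookup constant-time, which matters if the check runs for every word a player submits.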

aguschin commented 2 years ago

Good point, @naidenovaleksei! One question: are there any benefits to using pyenchant for this simple task? If not, I think we can use nltk when the dataset is not defined, because we already use it for other text processing tasks here.

aguschin commented 2 years ago

Found this in the logs. Could it be connected with wordnet not recognizing some words?

2021-07-03 13:16:10,361 - the_hat_game.loggers - INFO - EXPLAINING PLAYER (Make Hat Game Again) to HOST: my wordlist is ['opengl', 'compatibility', 'rearchitecting', 'upgrading', 'backporting', 'reimplemented', 'gtk', 'toolkits', 'rewriting', 'directx']
2021-07-03 13:16:10,361 - the_hat_game.loggers - INFO - HOST TO EXPLAINING PLAYER (Make Hat Game Again): cleaning your word list. Now the list is ['compatibility', 'upgrading', 'rewriting']

2021-07-03 13:16:14,925 - the_hat_game.loggers - INFO - HOST to EXPLAINING PLAYER (LAZY ILON): the word is "usenet"
2021-07-03 13:16:15,207 - the_hat_game.loggers - INFO - EXPLAINING PLAYER (LAZY ILON) to HOST: my wordlist is ['newsgroups', 'nntp', 'crossposted', 'pcboard', 'bbses', 'crossposting', 'newsgroup', 'cypherpunks', 'crosspost', 'funet']
2021-07-03 13:16:15,207 - the_hat_game.loggers - INFO - HOST TO EXPLAINING PLAYER (LAZY ILON): cleaning your word list. Now the list is ['bbses']
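
The logs show domain terms such as 'opengl' and 'directx' being dropped from the word list. As a rough illustration of the dataset-derived vocabulary suggested earlier in the thread, the sketch below collects tokens from a toy corpus and filters a word list against them. The function name `build_vocabulary`, the regex tokenizer, and the two sample sentences are assumptions made for this example, not code or data from the repository.

```python
import re
from typing import Iterable, Set

# Simple tokenizer assumption: a word is a run of ASCII letters.
TOKEN_PATTERN = re.compile(r"[a-z]+")


def build_vocabulary(texts: Iterable[str]) -> Set[str]:
    """Collect every lowercase alphabetic token that appears in the texts."""
    vocabulary: Set[str] = set()
    for text in texts:
        vocabulary.update(TOKEN_PATTERN.findall(text.lower()))
    return vocabulary


# Toy stand-in for the game's dataset (hypothetical sentences).
texts = [
    "Rearchitecting the renderer meant backporting OpenGL fixes.",
    "The GTK and DirectX toolkits were reimplemented for compatibility.",
]
vocabulary = build_vocabulary(texts)

wordlist = ["opengl", "compatibility", "backporting", "gtk", "directx", "hattter"]
cleaned = [word for word in wordlist if word in vocabulary]
print(cleaned)
# -> ['opengl', 'compatibility', 'backporting', 'gtk', 'directx']
```

With this approach a word is accepted if it occurs anywhere in the dataset, so technical terms survive while a made-up spelling like "hattter" is still rejected, provided it never appears in the data.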

naidenovaleksei commented 2 years ago

Looks like it is related.

You can see below that neither the WordNet word list (wordnet.words()) nor the synsets lookup (wordnet.synsets) covers all English words. The same goes for nltk.corpus.words.

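from nltk.corpus import wordnet
from nltk.corpus import words as nltk_words

# Requires the corpora once: nltk.download("wordnet"); nltk.download("words")
# Build the lookup sets once instead of rescanning each corpus for every word.
wordnet_lemmas = set(wordnet.words())
nltk_vocabulary = set(nltk_words.words())

wordlist = ['compatibility', 'rewriting', 'upgrading', 'backporting']
print("word" + " " * 4, "synsets", "wordnet", "nltk_words", sep="\t")
for word in wordlist:
    # synsets() also tries base forms of the word, so it can match inflected forms.
    word_in_synsets = len(wordnet.synsets(word)) > 0
    word_in_wordnet = word in wordnet_lemmas
    word_in_nltk_words = word in nltk_vocabulary
    print(word, word_in_synsets, word_in_wordnet, word_in_nltk_words, sep="\t")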
> word           synsets  wordnet  nltk_words
> compatibility  True     True     True
> rewriting      True     True     False
> upgrading      True     False    False
> backporting    False    False    False

So I think this will fix it.