mikeizbicki opened this issue 3 years ago
Mike,
You were right about the fasttext library producing 'similar' words that contain typos and gibberish. For instance, here are the 40 most 'similar' words to 'weapon':
fasttext similar words = [(0.8148843050003052, 'weapons'), (0.7256712317466736, 'weapon.The'), (0.7101617455482483, 'weapon.'), (0.7019676566123962, 'weapon-'), (0.6988153457641602, 'weopon'), (0.6949812769889832, 'weapon.It'), (0.6926062703132629, 'weaponry'), (0.6849159002304077, 'wepon'), (0.6755298376083374, 'weapon.I'), (0.6695926785469055, 'weapon.This'), (0.6620141863822937, 'weapon.In'), (0.6541140079498291, 'Weapon'), (0.6532132625579834, 'pistol'), (0.6498683094978333, 'weapo'), (0.6451406478881836, 'weapon.But'), (0.6440314650535583, 'weapon.A'), (0.6398204565048218, 'non-weapon'), (0.6261772513389587, 'weapons.This'), (0.6217378377914429, 'weopons'), (0.6203930974006653, 'weapons.The'), (0.6172155737876892, 'weapons.It'), (0.6166418194770813, 'weaponary'), (0.6125482320785522, 'weapons.A'), (0.6121256947517395, 'alt-fire'), (0.6051745414733887, 'weapons.'), (0.6039741635322571, 'handgun'), (0.6027325987815857, 'weapon-like'), (0.6017128825187683, 'wepons'), (0.5952125787734985, 'weapon.He'), (0.5930343270301819, 'bowgun'), (0.5924347639083862, 'weapons.As'), (0.5918095707893372, 'weapons.I'), (0.5915751457214355, 'arsenal'), (0.5906872153282166, 'firearm'), (0.5905217528343201, 'gun'), (0.5893937349319458, 'crossbow'), (0.5870627164840698, 'fire-arm'), (0.5837990045547485, 'weapons.And'), (0.5828394889831543, 'sub-weapon'), (0.5805409550666809, 'wielder')]
which, after lemmatizing and filtering, returns these 5 words:
['weaponthe', 'weopon', 'weaponit', 'weaponry', 'wepon']
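Here is a toy reproduction of that lemmatize-and-filter step on the first few fasttext neighbors. This is only a sketch of the idea: a real implementation would use a proper lemmatizer, and the trailing-`s` strip below is a crude stand-in for it.

```python
# Crude stand-in for the lemmatize-and-filter step: lowercase, drop
# non-letters, strip a plural 's', and drop words equal to the query.

def normalize(word):
    # drop non-letters and lowercase (so 'weapon.The' -> 'weaponthe')
    w = ''.join(c for c in word.lower() if c.isalpha())
    # crude lemmatization stand-in: strip a plural 's'
    return w[:-1] if w.endswith('s') else w

def top_similar(neighbors, query, n):
    seen, out = set(), []
    for _score, word in neighbors:
        w = normalize(word)
        if w and w != query and w not in seen:
            seen.add(w)
            out.append(w)
        if len(out) == n:
            break
    return out

neighbors = [(0.815, 'weapons'), (0.726, 'weapon.The'), (0.710, 'weapon.'),
             (0.702, 'weapon-'), (0.699, 'weopon'), (0.695, 'weapon.It'),
             (0.693, 'weaponry'), (0.685, 'wepon')]
print(top_similar(neighbors, 'weapon', 5))
# → ['weaponthe', 'weopon', 'weaponit', 'weaponry', 'wepon']
```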
We notice that there is a decent number of 'good' similar words in there, e.g. 'firearm', 'weaponry', 'crossbow', 'arsenal', and 'gun'.
This tells me two things:
So, to handle this aspect of fasttext in the `augments_fasttext` function, I'm currently retrieving `n*10` similar words from fasttext, lemmatizing them, and then working on a strategy for filtering the `n*10` similar words down to the most 'legitimate' `n` words to actually return.
For this filtering strategy, I'm thinking of using a spellcheck library to check whether a 'similar' word is indeed a legitimate word in the provided language, and then returning the first `n` 'legitimate' words.
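A sketch of what this filtering strategy could look like. In a real implementation the dictionary lookup would come from a per-language spellcheck library; here a plain Python set stands in for it so the example is self-contained, and the candidate list and dictionary contents are illustrative.

```python
# Hypothetical sketch of the proposed spellcheck filter: keep the first
# n candidates that the dictionary recognizes as legitimate words.

def filter_legitimate(candidates, dictionary, n):
    """candidates: (similarity, word) pairs sorted by similarity.
    dictionary: set of words considered legitimate in the language
    (stand-in for a real spellcheck library lookup)."""
    legitimate = []
    for _score, word in candidates:
        w = word.lower()
        if w in dictionary:
            legitimate.append(w)
        if len(legitimate) == n:
            break
    return legitimate

# toy dictionary standing in for a real spellchecker
dictionary = {'weapons', 'weaponry', 'pistol', 'handgun', 'firearm', 'gun'}
candidates = [(0.81, 'weapons'), (0.73, 'weapon.The'), (0.70, 'weopon'),
              (0.69, 'weaponry'), (0.65, 'pistol'), (0.60, 'handgun')]
print(filter_legitimate(candidates, dictionary, 3))
# → ['weapons', 'weaponry', 'pistol']
```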
Does this seem like a reasonable approach to you? Or do you have a different filtering strategy that you would prefer?
Thanks,
Joey
> So, to handle this aspect of fasttext in the augments_fasttext function, I'm currently retrieving `n*10` similar words from fasttext, lemmatizing them, and then working on a strategy for filtering the `n*10` similar words down to the most 'legitimate' `n` words to actually return.
Your intuition is correct, but inside the function is not the place to do this. Instead, the caller of the function would simply pass a large `n` value when using this function, as opposed to the gensim function.
We also don't want to do any filtering of the output. It's actually a good thing that we have all of those typos in the output. The downstream task is web search, and the web has lots of misspelled words. So when we search for `weapon`, we also want to get webpages that contain typos like `theweapon` in the results, and this is how that happens.
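To make the web-search point concrete, here is a toy illustration of why keeping the typos helps: if the similar words (typos included) are OR-ed into the query, pages that only contain a misspelling still match. The `expand_query` helper is hypothetical, not a function from the repo.

```python
# Toy illustration (hypothetical helper, not chajda's actual API):
# expanding the query with fasttext's noisy neighbors means pages
# containing only a misspelling like 'weopon' still match.

def expand_query(word, neighbors):
    # build a disjunctive query over the original word and its neighbors
    return ' OR '.join([word] + neighbors)

query = expand_query('weapon', ['weopon', 'wepon', 'weaponry'])
print(query)
# → weapon OR weopon OR wepon OR weaponry

# a page with only the typo still matches the expanded query
page = 'he drew his weopon slowly'
print(any(term in page.split() for term in query.split(' OR ')))
# → True
```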
We'll be working in the `notes` branch of the repo over the summer. Our first task is to create some more "augmenting" functions for search. In particular, there is a library called fasttext (https://fasttext.cc) that has word vectors for >100 different languages. I'd like to be able to use these word vectors for augmenting.
Currently, there is a function `augment_gensim` located at https://github.com/mikeizbicki/chajda/blob/notes/chajda/tsquery/augments.py#L8 This function uses word vectors provided by the gensim library, and only works in English. Your task is to create a new function `augment_fasttext` that does the same task using the fasttext library and works in all the supported languages.