2021summer #2

Open mikeizbicki opened 3 years ago

mikeizbicki commented 3 years ago

We'll be working in the notes branch of the repo over the summer. Our first task is to create some more "augmenting" functions for search. In particular, there is a library called fasttext (https://fasttext.cc) that has word vectors for >100 different languages. I'd like to be able to use these word vectors for augmenting.

Currently, there is a function augment_gensim located at https://github.com/mikeizbicki/chajda/blob/notes/chajda/tsquery/augments.py#L8 This function uses word vectors provided by the gensim library, and only works in English. Your task is to create a new function augment_fasttext that does the same task using the fasttext library and works in all the supported languages.
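
For concreteness, a minimal sketch of what `augment_fasttext` might look like is below. The signature and the module-level model cache are assumptions (the real function should mirror whatever interface `augment_gensim` exposes); `download_model`, `load_model`, and `get_nearest_neighbors` are fasttext's standard Python API.

```python
import fasttext
import fasttext.util

# cache of loaded models, keyed by language code (assumption: one model per language)
_models = {}

def augment_fasttext(lang, word, n=5):
    '''
    Return up to n words similar to `word` in language `lang`, using the
    pretrained fasttext word vectors (cc.<lang>.300.bin).
    '''
    if lang not in _models:
        # downloads e.g. cc.en.300.bin on first use; these files are several GB each
        filename = fasttext.util.download_model(lang, if_exists='ignore')
        _models[lang] = fasttext.load_model(filename)
    neighbors = _models[lang].get_nearest_neighbors(word, k=n)
    # get_nearest_neighbors returns (score, word) tuples; keep only the words
    return [w for score, w in neighbors]
```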

joeybodoia commented 3 years ago

Mike,

You were right about the fasttext library producing 'similar' words that contain typos/gibberish. For instance, here are the 40 most 'similar' words to 'weapon':

fasttext similar words = [(0.8148843050003052, 'weapons'), (0.7256712317466736, 'weapon.The'), (0.7101617455482483, 'weapon.'), (0.7019676566123962, 'weapon-'), (0.6988153457641602, 'weopon'), (0.6949812769889832, 'weapon.It'), (0.6926062703132629, 'weaponry'), (0.6849159002304077, 'wepon'), (0.6755298376083374, 'weapon.I'), (0.6695926785469055, 'weapon.This'), (0.6620141863822937, 'weapon.In'), (0.6541140079498291, 'Weapon'), (0.6532132625579834, 'pistol'), (0.6498683094978333, 'weapo'), (0.6451406478881836, 'weapon.But'), (0.6440314650535583, 'weapon.A'), (0.6398204565048218, 'non-weapon'), (0.6261772513389587, 'weapons.This'), (0.6217378377914429, 'weopons'), (0.6203930974006653, 'weapons.The'), (0.6172155737876892, 'weapons.It'), (0.6166418194770813, 'weaponary'), (0.6125482320785522, 'weapons.A'), (0.6121256947517395, 'alt-fire'), (0.6051745414733887, 'weapons.'), (0.6039741635322571, 'handgun'), (0.6027325987815857, 'weapon-like'), (0.6017128825187683, 'wepons'), (0.5952125787734985, 'weapon.He'), (0.5930343270301819, 'bowgun'), (0.5924347639083862, 'weapons.As'), (0.5918095707893372, 'weapons.I'), (0.5915751457214355, 'arsenal'), (0.5906872153282166, 'firearm'), (0.5905217528343201, 'gun'), (0.5893937349319458, 'crossbow'), (0.5870627164840698, 'fire-arm'), (0.5837990045547485, 'weapons.And'), (0.5828394889831543, 'sub-weapon'), (0.5805409550666809, 'wielder')]

which, after lemmatizing and filtering, returns these 5 words:

['weaponthe', 'weopon', 'weaponit', 'weaponry', 'wepon']

We notice that there are a decent number of 'good' similar words in there, e.g. 'firearm', 'weaponry', 'crossbow', 'arsenal', 'gun'.

This tells me two things:

  1. It's clear we will need to retrieve significantly more than n+1 similar words from fasttext.
  2. We will need a different filtering strategy for returning the 'good' similar words.

So, to handle this in the augment_fasttext function, I'm currently retrieving n*10 similar words from fasttext, lemmatizing them, and then working on a strategy for filtering them down to the n most 'legitimate' words to actually return.

For this filtering strategy, I'm thinking of using a spellcheck library to check whether a 'similar' word is indeed a legitimate word in the provided language, and then returning the first n 'legitimate' words. Does this seem like a reasonable approach to you, or do you have a different filtering strategy that you would prefer?
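
A minimal sketch of this retrieve-then-filter idea, assuming pyspellchecker as the spellcheck library (it only supports a handful of languages, so this is purely illustrative), a `lemmatize` callable standing in for the lemmatization step already used in augments.py, and a hypothetical helper name:

```python
from spellchecker import SpellChecker  # pyspellchecker; one possible spellcheck library

def fasttext_similar_filtered(model, word, n, lang, lemmatize):
    # over-fetch: ask fasttext for n*10 neighbors, since many of them are typos/gibberish
    neighbors = model.get_nearest_neighbors(word, k=n * 10)
    candidates = [lemmatize(w) for score, w in neighbors]
    # keep only candidates that the spellchecker recognizes as real words in `lang`
    spell = SpellChecker(language=lang)
    legitimate = [w for w in candidates if spell.known([w])]
    # return the first n 'legitimate' words, preserving fasttext's similarity order
    return legitimate[:n]
```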

Thanks,

Joey

mikeizbicki commented 3 years ago

> So, to handle this in the augment_fasttext function, I'm currently retrieving n*10 similar words from fasttext, lemmatizing them, and then working on a strategy for filtering them down to the n most 'legitimate' words to actually return.

Your intuition is correct, but inside the function is not the place to do this. Instead, the caller should simply pass a larger n value when using this function than they would when using the gensim function.

We also don't want to do any filtering of the output. It's actually a good thing that we have all of those typos in the output. The downstream task is web search, and the web has lots of misspelled words. So when we search for 'weapon', we also want the results to include webpages that contain typos like 'theweapon', and this is how that happens.
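
In other words, the intended usage is roughly the following (function name and signature as in the earlier sketch, both still assumptions):

```python
# the caller simply over-fetches; no spellcheck or other filtering happens inside
# the function, so typo-like neighbors such as 'weopon' are kept on purpose
similar = augment_fasttext('en', 'weapon', n=40)
```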