Closed vincerubinetti closed 2 years ago
Hey Vince, to answer your questions:
The search token is passed verbatim to Gensim's KeyedVectors.most_similar() method. You'd have to ask @danich1 about how the tokens in the corpus were normalized (e.g., if they were made lowercase, stemmed, etc.). I assume from the behavior you saw that they are indeed case-sensitive, but I can't say if that's the desired behavior or not. If they need to be normalized in another way, I presume that David would have to retrain the models and re-upload them. I can of course do some processing on the tokens on the backend before I search for them in the models (e.g., convert them to lowercase), but it'll have to match what's in the models.
Sounds good.
Regarding the empty lists, should I consider there to be no results if any of the lists (neighbors, frequencies, umap) are empty? Or should I just hide individual charts that are empty?
This is mainly what I was asking, I guess. I believe I saw one where neighbors were empty but frequency was not. Just making sure that's not a bug and that those frequency values would still be meaningful.
I think @danich1 could give a better answer here, but I assume that a word could have no neighbors but still occur, and thus have a frequency. Still, good that you flagged that as a possible issue; it deserves to be double-checked.
The search token is passed verbatim to Gensim's KeyedVectors.most_similar() method. You'd have to ask @danich1 about how the tokens in the corpus were normalized (e.g., if they were made lowercase, stemmed, etc.). I assume from the behavior you saw that they are indeed case-sensitive, but I can't say if that's the desired behavior or not. If they need to be normalized in another way, I presume that David would have to retrain the models and re-upload them. I can of course do some processing on the tokens on the backend before I search for them in the models (e.g., convert them to lowercase), but it'll have to match what's in the models.
@falquaddoomi hit the nail on the head here. Most of the tokens have been preprocessed to be lowercase. There shouldn't be any case sensitivity issues here.
Regarding the empty lists, should I consider there to be no results if any of the lists (neighbors, frequencies, umap) are empty? Or should I just hide individual charts that are empty?
This is mainly what I was asking, I guess. I believe I saw one where neighbors were empty but frequency was not. Just making sure that's not a bug and that those frequency values would still be meaningful.
Strange, since the frequency comes from the word2vec models. That means if a word is present in the model, it has to have neighbors. There's just no guarantee the neighbors make sense 😂. Is this just in theory or is there some edge case I didn't realize was present?
@danich1 Neat that they've been processed to be lowercase. I suppose I should convert the query to lowercase on the backend before searching for it, then?
That would be ideal; then we won't have to worry about case issues.
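For reference, the backend normalization discussed here could look roughly like the sketch below. It assumes the model object behaves like a Gensim KeyedVectors instance (supports `in` checks and `most_similar()`); the function name `find_neighbors` and the choice to return an empty list for unknown tokens are hypothetical, not taken from the actual backend.

```python
from typing import List, Tuple


def find_neighbors(model, query: str, topn: int = 10) -> List[Tuple[str, float]]:
    """Look up nearest neighbors for a query after lowercasing it.

    `model` is assumed to be KeyedVectors-like: it supports `token in model`
    and `model.most_similar(token, topn=...)`.
    """
    # Normalize the query the same way the corpus tokens were (lowercase).
    token = query.strip().lower()

    # An empty or out-of-vocabulary token yields no results instead of
    # letting most_similar() raise a KeyError.
    if not token or token not in model:
        return []

    return model.most_similar(token, topn=topn)
```

Since only "most" of the tokens were lowercased, any token stored with uppercase characters would be unreachable with this approach; whether that matters depends on how the corpus was actually preprocessed.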
Is this just in theory or is there some edge case I didn't realize was present?
I was pretty sure I saw it, but my memory is pretty unreliable. I'll try to find an example again. I just looked through all the cached words and couldn't find it, so clearly I'm misremembering something.
Either way, it sounds like I should just be treating the results as blank (and thus invalid) if any of the top-level response fields are blank/empty.
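A minimal sketch of that rule, written in Python for illustration (the frontend would do the equivalent in its own language). The key names `neighbors`, `frequencies`, and `umap` are assumed from the list names mentioned earlier in the thread; the real response fields may differ, and `has_results` is a hypothetical helper name.

```python
from typing import Any, Mapping

# Assumed top-level response fields, based on the lists named in this thread.
REQUIRED_FIELDS = ("neighbors", "frequencies", "umap")


def has_results(response: Mapping[str, Any]) -> bool:
    """Return True only if every required field is present and non-empty."""
    # A missing key, None, or an empty list/dict all count as "blank".
    return all(response.get(field) for field in REQUIRED_FIELDS)
```

With this check, a response where neighbors are empty but frequencies are not is treated as having no results at all, rather than hiding only the empty chart.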
I believe these were all addressed.
Some things I noticed while testing.