Few backend "issues"/questions

vincerubinetti commented 2 years ago

Some things I noticed while testing.

1. Is the search case sensitive? "Nintendo" does not return any results but "nintendo" does. I can obviously make it case insensitive on the frontend, but maybe this should be fixed on the backend. Or maybe it's not really a problem, and the model distinguishes meaning between different capitalizations? If so, I wonder why there's a discrepancy in that specific Nintendo example.
2. Do you want to add a special code/status for no results, like for "Nintendo"? Again, I can just check for an empty neighbors or frequency array, but perhaps it's better to do that on the backend? Maybe you can check whether the token is in the model before even running the query?
3. When waiting for an uncached query to return, a user can refresh or otherwise re-search, probably doubling the query on the backend. Should I do anything special to prevent that? Or is it safe (but not ideal) to just let the second, duplicate query complete?
4. Ultra minor, but maybe it's better to return an array/tuple for changepoints instead of e.g. "2000-2001"?
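To make the last point concrete, here is a minimal sketch of the difference, assuming a changepoint is a pair of years. The helper name `parse_changepoint` and the `changepoints` key are made up for illustration and are not taken from the actual backend.

```python
import json
from typing import Tuple


def parse_changepoint(span: str) -> Tuple[int, int]:
    """Split a 'YYYY-YYYY' string into a (start, end) pair of ints.

    Hypothetical helper: this is the parsing the frontend would need
    if the backend keeps sending strings like "2000-2001".
    """
    start, end = span.split("-")
    return int(start), int(end)


# If the backend builds the pair itself, no string parsing is needed downstream.
changepoint = parse_changepoint("2000-2001")

# Python's json module serializes a tuple as a JSON array.
payload = json.dumps({"changepoints": [changepoint]})
```

With the pair already split out, the frontend can index into it directly instead of splitting on the hyphen.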

falquaddoomi commented 2 years ago

Hey Vince, to answer your questions:

1. The search token is passed off verbatim to Gensim's KeyedVectors.most_similar() method. You'd have to ask @danich1 about how the tokens in the corpus were normalized (e.g., if they were made lowercase, stemmed, etc.). I assume from the behavior you saw that they are indeed case-sensitive, but I can't say if that's the desired behavior or not. If they need to be normalized in another way, I presume that David would have to retrain the models and re-upload them. I can of course do some processing on the tokens on the backend before I search for them in the models (e.g., convert them to lowercase), but it'll have to match what's in the models.
2. If it makes your life easier, I can inspect the response from searching the models after it returns, see if there are no results, and send you back something different. (Personally, I think it makes sense for a token that doesn't appear in the corpus to return an empty list.) AFAIK I have to run the query to see if it's in the model, so I don't think tokens that don't exist can be optimized away, unfortunately.
3. That's a good question...so, right now, queries that aren't cached launch a job on a task queue (RQ, specifically). I should add some logic to check if the query is already ongoing in a job and then wait on the existing job rather than launching a new one. In any case, I think that this is an issue that should be handled on the backend, since I can see what jobs are running there; no need to handle it on the frontend.
4. Sure, that's a good idea; I'll return a tuple instead (which AFAIK will be a list when it's turned into JSON).
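The "wait on the existing job" idea from the third answer could look roughly like the sketch below. It is only an illustration: it uses a standard-library `ThreadPoolExecutor` as a stand-in for RQ, and the `InFlightJobs` class and its method names are hypothetical. With RQ the same idea would presumably key on a job identifier derived from the query instead of an in-process dictionary.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InFlightJobs:
    """Share one Future per query for as long as that query is still running.

    Hypothetical helper; a thread pool stands in for the real task queue.
    """

    def __init__(self, executor):
        self._executor = executor
        # Re-entrant lock: a done-callback may fire on the submitting thread
        # if the job finishes before the callback is attached.
        self._lock = threading.RLock()
        self._pending = {}  # query string -> Future

    def submit(self, query, fn):
        """Return the running job for `query`, or start one if none exists."""
        with self._lock:
            future = self._pending.get(query)
            if future is None:
                future = self._executor.submit(fn, query)
                self._pending[query] = future
                # Forget the job once it finishes so later searches re-run it
                # (or, in the real service, hit the cache instead).
                future.add_done_callback(lambda _f, q=query: self._forget(q))
            return future

    def _forget(self, query):
        with self._lock:
            self._pending.pop(query, None)
```

A second `submit("nintendo", ...)` issued while the first is still running gets back the same `Future`, so the expensive lookup runs once and both callers wait on the same result.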

vincerubinetti commented 2 years ago

Sounds good.

Regarding the empty lists, should I consider there to be no results if any of the lists (neighbors, frequencies, umap) are empty? Or should I just hide the individual charts that are empty?

This is mainly what I was asking, I guess. I believe I saw one where neighbors were empty but frequency was not. Just making sure that's not a bug and that those frequency values would still be meaningful.

falquaddoomi commented 2 years ago

I think @danich1 could give a better answer here, but I assume that a word could have no neighbors but still occur, and thus have a frequency. Still, good that you flagged that as a possible issue; it deserves to be double-checked.

danich1 commented 2 years ago

> The search token is passed off verbatim to Gensim's KeyedVectors.most_similar() method. You'd have to ask @danich1 about how the tokens in the corpus were normalized (e.g., if they were made lowercase, stemmed, etc.). I assume from the behavior you saw that they are indeed case-sensitive, but I can't say if that's the desired behavior or not. If they need to be normalized in another way, I presume that David would have to retrain the models and re-upload them. I can of course do some processing on the tokens on the backend before I search for them in the models (e.g., convert them to lowercase), but it'll have to match what's in the models.

@falquaddoomi hit the nail on the head here. Most of the tokens have been preprocessed to be lowercase. There shouldn't be any case sensitivity issues here.

> Regarding the empty lists, should I consider there to be no results if any of the lists (neighbors, frequencies, umap) are empty? Or should I just hide the individual charts that are empty?
>
> This is mainly what I was asking, I guess. I believe I saw one where neighbors were empty but frequency was not. Just making sure that's not a bug and that those frequency values would still be meaningful.

Strange... the frequency comes from the word2vec models. This means that if the word is present in the model, then there have to be neighbors for the given word. There's just no guarantee the neighbors make sense 😂. Is this just in theory or is there some edge case I didn't realize was present?

falquaddoomi commented 2 years ago

@danich1 Neat that they've been processed to be lowercase. I suppose I should convert the query to lowercase on the backend before searching for it, then?

danich1 commented 2 years ago

> @danich1 Neat that they've been processed to be lowercase. I suppose I should convert the query to lowercase on the backend before searching for it, then?

That would be ideal, then we won't have to worry about case issues.
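The normalization agreed on here can be sketched as follows, assuming lowercasing is the only preprocessing that needs to be mirrored. The function names are hypothetical and a plain dict stands in for the model's vocabulary. In Gensim 4.x, `KeyedVectors.key_to_index` should be usable as that mapping, which might also allow a membership check before running the query, though that is untested against these models.

```python
def normalize_token(raw: str) -> str:
    """Mirror the corpus preprocessing (assumed here to be lowercasing only)."""
    return raw.lower()


def lookup(raw: str, vocabulary) -> "str | None":
    """Return the normalized token if the vocabulary contains it, else None.

    `vocabulary` is any container supporting `in`; a dict is used below
    as a stand-in for the real model's token index.
    """
    token = normalize_token(raw)
    return token if token in vocabulary else None


# Stand-in vocabulary with a single made-up entry.
vocabulary = {"nintendo": 0}
found = lookup("Nintendo", vocabulary)   # matches after lowercasing
missing = lookup("Sega", vocabulary)     # not in the stand-in vocabulary
```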

vincerubinetti commented 2 years ago

> Is this just in theory or is there some edge case I didn't realize was present?

I was pretty sure I saw it, but my memory is pretty unreliable. ~~I'll try to find an example again.~~ I just looked through all the cached words and couldn't find it, so clearly I'm misremembering something.

Either way, it sounds like I should just treat the results as blank (and thus invalid) if any of the top-level response fields are blank/empty.
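That rule is small enough to sketch. The version below is written in Python for consistency with the backend, although the real check would live in the frontend, and the field names `neighbors`, `frequency`, and `umap` are assumed from this thread rather than taken from the actual response schema.

```python
# Assumed top-level fields of a search response; the real schema may differ.
REQUIRED_FIELDS = ("neighbors", "frequency", "umap")


def has_results(response: dict) -> bool:
    """Treat the response as valid only if every required field is non-empty.

    A missing key, None, or an empty list all count as "no results".
    """
    return all(response.get(field) for field in REQUIRED_FIELDS)
```

Under this rule a response with frequencies but no neighbors is rejected as a whole, rather than rendering only some of the charts.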

vincerubinetti commented 2 years ago

I believe these were all addressed.

greenelab / word-lapse

Few backend "issues"/questions #19