dshkol closed this issue 4 years ago
I'm torn on whether we should search for the entire string first and fall back to a tokenized search only if the result comes back empty, or just do a tokenized search right away. The latter option might break existing code if it returns more results than before, although I'm not sure how much of an issue that is.
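For illustration, the first option (whole-string search with a tokenized fallback) could look roughly like the sketch below. It is written in Python purely as a language-neutral sketch and is not the package's code; `search_labels` and the sample labels are hypothetical, and substring matching on lowercase text is an assumption about how "search" behaves.

```python
def search_labels(query, labels):
    """Whole-string search first; fall back to token matching if nothing is found."""
    q = query.lower().strip()

    # Step 1: old behaviour, match the entire query string as one substring.
    exact = [label for label in labels if q in label.lower()]
    if exact:
        return exact

    # Step 2: only when step 1 is empty, match on any individual token.
    tokens = q.split()
    return [
        label for label in labels
        if any(tok in label.lower() for tok in tokens)
    ]


labels = [
    "Total population",
    "Median household income",
    "Average income of households",
]
print(search_labels("household income", labels))  # whole-string hit
print(search_labels("income household", labels))  # falls back to tokens
```

Because the fallback only runs when the whole-string search is empty, queries that already returned results keep returning the same rows, which is the backward-compatibility concern raised above.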
Yeah, I think that makes sense
I've got two tokenized searches working.
One simply breaks down the query into tokens and then finds the result with the most matches. This essentially turns things into a keyword search.
Pros: good for quickly narrowing down results, a bit faster. Cons: may not always return the most semantically accurate results.
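A minimal sketch of this keyword-style approach, in Python for illustration only (it is not the implementation discussed in this thread). The names `tokenize` and `keyword_search`, the `top_n` cutoff, and the sample labels are all hypothetical; whole-token matching is an assumption.

```python
import re


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_search(query, labels, top_n=5):
    """Rank labels by how many query tokens they share (exact token matches only)."""
    query_tokens = set(tokenize(query))
    scored = []
    for label in labels:
        hits = len(query_tokens & set(tokenize(label)))
        if hits > 0:
            scored.append((hits, label))
    # Most matches first; the sort is stable, so ties keep their input order.
    scored.sort(key=lambda pair: -pair[0])
    return [label for _, label in scored[:top_n]]


labels = [
    "Total population",
    "Median household income",
    "Average household size",
    "Median age",
]
print(keyword_search("median household income", labels))
```

Note that exact token matching treats "households" and "household" as different tokens, which is one way this approach can miss semantically close results.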
The more complicated one breaks down the query and the vector list into ngrams and then does a string distance calculation between the ngrams in the query and the vector list and returns the ones with the smallest distance.
Pros: better at semantic search, better at handling typos, etc. Cons: a bit slower, doesn't do a great job identifying matches from keywords.
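The n-gram approach could be sketched roughly as follows, again in Python for illustration rather than as the actual implementation. It assumes word-level n-grams capped at trigrams and plain Levenshtein edit distance as the string distance; the thread does not say which distance is used. `word_ngrams`, `levenshtein`, and `ngram_search` are hypothetical names.

```python
def word_ngrams(text, n):
    """Return all runs of n consecutive words in text, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def ngram_search(query, labels, top_n=3):
    """Rank labels by the smallest distance between query and label n-grams."""
    n = min(len(query.split()), 3)  # unigrams, bigrams, or trigrams
    if n == 0:
        return []
    query_grams = word_ngrams(query, n)
    scored = []
    for label in labels:
        label_grams = word_ngrams(label, n)
        if not label_grams:  # label is shorter than the n-gram size
            continue
        best = min(levenshtein(q, g) for q in query_grams for g in label_grams)
        scored.append((best, label))
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return [label for _, label in scored[:top_n]]


labels = ["Total population", "Median household income", "Average household size"]
print(ngram_search("houshold incme", labels, top_n=1))
```

Comparing every query n-gram against every label n-gram is what makes this slower than the keyword version, but it lets a misspelled query such as "houshold incme" still land on the closest label.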
Thoughts for implementation?
Hmm. I like both approaches. I guess it comes down to what problem it is trying to solve. I still regularly use the web interface to select variables I want. It's not ideal either, but gets me what I need fast. The combination of searching and still having the hierarchy visually easy to interpret is really nice.
So my question is, what solution would make me stop going to the web interface? Or at least reduce the web interface usage. I am guessing that either will reduce my usage of the interface. If I know what I am looking for, getting a good match from the command line is awesome. If I am fishing around and exploring variables, I will probably head to the web interface. Based on that, I think the second option would be more useful. At least to me.
Another option would be to turn the variable search into a shiny app. That way we could have the benefits of both worlds, better search and matching functionality and the interface to provide hierarchy context and an easy visual way to navigate. But I think that's overkill.
Shiny might be overkill. A similarly over-engineered approach that I'm also loath to do, but could be a long-term goal, would be to use htmlwidgets to add interactive hierarchical variable discovery. We can replicate the hierarchical radial chart on censusmapper for variable discovery. This way you get the interactivity without needing the reactivity of shiny. Calling this could trigger a browser window with the full interactive visual. Can call it `explore_census_variables()`.
That could add a lot of heavy dependencies with htmltools, but we can make it so it prompts users with the option to install required packages if they call that function. Actually, I kind of like this approach.
Ok, how about this plan:

Short-term:
- `search_census_vectors()`: keep as is, but with a warning
- `find_census_vectors()` with an argument `option = c("keywords", "query")`, which performs the search on either keywords (tokenized unigrams) without any approximate matching, or queries (unigrams, bigrams, and trigrams) with string-distance-based best matching.

Med-term:
- `explore_census_variables()` with interactive discovery, but not necessarily using shiny. An approach using htmltools with visuals and search should suffice.

How about just keeping it under `search_census_vectors()` and adding the option parameter? We could add a third option that just does the old way of searching, as an easy way to update code for backward compatibility.
As to the shiny app, I was mostly joking. But how about just linking `explore_census_variables()` to the CensusMapper API tool (and make it open the variables tab). Could do the same thing for the region selection. And have the variable selection and region selection as copy-paste code up top on each of these tabs, instead of just the overview tab. And maybe clean up the GUI a bit. Seems like that would be easier and possibly just as useful.
I'd prefer to branch off into a separate function and keep the other one for backward compatibility only. If we don't care about backward compatibility then I'd just rewrite it from scratch as described above.
+1 on just opening the API tool page instead. That's simpler and can add that for the next update then.
Let's do a separate function then. It will break some old code if we replace it.
This was added in #143
Returns too many rows but
Returns nothing, not even related vectors. This is because the process for searching takes the entire string rather than tokenizing it.
We should tokenize the search term and search against the tokens rather than the complete term. Adding this is a to-do for the next version.