Better vector search handling of key words

dshkol commented 4 years ago

search_census_vectors("income","CA16")

Returns too many rows but

search_census_vectors("median household income","CA16")

Returns nothing, not even related vectors. This is because the search passes the entire string to grep as a single pattern rather than tokenizing it.

search_census_vectors <- function(searchterm, dataset, type=NA, ...) {
  #to do: add caching of vector list here
  veclist <- list_census_vectors(dataset, ...)
  result <- veclist[grep(searchterm, veclist$label, ignore.case = TRUE),]
  # ... lots of other code
}

We should tokenize the search term and match against the individual tokens rather than the complete string. Adding this is a to-do for the next version.
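To make the difference concrete, here is a minimal sketch of whole-string versus tokenized matching. The `veclist` data frame is a made-up stand-in for what list_census_vectors() returns, and matching on any token is only one possible rule, not necessarily what the package will end up doing.

```r
# Made-up stand-in for the data frame returned by list_census_vectors().
veclist <- data.frame(
  vector = c("v_1", "v_2", "v_3", "v_4"),
  label = c("Median total income of households",
            "Average household size",
            "Median age",
            "Population, 2016"),
  stringsAsFactors = FALSE
)

searchterm <- "median household income"

# Current behaviour: the whole string is one pattern, so no label matches.
whole <- veclist[grep(searchterm, veclist$label, ignore.case = TRUE), ]

# Tokenized behaviour: split on non-alphanumeric characters, then keep any
# label that contains at least one of the tokens.
tokens <- strsplit(tolower(searchterm), "[^a-z0-9]+")[[1]]
tokens <- tokens[nzchar(tokens)]
pattern <- paste(tokens, collapse = "|")
tokenized <- veclist[grepl(pattern, veclist$label, ignore.case = TRUE), ]
```

Here `whole` is empty while `tokenized` keeps every label that mentions "median", "household", or "income", which also shows why some ranking of the matches is needed.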

mountainMath commented 4 years ago

I'm torn on whether we should search for the entire string first and fall back to a tokenized search only if that comes back empty, or just do a tokenized search right away. The latter option might break existing code if it returns more rows than before, although I'm not sure how much of an issue that is.
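A rough sketch of the first option (whole string first, tokens only as a fallback) could look like the following. The function name `search_with_fallback()` and the sample labels are hypothetical, used only for illustration.

```r
# Made-up stand-in for the data frame returned by list_census_vectors().
veclist <- data.frame(
  vector = c("v_1", "v_2", "v_3", "v_4"),
  label = c("Median total income of households",
            "Average total income of households",
            "Average household size",
            "Population, 2016"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: try the old whole-string match first and only fall
# back to token matching when that returns nothing.
search_with_fallback <- function(searchterm, veclist) {
  exact <- veclist[grepl(searchterm, veclist$label, ignore.case = TRUE), ]
  if (nrow(exact) > 0) return(exact)

  tokens <- strsplit(tolower(searchterm), "[^a-z0-9]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0) return(exact)

  pattern <- paste(tokens, collapse = "|")
  veclist[grepl(pattern, veclist$label, ignore.case = TRUE), ]
}

income <- search_with_fallback("income", veclist)
fallback <- search_with_fallback("median household income", veclist)
```

With this ordering, any search that already returned rows keeps returning exactly the same rows, so only previously empty searches change behaviour.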

dshkol commented 4 years ago

Yeah, I think that makes sense

dshkol commented 4 years ago

I've got two tokenized searches working.

One simply breaks down the query into tokens and then finds the result with the most matches. This essentially turns things into a keyword search.

Pros: good for quickly narrowing down results, a bit faster

Cons: may not always return the most semantically accurate results
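As an illustration of the keyword approach (not the actual implementation), the sketch below counts how many query tokens appear in each label and returns the labels with the highest count. `keyword_search()` and the sample labels are hypothetical.

```r
# Made-up stand-in for the data frame returned by list_census_vectors().
veclist <- data.frame(
  vector = c("v_1", "v_2", "v_3", "v_4"),
  label = c("Median total income of households",
            "Average total income of households",
            "Average household size",
            "Population, 2016"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rank labels by the number of query tokens they contain.
keyword_search <- function(query, veclist) {
  tokens <- unique(strsplit(tolower(query), "[^a-z0-9]+")[[1]])
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0 || nrow(veclist) == 0) return(veclist[0, ])

  labels <- tolower(veclist$label)
  # One logical vector per token; summing them gives a match count per label.
  # Matching is by substring, so "household" also matches "households".
  per_token <- lapply(tokens, function(tok) grepl(tok, labels, fixed = TRUE))
  n_matches <- as.integer(Reduce("+", per_token))

  best <- max(n_matches)
  if (best == 0) return(veclist[0, ])
  veclist[n_matches == best, ]
}

result <- keyword_search("median household income", veclist)
```

In this sample only the first label contains all three tokens, so it is the single row returned.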

The more complicated one breaks down the query and the vector list into ngrams and then does a string distance calculation between the ngrams in the query and the vector list and returns the ones with the smallest distance.

Pros: better at semantic search, better at handling typos, etc.

Cons: a bit slower, doesn't do a great job of identifying matches from keywords
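A sketch of the n-gram idea, under several assumptions: word-level unigrams to trigrams, base R's `utils::adist()` (generalized Levenshtein distance) as the string distance, and the mean of each query n-gram's best distance as the score. The helper names are hypothetical, and the working version may use a different distance or scoring rule.

```r
# Made-up stand-in for the data frame returned by list_census_vectors().
veclist <- data.frame(
  vector = c("v_1", "v_2", "v_3", "v_4"),
  label = c("Median total income of households",
            "Average household size",
            "Median age",
            "Population, 2016"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: word n-grams (n = 1..n_max) of a string.
word_ngrams <- function(x, n_max = 3) {
  words <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  words <- words[nzchar(words)]
  out <- character(0)
  for (n in seq_len(min(n_max, length(words)))) {
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(starts, function(i) {
      paste(words[i:(i + n - 1)], collapse = " ")
    }, character(1))
    out <- c(out, grams)
  }
  unique(out)
}

# Score one label: for each query n-gram take the closest label n-gram,
# then average those distances. Lower is better.
ngram_distance <- function(query, label) {
  q <- word_ngrams(query)
  l <- word_ngrams(label)
  if (length(q) == 0 || length(l) == 0) return(Inf)
  d <- utils::adist(q, l)
  mean(apply(d, 1, min))
}

# Hypothetical helper: return the n labels with the smallest score.
semantic_search <- function(query, veclist, n = 1) {
  scores <- vapply(veclist$label, ngram_distance, numeric(1),
                   query = query, USE.NAMES = FALSE)
  veclist[order(scores)[seq_len(min(n, nrow(veclist)))], ]
}

typo <- semantic_search("medain age", veclist)
```

Because the score is an edit distance rather than an exact match, the misspelled "medain age" still lands on "Median age", at the cost of computing a distance matrix for every label.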

Thoughts for implementation?

mountainMath commented 4 years ago

Hmm. I like both approaches. I guess it comes down to what problem it is trying to solve. I still regularly use the web interface to select variables I want. It's not ideal either, but gets me what I need fast. The combination of searching and still having the hierarchy visually easy to interpret is really nice.

So my question is, what solution would make me stop going to the web interface? Or at least reduce the web interface usage. I am guessing that either will reduce my usage of the interface. If I know what I am looking for, getting a good match from the command line is awesome. If I am fishing around and exploring variables, I will probably head to the web interface. Based on that, I think the second option would be more useful. At least to me.

Another option would be to turn the variable search into a shiny app. That way we could have the benefits of both worlds, better search and matching functionality and the interface to provide hierarchy context and an easy visual way to navigate. But I think that's overkill.

dshkol commented 4 years ago

Shiny might be overkill. A similarly over-engineered approach that I'm also loath to do, but could be a long-term goal, would be to use htmlwidgets to add interactive hierarchical variable discovery. We can replicate the hierarchical radial chart on censusmapper for variable discovery. This way you get the interactivity without needing the reactivity of shiny. Calling this could trigger a browser window with the full interactive visual. Can call it explore_census_variables().

That could add a lot of heavy dependencies with htmltools, but we can make it so it prompts users with the option to install required packages if they call that function. Actually, I kind of like this approach.
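The prompt-to-install idea could be sketched roughly like this. `require_optional()` is a hypothetical helper, not an existing function in the package; it relies only on base R's `requireNamespace()`, `utils::menu()`, and `utils::install.packages()`.

```r
# Hypothetical helper: check that optional packages are available and, in an
# interactive session, offer to install any that are not.
require_optional <- function(pkgs, ask = interactive()) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  absent <- pkgs[!available]
  if (length(absent) == 0) return(invisible(TRUE))

  if (ask) {
    title <- paste0("Install required packages (",
                    paste(absent, collapse = ", "), ")?")
    if (utils::menu(c("Yes", "No"), title = title) == 1) {
      utils::install.packages(absent)
      ok <- vapply(absent, requireNamespace, logical(1), quietly = TRUE)
      if (all(ok)) return(invisible(TRUE))
    }
  }
  stop("Missing required packages: ", paste(absent, collapse = ", "),
       call. = FALSE)
}
```

The exploration function would then call something like `require_optional(c("htmlwidgets", "htmltools"))` as its first line, so the heavy packages never have to be hard dependencies.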

Ok, how about this plan:

Short-term:

soft deprecate search_census_vectors() but keep as is with a warning
add find_census_vectors() with an argument `option = c("keywords", "query")`, which searches either on keywords (tokenized unigrams) without any approximate matching, or on queries (unigrams, bigrams, and trigrams) with string-distance-based best matching.
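As a sketch of how the short-term plan might be wired up: the old function keeps its behaviour and only gains a warning, while the new one dispatches on the `option` argument via `match.arg()`. The signatures here are hypothetical (they take the vector list directly to stay self-contained), and both matchers are simplified placeholders rather than the tokenized and n-gram searches described earlier.

```r
# Made-up stand-in for the data frame returned by list_census_vectors().
veclist <- data.frame(
  vector = c("v_1", "v_2", "v_3"),
  label = c("Median total income of households",
            "Average household size",
            "Median age"),
  stringsAsFactors = FALSE
)

# Soft deprecation: same results as before, plus a pointer to the new function.
search_census_vectors <- function(searchterm, veclist) {
  warning("search_census_vectors() is deprecated; use find_census_vectors().",
          call. = FALSE)
  veclist[grepl(searchterm, veclist$label, ignore.case = TRUE), ]
}

# Placeholder matcher: any-token match, no approximate matching.
keyword_match <- function(query, veclist) {
  tokens <- strsplit(tolower(query), "[^a-z0-9]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0) return(veclist[0, ])
  veclist[grepl(paste(tokens, collapse = "|"), veclist$label,
                ignore.case = TRUE), ]
}

# Placeholder matcher: closest label by edit distance on the whole string
# (a simplification of the n-gram approach).
distance_match <- function(query, veclist) {
  d <- utils::adist(tolower(query), tolower(veclist$label))
  veclist[which.min(d), ]
}

find_census_vectors <- function(query, veclist,
                                option = c("keywords", "query")) {
  option <- match.arg(option)  # errors on anything but the two choices
  switch(option,
         keywords = keyword_match(query, veclist),
         query = distance_match(query, veclist))
}

by_keyword <- find_census_vectors("household", veclist)
by_query <- find_census_vectors("median age", veclist, option = "query")
```

Using `match.arg()` means the first choice ("keywords") is the default and a misspelled option fails early with a clear error.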

Medium-term:

build explore_census_variables() with interactive discovery, but not necessarily using shiny. An approach using htmltools with visuals and search should suffice.

mountainMath commented 4 years ago

How about just keeping it under search_census_vectors() and adding the option parameter? We could add a third option that just does the old way of searching, as an easy way to update code for backward compatibility.

As to the shiny app, I was mostly joking. But how about just linking explore_census_variables() to the CensusMapper API tool (and make it open the variables tab). Could do the same thing for the region selection. And have the variable selection and region selection as copy-paste code up top on each of these tabs, instead of just the overview tab. And maybe clean up the GUI a bit. Seems like that would be easier and possibly just as useful.

dshkol commented 4 years ago

I'd prefer to branch off into a separate function and keep the other one for backward compatibility only. If we don't care about backward compatibility then I'd just rewrite it from scratch as described above.

+1 on just opening the API tool page instead. That's simpler, and we can add it in the next update.
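A minimal sketch of that simpler version, using base R's `utils::browseURL()`. The URL pattern and the `#api_variable` fragment for the variables tab are assumptions about the CensusMapper API tool, and the `browse` argument is a hypothetical addition so the function can be called in non-interactive sessions.

```r
# Hypothetical sketch: open the CensusMapper API tool for a dataset.
# The URL pattern and fragment below are assumptions, not confirmed here.
explore_census_variables <- function(dataset = "CA16",
                                     browse = interactive()) {
  url <- paste0("https://censusmapper.ca/api/", dataset, "#api_variable")
  if (browse) utils::browseURL(url)
  invisible(url)
}

url <- explore_census_variables("CA16", browse = FALSE)
```

Returning the URL invisibly keeps the call quiet at the console while still letting scripts capture or print the link.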

mountainMath commented 4 years ago

Let's do a separate function then. It will break some old code if we replace it.

dshkol commented 4 years ago

This was added in #143

mountainMath / cancensus

Better vector search handling of key words #141