mountainMath / cancensus

R wrapper for calling CensusMapper APIs
https://mountainmath.github.io/cancensus/index.html

Better search #143

Closed dshkol closed 4 years ago

dshkol commented 4 years ago

Addresses discussion in #141

Major changes:

Query the available list of Census vectors based on their label and return details including vector code. Default search behaviour expects an exact match, but keyword or semantic searches can be used instead by setting query_type = "keyword" or query_type = "semantic". Keyword search is useful when looking to explore Census vectors based on broad themes like "income" or "language". It separates the query into unigrams and returns Census vectors with matching words, ranked by incidence of matches. Semantic search is designed for more precise searches while allowing room for error in spelling or phrasing, as well as for finding closely related vector matches. It separates the query into n-grams and relies on string distance measurement using a generalized Levenshtein distance approach.
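
To make the keyword ranking concrete, here is a minimal base R sketch of the idea: split the query into unigrams, count how many distinct unigrams occur in each label, and rank by that count. The function name `keyword_search()`, the toy labels, and the substring matching rule are assumptions for illustration only, not the code in this PR.

```r
# Toy labels standing in for the real Census vector list (hypothetical data).
labels <- c(
  "Total - Commuting duration",
  "Commuting destination",
  "Median total income",
  "Duration of residence"
)

# Hypothetical helper: rank labels by the number of distinct query unigrams
# they contain.
keyword_search <- function(query, labels) {
  # Split the query into unique lower-case unigrams.
  unigrams <- unique(strsplit(tolower(query), "[^[:alnum:]]+")[[1]])
  unigrams <- unigrams[nzchar(unigrams)]

  # For each label, count how many of the unigrams occur in it as substrings.
  hits <- vapply(
    tolower(labels),
    function(lbl) sum(vapply(unigrams, grepl, logical(1), x = lbl, fixed = TRUE)),
    integer(1),
    USE.NAMES = FALSE
  )

  # Keep labels with at least one match, most matches first.
  out <- data.frame(label = labels, matches = hits, stringsAsFactors = FALSE)
  out <- out[out$matches > 0, , drop = FALSE]
  out[order(-out$matches), , drop = FALSE]
}

res <- keyword_search("commuting duration", labels)
# Labels matching the largest number of unigrams form the "top" result set.
top <- res[res$matches == max(res$matches), , drop = FALSE]
print(res)
```

Splitting the ranked table into a top tier and a remainder is what would allow a caller to offer the lower-ranked matches separately.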

Some census vectors return population counts segmented by Female and Male populations, in addition to a total aggregate. By default, queries return matches for the Total aggregation, but they can optionally return only the Female or Male aggregations by adding type = "female" or type = "male" as a parameter.
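
As a rough illustration of how a `type` argument with a Total default might be applied to a table of matches, the sketch below uses `match.arg()` so that the first choice acts as the default. The helper `filter_by_type()` and the toy table are hypothetical; only the `type` values come from the description above.

```r
# Toy match table (hypothetical data) with a `type` column like the one
# shown in the search output.
matches <- data.frame(
  vector = c("v_a", "v_b", "v_c"),
  type   = factor(c("Total", "Female", "Male")),
  label  = "Less than 15 minutes",
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only rows for the requested aggregation.
filter_by_type <- function(matches, type = c("total", "female", "male")) {
  type <- match.arg(type)  # first choice ("total") is the default
  matches[tolower(as.character(matches$type)) == type, , drop = FALSE]
}

default_rows <- filter_by_type(matches)
female_rows  <- filter_by_type(matches, type = "female")
```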

A few other notes:

  1. Calling find_census_vectors() with parameter query_type = 'keyword' will check the frequency of unique unigram matches and retrieve the vectors with the highest number of matches. If there are additional matches with a lower number of unique unigram matches, the function will also prompt the user with a menu to see the remaining options. The downside is that this interactivity might break scripts that use this type of search. Open to changing this. Example below:
find_census_vectors('commuting duration', dataset = 'CA16', type = 'female', query_type = 'keyword')

# A tibble: 6 x 4
  vector   type   label                                          details                                        
  <chr>    <fct>  <chr>                                          <chr>                                          
1 v_CA16_… Female Total - Commuting duration for the employed l… 25% Data; Commute; Total - Commuting duration …
2 v_CA16_… Female Less than 15 minutes                           25% Data; Commute; Total - Commuting duration …
3 v_CA16_… Female 15 to 29 minutes                               25% Data; Commute; Total - Commuting duration …
4 v_CA16_… Female 30 to 44 minutes                               25% Data; Commute; Total - Commuting duration …
5 v_CA16_… Female 45 to 59 minutes                               25% Data; Commute; Total - Commuting duration …
6 v_CA16_… Female 60 minutes and over                            25% Data; Commute; Total - Commuting duration …

There are 12 additional keyword matches with less precision. Show more? 

1: Yes
2: No

Selection: 2
Showing top 6 results only
# A tibble: 6 x 4
  vector   type   label                                          details                                        
  <chr>    <fct>  <chr>                                          <chr>                                          
1 v_CA16_… Female Total - Commuting duration for the employed l… 25% Data; Commute; Total - Commuting duration …
2 v_CA16_… Female Less than 15 minutes                           25% Data; Commute; Total - Commuting duration …
3 v_CA16_… Female 15 to 29 minutes                               25% Data; Commute; Total - Commuting duration …
4 v_CA16_… Female 30 to 44 minutes                               25% Data; Commute; Total - Commuting duration …
5 v_CA16_… Female 45 to 59 minutes                               25% Data; Commute; Total - Commuting duration …
6 v_CA16_… Female 60 minutes and over                            25% Data; Commute; Total - Commuting duration …
  2. Semantic search works but it is not great. Working with NLP search is difficult, especially as I wanted to keep everything in base R and not add any NLP or string packages. It's not bad for bridging the gap between exact search and searches with a bit less precision, but it's not exactly Google search either.

  3. Exact search is a bit more precise than before. I've removed the fuzzy string matching because it was mostly returning false positives in the old version of the search function. The approach with query_type = "semantic" is a bit more accurate for finding close matches, but I've also reduced the tolerance there.

  4. The default is set to exact search, but this can be changed to keyword search.

  5. Exact search is very fast. Keyword search is pretty fast, and semantic search is a bit slower, but all are fine on the complete census vector dataset.
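
For readers curious what an n-gram plus Levenshtein search can look like with no packages beyond base R, here is a minimal sketch built on `utils::adist()`, which computes a generalized Levenshtein distance. It assumes the query is compared against word n-grams of each label, with n equal to the number of words in the query, and that a match is kept when the distance relative to query length is below a tolerance. The helper names, the tolerance of 0.25, and the way strings are sliced are all assumptions; the PR may do this differently.

```r
# Hypothetical helper: lower-case a string and split it into words.
split_words <- function(text) {
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words[nzchar(words)]
}

# Hypothetical helper: all runs of n consecutive words in a string.
word_ngrams <- function(text, n) {
  words <- split_words(text)
  if (n < 1 || length(words) < n) return(character(0))
  vapply(
    seq_len(length(words) - n + 1),
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Hypothetical search: keep labels whose closest n-gram is within
# `max_rel_dist` edits per query character, closest first.
semantic_search <- function(query, labels, max_rel_dist = 0.25) {
  q_words <- split_words(query)
  q <- paste(q_words, collapse = " ")
  scores <- vapply(labels, function(lbl) {
    grams <- word_ngrams(lbl, length(q_words))
    if (length(grams) == 0) return(Inf)
    # adist() returns a matrix of edit distances; take the closest n-gram.
    min(utils::adist(q, grams)) / nchar(q)
  }, numeric(1), USE.NAMES = FALSE)
  keep <- which(scores <= max_rel_dist)
  keep <- keep[order(scores[keep])]
  data.frame(label = labels[keep], distance = scores[keep],
             stringsAsFactors = FALSE)
}

labels <- c(
  "Total - Commuting duration for the employed labour force",
  "Median total income",
  "Knowledge of official languages"
)
typo_res  <- semantic_search("comuting duraton", labels)
exact_res <- semantic_search("median total income", labels)
```

Lowering `max_rel_dist` trades recall for fewer false positives, which is the kind of tolerance adjustment mentioned in the notes above.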

I've also fixed up references to search in the documentation but need to clean it up in the vignettes etc., if we want to incorporate this PR. A vignette just for searching might be useful too.

mountainMath commented 4 years ago

I like it. We can also add a parameter interactive = TRUE that can be set to FALSE in case one wants to script this.
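
One way such a switch could behave is sketched below: the prompt is shown only when the caller allows it and R is actually running interactively, otherwise the top results are returned without blocking. The helper `show_matches()` and its arguments `top` and `rest` are hypothetical and only illustrate the suggestion; `utils::menu()` and `base::interactive()` are standard R functions.

```r
# Hypothetical helper: `top` holds the best matches, `rest` the lower-ranked ones.
show_matches <- function(top, rest, interactive = TRUE) {
  if (nrow(rest) == 0) return(top)

  # menu() errors in non-interactive sessions, so check both the caller's
  # flag and the session itself before prompting.
  if (interactive && base::interactive()) {
    choice <- utils::menu(
      c("Yes", "No"),
      title = sprintf(
        "There are %d additional keyword matches with less precision. Show more?",
        nrow(rest)
      )
    )
    if (choice == 1) return(rbind(top, rest))
  }

  message("Showing top ", nrow(top), " results only")
  top
}

# Toy data (hypothetical) and a scripted, prompt-free call.
top  <- data.frame(label = c("a", "b"), stringsAsFactors = FALSE)
rest <- data.frame(label = c("c", "d", "e"), stringsAsFactors = FALSE)
scripted <- show_matches(top, rest, interactive = FALSE)
```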

I will rework things a bit on the CensusMapper side of things to link directly to the variable exploration tab for a given census geography.

dshkol commented 4 years ago

I've commented out the new muni_status function that looks up the municipality status codes because the French characters cause check issues for non-ASCII characters. Do you recall what we did in cansim for that?

mountainMath commented 4 years ago

Yeah, non-ASCII characters are a mess in R. For {cansim} we pasted in the UTF-8 codes, e.g. intToUtf8(0x00CE) for Î. It's a total pain. An alternative might be to include a CSV as data in the package and read that in? Although I'm not sure if that would get flagged by the CRAN police too.
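
As a small illustration of the workaround described above, the sketch below builds a string containing Î while keeping the source file pure ASCII, first with `intToUtf8()` and then with a `\u` escape in a string literal. The example word and variable names are made up for illustration.

```r
# intToUtf8() turns Unicode code points into a UTF-8 string at run time,
# so no non-ASCII byte has to appear in the source file.
i_circumflex <- intToUtf8(0x00CE)      # capital I with circumflex
ile <- paste0(i_circumflex, "le")      # hypothetical example word

# A \u escape inside a string literal is another ASCII-only spelling.
ile_escaped <- "\u00cele"

# Compare the two spellings by their code points.
same_points <- identical(utf8ToInt(ile), utf8ToInt(ile_escaped))
```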

dshkol commented 4 years ago

See new branch dev which has both new feature branches rolled into one