Closed dshkol closed 4 years ago
I like it. We can also add a parameter interactive=TRUE that can be set to FALSE in case one wants to script this.
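A rough sketch of how such a flag might gate a menu prompt, in base R only. The helper name show_more_matches and its behaviour are hypothetical, for illustration; this is not the cancensus implementation.

```r
# Hypothetical helper: decide what to do with lower-ranked matches.
# `extra` is a character vector of additional matches; `interactive` is the
# proposed flag (TRUE prompts the user, FALSE never prompts).
show_more_matches <- function(extra, interactive = TRUE) {
  if (length(extra) == 0) return(invisible(NULL))
  if (!interactive) {
    # Scripted use: report that more matches exist, but do not block on input.
    message(length(extra), " additional lower-ranked matches not shown.")
    return(invisible(NULL))
  }
  # utils::menu() returns the index of the chosen item, or 0 if the user exits.
  choice <- utils::menu(c("Show remaining matches", "Skip"),
                        title = "More matches are available")
  if (choice == 1) extra else invisible(NULL)
}

# In a script, the prompt is skipped entirely:
res <- show_more_matches(c("v_a", "v_b"), interactive = FALSE)
```

With interactive = FALSE the function never calls menu(), so it is safe in non-interactive sessions where menu() would otherwise raise an error.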
I will rework things a bit on the CensusMapper side to link directly to the variable exploration tab for a given census geography.
I've commented out the new muni_status function that looks up the municipality status codes because the French characters cause check issues for non-ASCII characters. Do you recall what we did in cansim for that?
Yeah, non-ASCII characters are a mess in R. For {cansim} we pasted in the UTF-8 codes, e.g. intToUtf8(0x00CE) for Î. It's a total pain. An alternative might be to include a CSV as data in the package and read that in? Although not sure if that would get flagged by the CRAN police too.
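For reference, a small sketch of the code-point approach in base R. The variable names and the example word are made up for illustration; the point is only that the source file itself can stay ASCII-only.

```r
# Build a non-ASCII character from its Unicode code point so that the
# source file contains only ASCII characters.
i_circ <- intToUtf8(0x00CE)      # U+00CE, capital I with circumflex

# Paste it into a longer string as needed (illustrative word only).
ile <- paste0(i_circ, "le")

# A \u escape inside a string literal is another ASCII-only spelling of the
# same character.
same <- identical(i_circ, "\u00ce")

# utf8ToInt() goes the other way, from character to code point.
code_point <- utf8ToInt(i_circ)
```

Either spelling keeps literal accented characters out of the R source, which is what the non-ASCII check looks at.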
See new branch dev, which has both new feature branches rolled into one.
Addresses discussion in #141
Major changes:
- search_census_vectors to return a warning
- explore_census_vectors() opens the CensusMapper API interactive tool in a webpage
- find_census_vectors() to replace search_census_vectors. More detail below:

Query the available list of Census vectors based on their label and return details including vector code. Default search behaviour expects an exact match, but keyword or semantic searches can be used instead by setting query_type = "keyword" or query_type = "semantic". Keyword search is useful when looking to explore Census vectors based on broad themes like "income" or "language". Keyword search separates the query into unigrams and returns Census vectors with matching words, ranked by incidence of matches. Semantic search is designed for more precise searches while allowing room for error in spelling or phrasing, as well as for finding closely related vector matches. Semantic search separates the query into n-grams and relies on string distance measurement using a generalized Levenshtein distance approach.

Some census vectors return population counts segmented by Female and Male populations, in addition to a total aggregate. By default, queries return matches for the Total aggregation, but can optionally return only the Female or Male aggregations by adding type = "female" or type = "male" as a parameter.

A few other notes:
- find_census_vectors() with parameter query_type = 'keyword' will check the frequency of unique unigram matches and retrieve the vectors with the highest number of matches. If there are additional matches with a lower number of unique unigram matches, the function will also prompt the user with a menu to see the remaining options. The downside is that this interactivity might break using this type of search in a script. Open to changing this.
- Semantic search works but it is not great. Working with NLP search is difficult, especially as I wanted to keep everything in base R and not add any NLP or string packages. It's not bad for bridging the gap a bit between exact search and searches with a bit less precision, but it's not exactly Google search either.
- Exact search is a bit more precise than before. I've removed the fuzzy string matching because it was mostly getting false positives in the old version of the search function. The approach with query_type="semantic" is a bit more accurate for finding close matches, but I've reduced the tolerance there as well.
- The default is set to exact search; this can be changed to keyword search.
- Exact search is very fast. Keyword search is pretty fast, and semantic search is a bit slower, but all are fine on the complete census vector dataset.
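To make the two non-exact search modes concrete, here is a toy sketch in base R. It is not the code in this PR: the data frame, the helper names (words_of, ngrams, keyword_search, semantic_search) and the max_dist tolerance are all assumptions for illustration. It only shows the general idea of ranking by unique unigram matches and of comparing n-grams with utils::adist(), which computes a generalized Levenshtein distance.

```r
# Toy stand-in for the vector list; codes and labels are made up.
vectors <- data.frame(
  vector = c("v_1", "v_2", "v_3", "v_4"),
  label  = c("Median total income", "Total income of households",
             "Knowledge of official languages", "Mother tongue"),
  stringsAsFactors = FALSE
)

# Lower-case a string and split it into words (ASCII letters/digits only).
words_of <- function(x) {
  w <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  w[nzchar(w)]
}

# Consecutive word n-grams; a label shorter than n is kept as a single gram.
ngrams <- function(words, n) {
  if (length(words) < n) return(paste(words, collapse = " "))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Keyword-style search: count unique query unigrams found in each label and
# rank labels by that count.
keyword_search <- function(query, vectors) {
  q <- unique(words_of(query))
  hits <- vapply(vectors$label,
                 function(l) sum(q %in% words_of(l)), integer(1))
  out <- vectors[hits > 0, , drop = FALSE]
  out$matches <- unname(hits[hits > 0])
  out[order(-out$matches), , drop = FALSE]
}

# Semantic-style search: compare the query against each label's n-grams and
# keep labels whose best edit distance, scaled by query length, is small.
semantic_search <- function(query, vectors, max_dist = 0.25) {
  qw <- words_of(query)
  if (length(qw) == 0) return(vectors[0, , drop = FALSE])
  q <- paste(qw, collapse = " ")
  best <- vapply(vectors$label, function(l) {
    min(utils::adist(q, ngrams(words_of(l), length(qw)))) / nchar(q)
  }, numeric(1))
  keep <- best <= max_dist
  out <- vectors[keep, , drop = FALSE]
  out$distance <- unname(best[keep])
  out[order(out$distance), , drop = FALSE]
}

kw  <- keyword_search("median income", vectors)
sem <- semantic_search("totl incme", vectors)
```

In this toy data, the keyword query ranks the label containing both words ahead of the label containing only one, and the misspelled semantic query still reaches the two "total income" labels. The tolerance here is arbitrary; a tighter value trades recall for fewer false positives.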
I've also fixed up references to search in the documentation but need to clean it up in the vignettes etc., if we want to incorporate this PR. A vignette just for searching might be useful too.