Recalled census variables

mountainMath commented 2 years ago

StatCan has recalled several variables from the 2021 data release, and this has to be propagated through to CensusMapper and the cancensus package. Similar issues arose in the 2016 release, and the CensusMapper API as well as the cancensus package now have built-in functionality for data versioning and better metadata for locally cached data, which allows for targeted recall of data.

But we have not established a clear workflow for doing this. One possible workflow is:

1. Immediately (or as fast as possible) set the relevant variables in the CensusMapper database to NA and bump the API data version that is sent with every API call.
2. When corrected data becomes available, update the CensusMapper database and bump the API data version again.

This will

Immediately invalidate any live CensusMapper maps that use the affected variables
Automatically fix any live CensusMapper maps that use the affected variables.

What's still missing is clear logic to handle the data cached via cancensus. The package now keeps data version and metadata information, which can be viewed via list_cancensus_cache(). At this point the cache needs to be invalidated manually, but the package should get updated to do this automatically. There are several ways to do this.

1. Invalidate all data that is tagged with the old version. This is simplest to implement, but will lead to unnecessary data downloads and might cause problems for people who carefully manage their quota.
2. Hard-code what data needs to be recalled into the package and update the package. This is reasonably easy to implement, but requires a package update for every data recall, and it only takes effect once the user updates the package.
3. Add more metadata to the server that encodes exactly which variables or which parts of the data need to be invalidated. More complex to implement, but long-term probably the best solution.
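To make the trade-off between a blanket purge and a targeted purge concrete, here is a minimal sketch on a toy cache index. The column names, file names, version strings and vector names are all made up for illustration and are not the package's actual cache metadata.

```r
# Toy cache index; every column and value here is hypothetical.
cache <- data.frame(
  file    = c("a.rda", "b.rda", "c.rda", "d.rda"),
  version = c("d.1", "d.1", "d.1", "d.2"),
  vector  = c("v_A", "v_B", "v_C", "v_A"),
  stringsAsFactors = FALSE
)

# Blanket approach: purge everything tagged with the old version.
purge_by_version <- function(cache, old_version) {
  cache$file[cache$version == old_version]
}

# Targeted approach: purge only entries that a server-side list names.
purge_by_recall <- function(cache, recall) {
  key <- paste(cache$version, cache$vector)
  cache$file[key %in% paste(recall$version, recall$vector)]
}

# Hypothetical recall list naming a single vector.
recall <- data.frame(version = "d.1", vector = "v_B", stringsAsFactors = FALSE)

blanket  <- purge_by_version(cache, "d.1")
targeted <- purge_by_recall(cache, recall)
```

In this toy case the blanket purge drops three cached files, each of which would cost quota to download again, while the targeted purge drops only the one that was actually recalled.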

mountainMath commented 2 years ago

A compromise would be to have the package emit a warning if the data API version changes, alerting the user to the possibility of recalled data, with a link to a script they can run to purge that data from their cache (and update the data API version on non-recalled data). It's a compromise between implementing 2. and 3. in that the code that would otherwise go into an updated package for targeted cache purges could just be posted on GitHub, and doing this with a warning rather than automatically keeps users in control of their cache.
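A minimal sketch of such a version-change warning, assuming the cached data version and the version reported by the API are both available as strings. The helper name `check_data_version()` and the version strings are hypothetical.

```r
# Hypothetical helper: compare the data version stored with the cache
# against the one reported by the API and warn on a mismatch.
check_data_version <- function(cached_version, api_version) {
  changed <- !identical(cached_version, api_version)
  if (changed) {
    warning(
      "CensusMapper data version changed from ", cached_version,
      " to ", api_version,
      "; some cached data may have been recalled.",
      call. = FALSE
    )
  }
  invisible(changed)
}

# Capture the warning text instead of printing it, to show what a caller sees.
msg <- tryCatch(
  check_data_version("d.1", "d.2"),
  warning = function(w) conditionMessage(w)
)
```

Because this only warns, nothing is deleted until the user decides to act.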

dshkol commented 2 years ago

I wonder what the CRAN protocol is for having something that reads from a real-time updated list of data issues and clarifications in an .onLoad or .onAttach call.

Let's say that the first time you load the package in a session, it reads from an external list we maintain with notifications like this. Sounds like something CRAN would dislike, though.
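The "first time in a session" part can be sketched independently of where the call ends up living. The example below keeps the fetched list in an environment so that the fetch runs at most once per session. The names `.session_state`, `notices_once()` and `fake_fetch()` are hypothetical, the fetcher is a stand-in that makes no network call, and whether CRAN accepts a remote read at load time is not addressed here.

```r
# Session-level state; in a package this would live in the package namespace.
.session_state <- new.env(parent = emptyenv())

# Run `fetch` at most once per session and reuse its result afterwards.
# `fetch` stands in for whatever reads the externally maintained list.
notices_once <- function(fetch) {
  if (is.null(.session_state$notices)) {
    .session_state$notices <- fetch()
  }
  .session_state$notices
}

# Stand-in fetcher that counts how often it is called.
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  "Example notice: some variables were recalled."
}

first  <- notices_once(fake_fetch)
second <- notices_once(fake_fetch)  # served from the session cache
```

Deferring the fetch to the first real function call, rather than doing it in a load hook, would keep loading the package free of network access.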

mountainMath commented 2 years ago

Current plan is:

1. Implement a recall API on CensusMapper with a list of the data that is recalled. It's a CSV with columns:

api_version: version of the data that is being recalled, either d. for data or g. for geographies
dataset: dataset identifier for dataset affected
level: optional geographic level (e.g. CT, DA, ...) for which the data is being recalled, applies to all levels if null
vector: optional vector that is being recalled, only applies to data and not to geographies.

This is now implemented on the server at https://censusmapper.ca/api/v1/recall.csv

2. The first time in each session get_census gets called it will download the newest recall data from the CensusMapper server and check against the local database to see if there is cached data that has been recalled. If yes, it emits a warning.
3. During each get_census call that hits the cache it checks against the recall database and emits a warning if recalled data is being queried. This could happen because either the user did not act on the initial warning, or because there is cached data from earlier versions of CensusMapper that did not keep version information and a database of cached data.
4. Implement a function to check for recalled data.
5. Implement a function to wipe all recalled data.
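The matching logic behind this plan can be sketched as follows. It rests on several assumptions: that an empty `level` or `vector` field acts as a wildcard, that `d.` and `g.` are prefixes of `api_version`, and that a cache entry is hit when its version equals the recalled one. The rows, the cache layout and the helpers `matches_or_any()` and `find_recalled()` are hypothetical. The real recall list is served at https://censusmapper.ca/api/v1/recall.csv.

```r
# Hypothetical rows in the documented column layout.
recall_csv <- "api_version,dataset,level,vector
d.1,CA21,,v_A
d.1,CA21,CT,v_B
"

# Read everything as character and turn empty fields into NA.
recall <- read.csv(text = recall_csv, colClasses = "character",
                   na.strings = "", stringsAsFactors = FALSE)

# TRUE where `value` matches `pattern`, treating an NA `pattern` as a wildcard.
matches_or_any <- function(value, pattern) {
  is.na(pattern) | (!is.na(value) & value == pattern)
}

# Return the rows of `cache` hit by at least one row of `recall`.
find_recalled <- function(cache, recall) {
  hit <- rep(FALSE, nrow(cache))
  for (i in seq_len(nrow(recall))) {
    r <- recall[i, ]
    hit <- hit | (
      cache$api_version == r$api_version &
        cache$dataset == r$dataset &
        matches_or_any(cache$level, r$level) &
        matches_or_any(cache$vector, r$vector)
    )
  }
  cache[hit, , drop = FALSE]
}

# Toy cache metadata; the layout is an assumption, not the package's own.
cache <- data.frame(
  api_version = c("d.1", "d.1", "d.1", "d.2"),
  dataset     = "CA21",
  level       = c("CT", "DA", "DA", "CT"),
  vector      = c("v_A", "v_A", "v_B", "v_A"),
  stringsAsFactors = FALSE
)

flagged <- find_recalled(cache, recall)
```

Here the first recall row has no level, so it flags `v_A` at both CT and DA for version `d.1`. The second row is limited to CT and therefore leaves the DA entry for `v_B` alone, and the `d.2` entry is untouched.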

mountainMath commented 2 years ago

Pushed some functionality for 4. and 5. To test, first grab some data that has been recalled via:

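# Assumes a CensusMapper API key is already configured for cancensus.
library(cancensus)
library(dplyr)

level <- "CSD"
regions <- list(CMA = "59933")

# Take the first "decile" match and expand it to its leaf vectors.
deciles_2021 <- find_census_vectors("decile", "CA21", "Total") |>
  slice(1) |>
  child_census_vectors(leaves_only = TRUE)

# Download (and cache) the affected data, naming columns by vector label.
data_2021 <- get_census("CA21", regions = regions,
                        vectors = setNames(deciles_2021$vector, deciles_2021$label),
                        level = level)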

Then list_recalled_cached_data() lists cached data that has been recalled and remove_recalled_chached_data() removes it from the local cache.

Additionally, get_recalled_database() grabs the recall data from the CensusMapper server and caches it for the duration of the session.

New functions relating to data recall are in the recalled_data.R file.

mountainMath commented 2 years ago

What's still missing is

Call list_recalled_cached_data() at the beginning of each get_census call and emit a warning if the returned tibble has a non-zero row count.
Do another check when hitting the cache in get_census. This will catch instances where data was cached before the newer data versioning was implemented, as well as instances where the initial warning was ignored. Emit another warning if cached data used in the call has been recalled. Maybe add some colour or something to make it more visible.
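A minimal sketch of such a warning, assuming the check yields a data frame with one row per recalled cache entry. The helpers `bold_red()` and `warn_if_recalled()` are hypothetical. The colour uses plain ANSI escape codes, which only render in terminals that support them.

```r
# Wrap text in ANSI bold red; consoles without ANSI support show the raw codes.
bold_red <- function(x) paste0("\033[1;31m", x, "\033[0m")

# Hypothetical helper: warn if `recalled` (a data frame of recalled cache
# entries) has any rows; returns the row count invisibly.
warn_if_recalled <- function(recalled) {
  n <- nrow(recalled)
  if (n > 0) {
    warning(
      bold_red(paste0(n, " cached dataset(s) have been recalled. ")),
      "Consider removing them from the local cache.",
      call. = FALSE
    )
  }
  invisible(n)
}

# Stand-in for the result of a recall check with two affected entries.
recalled <- data.frame(dataset = c("CA21", "CA21"), level = c("CT", "DA"))

msg <- tryCatch(
  warn_if_recalled(recalled),
  warning = function(w) conditionMessage(w)
)
```

An empty data frame produces no warning, so the same call can run on every get_census invocation without adding noise.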

mountainMath commented 2 years ago

Both 2. and 3. are implemented now.

mountainMath commented 2 years ago

Addressed in upcoming v0.5.3.

mountainMath / cancensus

Recalled census variables #179