Open bdbmax opened 1 year ago
Sigh. What a mess. Short term I think your best bet is a fuzzy match, e.g. using the fuzzyjoin package, although that comes with risks.
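As a rough illustration of the short-term fuzzy-match idea, here is a minimal sketch. It swaps in base R's `utils::adist` (generalized Levenshtein distance) for the fuzzyjoin package so that it runs without extra dependencies. The zone names, the `max_dist` threshold and the column names are made up for illustration and are not real `get_cmhc` output.

```r
# Hypothetical zone names from two survey years (illustrative only)
names_2015 <- c("Plateau Mont Royal", "Villeray", "Rosemont")
names_2016 <- c("Plateau-Mont-Royal", "Villeray", "Rosemont", "Verdun")

# Edit distance between every 2016 name (rows) and every 2015 name (columns)
d <- utils::adist(names_2016, names_2015, ignore.case = TRUE)

# For each 2016 name, pick the closest 2015 name
closest <- apply(d, 1, which.min)
candidates <- data.frame(
  name_2016 = names_2016,
  name_2015 = names_2015[closest],
  distance  = d[cbind(seq_along(names_2016), closest)],
  stringsAsFactors = FALSE
)

# Assumed threshold: drop matches that are too far apart to be trusted
max_dist <- 2
candidates$name_2015[candidates$distance > max_dist] <- NA

print(candidates)
```

Rows with a non-zero distance are the ones worth inspecting by hand, which is where the risk mentioned above comes in: a small edit distance does not prove that two names refer to the same zone.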
Medium term, I could see building up a correspondence table to harmonize names across years, where one would manually inspect fuzzy match results to try to validate matches. This could then be used to automatically harmonize names under the hood.
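A correspondence table of this kind could be as simple as a two-column lookup. The sketch below is only an assumption about its shape: the column names `raw_name` and `harmonized_name`, the helper `harmonize_names` and the example rows are hypothetical, not part of the package.

```r
# Hypothetical, manually validated correspondence table (illustrative rows)
correspondence <- data.frame(
  raw_name        = c("Plateau Mont Royal", "Plateau-Mont-Royal"),
  harmonized_name = c("Plateau-Mont-Royal", "Plateau-Mont-Royal"),
  stringsAsFactors = FALSE
)

# Replace each name by its harmonized version; names that are not in the
# table are returned unchanged
harmonize_names <- function(x, table) {
  idx <- match(x, table$raw_name)
  ifelse(is.na(idx), x, table$harmonized_name[idx])
}

harmonize_names(c("Plateau Mont Royal", "Verdun"), correspondence)
```

Leaving unknown names untouched keeps the lookup safe to apply to every year, including years the table has not been validated against yet.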
Long term, I will check with CMHC to see if one can get better matching tables, maybe an extract linking the zone names to internal stable identifiers for each year that then can be used to build a correspondence table. Or maybe even extend the data tables returned in the HMIP to include geographic identifiers.
I've built up a correspondence table using 2016-2021 data (can easily change the date range on my function and look further), and St-Lin-des-Laurentides is the only one with any change.
And the "change" is down to encoding errors.
The first one is an erroneous code point, where the Windows-1252 byte value is used as if it were a Unicode code point (see: https://stackoverflow.com/questions/24500162/display-u0096-in-a-jsp).
The second is the correct character, and the third is an en-dash.
Perhaps a simpler solution would be to replace \u0096 and the en-dash with the ordinary U+002D code point (hyphen-minus), with a str_replace.
It'll solve the St-Lin problem, and possibly other dash problems.
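A minimal sketch of that replacement, using base R's `gsub` rather than stringr so it runs without extra packages. The helper name `normalize_dashes` and the three sample strings are hypothetical; in particular, which dash position actually varies in the real data is an assumption here.

```r
# Map dash-like characters in zone names to the ASCII hyphen-minus (U+002D)
normalize_dashes <- function(x) {
  # U+0096: control character left behind when the Windows-1252 en dash
  #         byte (0x96) is treated as a Unicode code point
  # U+2013: the real en dash
  gsub("[\u0096\u2013]", "-", x)
}

# Illustrative variants of the same name (not copied from real output)
zone_names <- c(
  "St\u0096Lin-des-Laurentides",
  "St-Lin-des-Laurentides",
  "St\u2013Lin-des-Laurentides"
)

unique(normalize_dashes(zone_names))
```

With stringr, `str_replace_all` would be the closer equivalent, since `str_replace` only replaces the first match in each string.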
Hello!
When getting data through `get_cmhc` for different years, the name of what seems to be the same survey zone can differ over time; here's an example.

The naming for le Plateau in Montreal changes over time. Up to and including 2015 there was no hyphen, and after 2015 the hyphen appeared. I believe this is the same zone, but there's no way to really be sure. The description of the `get_cmhc_geography` function states that the geographic data corresponds to an extract from 2017, and that it won't necessarily match regions from other years. Could a year argument be added to the `get_cmhc_geography` function, letting us match names to spatial polygons for each individual year? Then, year over year, we could match the actual zones rather than names that might differ by a single character (in the hypothetical case that this is indeed the same survey zone).

Here is another example of names differing in the data, and a zone disappearing in some years:
Maybe the zone just has a different naming in some years?
I think getting the survey zone geography for every year, if at all possible, would be the best way to fix these non-matching names. These zones also have a `METZONE_UID` in the output of `get_cmhc_geography`, which would help identify which spatial zone a row of data belongs to, if that code were also in the output of `get_cmhc`. But having seen the content of the `httr::POST` call, I understand there's only a name in that table to identify the zone, and as stated, this name isn't constant over the years.

I understand CMHC data isn't super easy to work with! From your experience working with it, do you see a possibility to solve this problem? The only options I can think of are either getting spatial polygons of the zones for every year (which would be very reliable), or merging years of data by name using the closest string match (less reliable).
Thanks!