mountainMath / cmhc

Wrapper for hack into CMHC data

Survey Zones naming over time #10

Open bdbmax opened 1 year ago

bdbmax commented 1 year ago

Hello!

When fetching data with get_cmhc for different years, the name of what seems to be the same survey zone can differ over time. Here's an example:

plateau <- lapply(2015:2016, \(yr) {
  out <- cmhc::get_cmhc(survey = "Rms",
                        series = "Vacancy Rate",
                        dimension = "Rent Ranges",
                        breakdown = "Survey Zones",
                        geo_uid = 24462,
                        year = yr)
  out$`Survey Zones`[grepl("^Plateau", out$`Survey Zones`)]
})

print(unique(do.call(c, plateau)))

Output: [1] "Plateau Mont-Royal" "Plateau-Mont-Royal"

The name for Le Plateau in Montreal changes over time. Up to and including 2015 there was no hyphen; after 2015 the hyphen appears. I believe this is the same zone, but is there any way to be sure? The description of the get_cmhc_geography function states that the geographic data corresponds to an extract from 2017 and won't necessarily match regions from other years. Could a year argument be added to get_cmhc_geography, letting us match names to spatial polygons for each individual year? Then, year over year, we could match the actual zones rather than names that might differ by a single character (in the hypothetical case that this is indeed the same survey zone).

Here is another example of names differing in the data, and a zone disappearing in some years:

st_lin <- lapply(2016:2021, \(yr) {
  out <- cmhc::get_cmhc(survey = "Rms",
                        series = "Vacancy Rate",
                        dimension = "Rent Ranges",
                        breakdown = "Survey Zones",
                        geo_uid = 24462,
                        year = yr)
  out$`Survey Zones`[grepl("^Saint-Lin", out$`Survey Zones`)]
})

print(st_lin)

Output: 
[[1]]
character(0)

[[2]]
[1] "Saint-Lin\u0096Laurentides V" "Saint-Lin\u0096Laurentides V"
[3] "Saint-Lin\u0096Laurentides V" "Saint-Lin\u0096Laurentides V"
[5] "Saint-Lin\u0096Laurentides V" "Saint-Lin\u0096Laurentides V"
[7] "Saint-Lin\u0096Laurentides V"

[[3]]
character(0)

[[4]]
character(0)

[[5]]
character(0)

[[6]]
[1] "Saint-Lin-Laurentides V" "Saint-Lin-Laurentides V" "Saint-Lin-Laurentides V"
[4] "Saint-Lin-Laurentides V" "Saint-Lin-Laurentides V" "Saint-Lin-Laurentides V"
[7] "Saint-Lin-Laurentides V"

Maybe the zone just has a different name in some years?

I think getting the survey zone geography for every year, if at all possible, would be the best way to fix these non-matching names. These zones also have a METZONE_UID in the output of get_cmhc_geography, which would help identify which spatial zone a zone in the data corresponds to, if that code were also in the output of get_cmhc. But having seen the content of the httr::POST call, I understand there's only a name in that table to identify the zone, and as stated, this name isn't constant over the years.

I understand CMHC data isn't super easy to work with! From your experience working with it, do you see a possibility to solve this problem? The only options I can think of are either getting spatial polygons of the zones for every year (which would be very reliable), or merging years of data by name using the closest string match (less reliable).

Thanks!

mountainMath commented 1 year ago

Sigh. What a mess. Short term I think your best bet is a fuzzy match, e.g. using the fuzzyjoin package, although that comes with risks.
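
A minimal sketch of what such a fuzzy match could look like, for illustration only. It swaps in base R's `utils::adist` (edit distance) for fuzzyjoin so that it runs without extra packages. The helper name `match_zone_names` and the `max_dist` threshold are assumptions, not part of cmhc; the example names are the ones reported above.

```r
# Hypothetical helper (not part of cmhc): for each name in `x`, find the
# closest name in `ref` by edit distance and keep the match only if it is
# within `max_dist` edits.
match_zone_names <- function(x, ref, max_dist = 2) {
  d <- utils::adist(x, ref)                  # matrix: rows = x, cols = ref
  best <- apply(d, 1, which.min)             # index of the closest name in ref
  best_dist <- d[cbind(seq_along(x), best)]  # distance to that closest name
  data.frame(
    name = x,
    match = ifelse(best_dist <= max_dist, ref[best], NA_character_),
    dist = best_dist,
    stringsAsFactors = FALSE
  )
}

# Name variants taken from the outputs reported in this issue.
older <- c("Plateau Mont-Royal", "Saint-Lin\u0096Laurentides V")
newer <- c("Plateau-Mont-Royal", "Saint-Lin-Laurentides V")

res <- match_zone_names(older, newer)
print(res)
```

The risk mentioned above shows up in the threshold: with a small `max_dist`, two genuinely different zones whose names differ by a character or two would also be matched, so the results still need a manual look.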

Medium term, I could see this being solved by building up a correspondence table that harmonizes names across years, where one would manually inspect the fuzzy match results to validate the matches. The table could then be used to automatically harmonize names under the hood.
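
As a rough sketch of that idea, a correspondence table can be as simple as a two-column lookup applied with `match`. The names `zone_lookup` and `harmonize_zone_names` are hypothetical, and treating the hyphenated spellings as canonical is an arbitrary assumption made for this example only.

```r
# Hypothetical correspondence table: one row per observed name variant,
# mapped to the canonical name chosen after manual inspection.
zone_lookup <- data.frame(
  variant = c("Plateau Mont-Royal", "Saint-Lin\u0096Laurentides V"),
  canonical = c("Plateau-Mont-Royal", "Saint-Lin-Laurentides V"),
  stringsAsFactors = FALSE
)

# Replace known variants with their canonical name; leave other names as-is.
harmonize_zone_names <- function(x, lookup = zone_lookup) {
  idx <- match(x, lookup$variant)
  ifelse(is.na(idx), x, lookup$canonical[idx])
}

harmonized <- harmonize_zone_names(c(
  "Plateau Mont-Royal",
  "Plateau-Mont-Royal",
  "Saint-Lin\u0096Laurentides V"
))
print(harmonized)
```

Names that are not listed in the table pass through unchanged, so an incomplete table never drops data; it only leaves some variants unharmonized.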

Long term, I will check with CMHC to see if one can get better matching tables, maybe an extract linking the zone names to internal stable identifiers for each year that then can be used to build a correspondence table. Or maybe even extend the data tables returned in the HMIP to include geographic identifiers.

daniel-simeone commented 1 year ago

I've built up a correspondence table using 2016-2021 data (can easily change the date range on my function and look further), and St-Lin-des-Laurentides is the only one with any change.

And the "change" is down to encoding errors. [image]

The first one is an erroneous code point, where the Windows-1252 code point is used as if it were Unicode (see: https://stackoverflow.com/questions/24500162/display-u0096-in-a-jsp).

The second is the correct character, and the third is an en-dash.

Perhaps a simpler solution would be to replace \u0096 and the en-dash with the ordinary hyphen (code point U+002D), with a str_replace.

It'll solve the St-Lin problem, and possibly other dash problems.
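
A minimal sketch of that replacement, using base R's `gsub` rather than stringr so it has no dependencies. The helper name `normalize_dashes` is hypothetical, and the list of dash variants covers only the two characters mentioned above.

```r
# Hypothetical helper (not part of cmhc): map the dash-like characters
# discussed above to the ordinary ASCII hyphen (U+002D).
normalize_dashes <- function(x) {
  dash_variants <- c(
    "\u0096",  # C1 control character; byte 0x96 is the en dash in Windows-1252
    "\u2013"   # en dash
  )
  for (ch in dash_variants) {
    x <- gsub(ch, "-", x, fixed = TRUE)  # gsub replaces every occurrence
  }
  x
}

zones <- c(
  "Saint-Lin\u0096Laurentides V",
  "Saint-Lin\u2013Laurentides V",
  "Saint-Lin-Laurentides V"
)
normalized <- unique(normalize_dashes(zones))
print(normalized)
```

With stringr, `str_replace_all` would be the closer equivalent, since `str_replace` only replaces the first match in each string. Note that this does not address the Plateau case, where the difference is a space versus a hyphen.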