Closed estebanlp closed 2 years ago
So, this is an interesting example and I appreciate the time you've taken to go through it. A couple things:
extensive = FALSE
rather than extensive = TRUE
as you have done. This will give you a weighted mean rather than a weighted sum, which is what you have here (which doesn't make sense in this instance). extensive = FALSE
, we get a more sensible value for the location but it still is borrowing from neighboring tracts due to the issue that you raise - block centroids falling outside the blocks themselves:While interpolate_pw()
converts polygon weights like blocks to centroids internally, it also allows you to "bring your own" weight placement methodology if you want. For example, an analyst might first want to erase water area from blocks before using as interpolation weights. In this example, we can first convert the blocks to points using st_point_on_surface()
which will ensure that the block points fall within the block polygons (though not necessarily in the center of them):
bexar_blocks <- blocks(state = "TX", county = "Bexar", year = 2020) %>%
st_point_on_surface()
interpolate_pw()
will then use those points as weights instead. This produces the expected result:
This raises an interesting implementation question, then, as how to proceed, and I'd welcome your thoughts. We could do one of the following:
st_point_on_surface()
instead of st_centroid()
internally for block point placement;Hi, Thank you for taking the time to review this.
extensive=T
instead of extensive=F
st_point_on_surface()
instead of st_centroid()
makes more sense for the general case. In my opinion, there is lots of cases where people will run in the similar problem as I have. I cannot think of a case where the bias incurred by not using the centroid will be large enough to force using st_centroid()
. I imagine that bias to be a positive function to the area size of census blocks. Maybe an option to minimize this bias is to use st_point_on_surface()
only for those centroids falling outside of the census tract, and use st_centroid()
for the rest?Best,
Esteban
@estebanlp you make some very good points. I've evaluated an alternative example from my book (https://walker-data.com/census-r/spatial-analysis-with-us-census-data.html#catchment-areas-with-buffers-and-isochrones) where the method is used to aggregate data to a buffer zone, and the two methods do return slightly different results. The primary use case for tidycensus users, however, will be similar to yours when Census units are both used for from
and to
, which is where you don't want the weight points to fall outside the polygon.
I think I'll implement my second proposed option above, allowing users to choose a geometry function between st_point_on_surface()
and st_centroid()
for the weights, defaulting to st_point_on_surface()
. This will ensure that blocks with odd shapes intersecting the input shapes are more likely to be included in the weighting process.
@estebanlp this is now implemented; I've just swapped in st_point_on_surface()
for st_centroid()
. I think your example is significant enough that we want to try to avoid it as much as possible. Users can always convert weight polygons to centroids themselves if they prefer centroids.
As an aside for anyone coming across this issue: the current CRAN release actually includes a weight_placement
argument that can be set to either "surface"
(the default) for point-on-surface or "centroid"
. I figured for posterity's sake and consistency with other implementations (e.g. Esri) it'd be good to have that option.
Hi Kyle,
I have a census tract (
120100
) at origin (2015) with NA value in theB25077_001
variable (median home value). The same census tract in 2020 is almost a perfect fit (i.e. no real change)in terms of geometry. The census tract is a military base (Forth Sam Houston in San Antonio). The census tract in 2020 has an estimated value of208300
(although is a military base).How is it possible to get an interpolated value for 2015 to 2020 geometries of
6296.249
if the same census tract had NA value in the origin layer? It seems that the interpolate_pw() function is drawing values from nearby census tracts in 2015?According to realtor.com, there are several apartment units for sale in the Forth Sam Houston area, but I check that these units fall outside of the census tract geometries in 2015 and 2020. (https://www.realtor.com/realestateandhomes-search/Fort-Sam-Houston_San-Antonio_TX?abButtonId=0&view=map&pos=29.46584,-98.449341,29.46016,-98.444148,18)
The step that creates the centroid could be behind this problem. There are four centroids that fall outside of the census tract. One is in a park, but the others fall in a different census tracts with non-NA values. If the location of the centroid is used to draw the values of the 2015 layer, then this could be explaining why there is an interpolated value for 2015 in a census tract that has NAs.
I pasted below the script that reproduces the result and the issue.
Thank you for your time!
Esteban
------------------ Script starts here -------------------------------
library(tidycensus) library(data.table) library(tidyverse) library(sf)
census_api_key("...")# you must acquired your own key at http://api.census.gov/data/key_signup.html
bexar_homevalue_15 <- get_acs(geography = "tract", variables = "B25077_001", state = "TX", county = "Bexar", geometry = TRUE,year = 2015)%>% select(estimate) %>% st_transform(crs = 2847) bexar_homevalue_20 <- get_acs(geography = "tract", variables = "B25077_001", state = "TX", county = "Bexar", geometry = TRUE,year = 2020)%>% st_transform(crs = 2847)
library(tigris) bexar_blocks <- blocks(state = "TX",county = "Bexar", year = 2020)
bexar_interpolate_pw <- interpolate_pw( bexar_homevalue_15, bexar_homevalue_20, to_id = 'GEOID', extensive = TRUE, weights = bexar_blocks, weight_column = "POP20", crs = 2847 )%>%st_drop_geometry()
bexar_socioeconomic<-merge(x = bexar_homevalue_20,y = bexar_interpolate_pw,by="GEOID",sort=F)
bexar_socioeconomic$home_value_growth<-round((bexar_socioeconomic$estimate.x/bexar_socioeconomic$estimate.y)-1,2)
library(viridis) library(ggplot2) ggplot(bexar_socioeconomic,aes(fill=home_value_growth))+ geom_sf(color=NA)+ scale_fill_viridis()
there is a census tract with 3000% value growth. This is because there are census tracts like 120100 that had NA values in 2015, but non-NA values in 2020.
bexar_homevalue_15 <- get_acs(geography = "tract", variables = "B25077_001", state = "TX", county = "Bexar", geometry = TRUE,year = 2015)
ggplot()+ geom_sf(data = st_centroid(bexar_blocks[bexar_blocks$TRACTCE20=="120100",]),aes(geometry=geometry),color='blue')+ geom_sf(data = bexar_homevalue_15[bexar_homevalue_15$GEOID==48029120100,],aes(geometry=geometry),fill=NA)+ geom_sf(data = bexar_homevalue_20[bexar_homevalue_20$GEOID==48029120100,],aes(geometry=geometry),color='red',fill=NA)
Although the two census tract boundaries align quite well, the interpolated value for 2015 is 6296.249, when the 2020 value is 208300. Beyond the fact that tract 120100 is a military base (Forth Sam Houston) and hence should not be in the census (?), how the interpolate_pw() function comes up with a home value of 6296.249 based on NA values for 2015?
library(leaflet)
leaflet() %>% addProviderTiles(providers$OpenStreetMap) %>% addPolygons(data = st_transform(bexar_blocks[bexar_blocks$TRACTCE20=="120100",],crs = 4326),fillColor = NA,weight = 0.5) %>% addCircles(data = st_centroid(st_transform(bexar_blocks[bexar_blocks$TRACTCE20=="120100",],crs = 4326)),fillColor = NA) %>% addPolygons(data=st_transform(bexar_homevalue_20[bexar_homevalue_20$GEOID==48029120100,],crs = 4326),color = NA,fillColor = NA)
I later forced that all NA values in 2015 would also be NA (Not sure if this is all correct, because I didn't check each one of the 9 NA values in 2015)
filtering the values that should be NAs
estimate_15_nas<-bexar_homevalue_15[is.na(bexar_homevalue_15$estimate),]$GEOID
View(bexar_socioeconomic[bexar_socioeconomic$GEOID%in%estimate_15_nas,]) bexar_socioeconomic[bexar_socioeconomic$GEOID%in%estimate_15_nas,]$estimate_mhv_15<-NA
bexar_socioeconomic$home_value_growth<-round((bexar_socioeconomic$estimate/bexar_socioeconomic$estimate_mhv_15)-1,2)*100
this is the correct plot
ggplot(bexar_socioeconomic,aes(fill=home_value_growth))+ geom_sf(color=NA)+ scale_fill_viridis()+ labs(title = "Bexar County Home Value Growth (HVG) by census tract",subtitle = "Based on 5 year ACS data estimates from 2015-2020",caption = "Values represent percentages of HVG")
although this map is correct for areas that should not have a home value (airport, military bases, etc.), it is probably overlooking areas that were not developed in 2015, but are new developments in 2020. I feel like there should be a option to account for this issues.
------------------ Script ends here -------------------------------