walkerke / tidycensus

Load US Census boundary and attribute data as 'tidyverse' and 'sf'-ready data frames in R
https://walker-data.com/tidycensus

interpolate_pw() creates non-NA values for geometries with NA values at origin #447

Closed · estebanlp closed this issue 2 years ago

estebanlp commented 2 years ago

Hi Kyle,

I have a census tract (120100) at origin (2015) with an NA value for variable B25077_001 (median home value). The same census tract in 2020 is an almost perfect fit in terms of geometry (i.e., no real change). The census tract is a military base (Fort Sam Houston in San Antonio). The census tract in 2020 has an estimated value of 208300 (although it is a military base).

How is it possible to get an interpolated value of 6296.249 when going from the 2015 to the 2020 geometries if the same census tract had an NA value in the origin layer? It seems that the interpolate_pw() function is drawing values from nearby census tracts in 2015.

According to realtor.com, there are several apartment units for sale in the Fort Sam Houston area, but I checked that these units fall outside of the census tract geometries in both 2015 and 2020. (https://www.realtor.com/realestateandhomes-search/Fort-Sam-Houston_San-Antonio_TX?abButtonId=0&view=map&pos=29.46584,-98.449341,29.46016,-98.444148,18)

The step that creates the centroids could be behind this problem. Four centroids fall outside of the census tract: one is in a park, but the others fall in different census tracts with non-NA values. If the location of the centroid is used to draw values from the 2015 layer, this could explain why there is an interpolated 2015 value for a census tract that has NAs.
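To illustrate the suspected behavior, here is a toy concave polygon (made-up coordinates, not the real tract geometry) where st_centroid() places the point outside the shape while st_point_on_surface() stays inside:

```r
library(sf)

# A concave "C"-shaped polygon, loosely mimicking an irregular census block
cshape <- st_polygon(list(rbind(
  c(0, 0), c(3, 0), c(3, 1), c(1, 1), c(1, 2),
  c(3, 2), c(3, 3), c(0, 3), c(0, 0)
)))

cen <- st_centroid(cshape)         # falls in the notch, outside the polygon
pos <- st_point_on_surface(cshape) # guaranteed to fall inside

st_contains(st_sfc(cshape), st_sfc(cen), sparse = FALSE)[1, 1]  # FALSE
st_contains(st_sfc(cshape), st_sfc(pos), sparse = FALSE)[1, 1]  # TRUE
```

A centroid that lands in a neighboring polygon would pick up that polygon's (non-NA) value during the weighting step.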

I pasted below the script that reproduces the result and the issue.

Thank you for your time!

Esteban

[Screenshot: Screen Shot 2022-04-18 at 9.47.38 AM]

------------------ Script starts here -------------------------------

```r
library(tidycensus)
library(data.table)
library(tidyverse)
library(sf)
```

```r
census_api_key("...") # you must acquire your own key at http://api.census.gov/data/key_signup.html
```

```r
bexar_homevalue_15 <- get_acs(geography = "tract", variables = "B25077_001",
                              state = "TX", county = "Bexar",
                              geometry = TRUE, year = 2015) %>%
  select(estimate) %>%
  st_transform(crs = 2847)

bexar_homevalue_20 <- get_acs(geography = "tract", variables = "B25077_001",
                              state = "TX", county = "Bexar",
                              geometry = TRUE, year = 2020) %>%
  st_transform(crs = 2847)
```

```r
library(tigris)
bexar_blocks <- blocks(state = "TX", county = "Bexar", year = 2020)
```

```r
bexar_interpolate_pw <- interpolate_pw(
  bexar_homevalue_15,
  bexar_homevalue_20,
  to_id = "GEOID",
  extensive = TRUE,
  weights = bexar_blocks,
  weight_column = "POP20",
  crs = 2847
) %>%
  st_drop_geometry()
```

```r
bexar_socioeconomic <- merge(x = bexar_homevalue_20, y = bexar_interpolate_pw,
                             by = "GEOID", sort = FALSE)
```

```r
bexar_socioeconomic$home_value_growth <- round(
  (bexar_socioeconomic$estimate.x / bexar_socioeconomic$estimate.y) - 1, 2
)
```

```r
library(viridis)
library(ggplot2)
ggplot(bexar_socioeconomic, aes(fill = home_value_growth)) +
  geom_sf(color = NA) +
  scale_fill_viridis()
```

There is a census tract with 3000% value growth. This is because there are census tracts like 120100 that had NA values in 2015 but non-NA values in 2020.
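For reference, the arithmetic behind that outlier, using the values reported above:

```r
# 2020 estimate divided by the spuriously interpolated 2015 value
round((208300 / 6296.249) - 1, 2)  # 32.08, i.e. roughly 3200% "growth"
```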

```r
bexar_homevalue_15 <- get_acs(geography = "tract", variables = "B25077_001",
                              state = "TX", county = "Bexar",
                              geometry = TRUE, year = 2015)
```

```r
ggplot() +
  geom_sf(data = st_centroid(bexar_blocks[bexar_blocks$TRACTCE20 == "120100", ]),
          aes(geometry = geometry), color = "blue") +
  geom_sf(data = bexar_homevalue_15[bexar_homevalue_15$GEOID == "48029120100", ],
          aes(geometry = geometry), fill = NA) +
  geom_sf(data = bexar_homevalue_20[bexar_homevalue_20$GEOID == "48029120100", ],
          aes(geometry = geometry), color = "red", fill = NA)
```

Although the two census tract boundaries align quite well, the interpolated value for 2015 is 6296.249, while the 2020 value is 208300. Beyond the fact that tract 120100 is a military base (Fort Sam Houston) and hence arguably should not be in the census, how does the interpolate_pw() function come up with a home value of 6296.249 based on NA values for 2015?

```r
library(leaflet)

leaflet() %>%
  addProviderTiles(providers$OpenStreetMap) %>%
  addPolygons(data = st_transform(bexar_blocks[bexar_blocks$TRACTCE20 == "120100", ], crs = 4326),
              fillColor = NA, weight = 0.5) %>%
  addCircles(data = st_centroid(st_transform(bexar_blocks[bexar_blocks$TRACTCE20 == "120100", ], crs = 4326)),
             fillColor = NA) %>%
  addPolygons(data = st_transform(bexar_homevalue_20[bexar_homevalue_20$GEOID == "48029120100", ], crs = 4326),
              color = NA, fillColor = NA)
```

I later forced all values that were NA in 2015 to also be NA in the merged result (I'm not sure this is entirely correct, because I didn't check each of the 9 NA values in 2015).

Filtering the values that should be NAs:

```r
estimate_15_nas <- bexar_homevalue_15[is.na(bexar_homevalue_15$estimate), ]$GEOID
```

```r
View(bexar_socioeconomic[bexar_socioeconomic$GEOID %in% estimate_15_nas, ])
# (assumes the interpolated 2015 estimate column was renamed estimate_mhv_15)
bexar_socioeconomic[bexar_socioeconomic$GEOID %in% estimate_15_nas, ]$estimate_mhv_15 <- NA
```

```r
bexar_socioeconomic$home_value_growth <- round(
  (bexar_socioeconomic$estimate / bexar_socioeconomic$estimate_mhv_15) - 1, 2
) * 100
```

This is the correct plot:

```r
ggplot(bexar_socioeconomic, aes(fill = home_value_growth)) +
  geom_sf(color = NA) +
  scale_fill_viridis() +
  labs(title = "Bexar County Home Value Growth (HVG) by census tract",
       subtitle = "Based on 5-year ACS data estimates from 2015-2020",
       caption = "Values represent percentages of HVG")
```

Although this map is correct for areas that should not have a home value (airports, military bases, etc.), it probably overlooks areas that were undeveloped in 2015 but are new developments in 2020. I feel there should be an option to account for these issues.

------------------ Script ends here -------------------------------

walkerke commented 2 years ago

So, this is an interesting example, and I appreciate the time you've taken to go through it. A couple of things:

[image]

While interpolate_pw() converts polygon weights like blocks to centroids internally, it also allows you to "bring your own" weight placement methodology if you want. For example, an analyst might first want to erase water area from the blocks before using them as interpolation weights. In this example, we can first convert the blocks to points using st_point_on_surface(), which ensures that the block points fall within the block polygons (though not necessarily at their centers):


```r
bexar_blocks <- blocks(state = "TX", county = "Bexar", year = 2020) %>%
  st_point_on_surface()
```

interpolate_pw() will then use those points as weights instead. This produces the expected result:

[image: expected result]
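As an aside, the water-erasing pre-processing mentioned above can be sketched with tigris::erase_water() (a sketch reusing the Bexar County blocks from the original script):

```r
library(tigris)
library(sf)

# Sketch: drop water area from the blocks first, then place the
# weight points on the remaining land area
bexar_blocks <- blocks(state = "TX", county = "Bexar", year = 2020)
bexar_blocks_land <- erase_water(bexar_blocks)
block_points <- st_point_on_surface(bexar_blocks_land)
```

The resulting points could then be passed to interpolate_pw() as the weights argument.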

This raises an interesting implementation question about how to proceed, and I'd welcome your thoughts. We could do one of the following:

estebanlp commented 2 years ago

Hi, Thank you for taking the time to review this.

  1. Apologies for my oversight of using extensive = TRUE instead of extensive = FALSE.
  2. I think that implementing st_point_on_surface() instead of st_centroid() makes more sense for the general case. In my opinion, there are many cases where people will run into the same problem I did, and I cannot think of a case where the bias incurred by not using the centroid would be large enough to force using st_centroid(). I imagine that bias to be a positive function of the area of the census blocks. Maybe an option to minimize this bias is to use st_point_on_surface() only for those centroids falling outside their census tract, and st_centroid() for the rest?
  3. I agree that adding documentation showing that users can pre-process weights is a good idea. That way they can either run with the default weighting scheme or pre-process the weights using whatever makes the most sense for their case. For example, spatial weight matrices could be used, or even more advanced metrics derived from locally weighted regressions.
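The hybrid placement I suggest in point 2 could be sketched like this (hybrid_points() is just an illustrative name, not part of tidycensus or sf):

```r
library(sf)

# Illustrative helper: keep each polygon's centroid when it falls inside
# the polygon, otherwise fall back to st_point_on_surface()
hybrid_points <- function(polys) {
  geom <- st_geometry(polys)
  pts <- st_centroid(geom)
  inside <- diag(st_within(pts, geom, sparse = FALSE))
  pts[!inside] <- st_point_on_surface(geom[!inside])
  st_set_geometry(polys, pts)
}
```

Every returned point is then guaranteed to fall inside its own polygon, while convex polygons keep their true centroids.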

Best,

Esteban

walkerke commented 2 years ago

@estebanlp you make some very good points. I've evaluated an alternative example from my book (https://walker-data.com/census-r/spatial-analysis-with-us-census-data.html#catchment-areas-with-buffers-and-isochrones) where the method is used to aggregate data to a buffer zone, and the two methods do return slightly different results. The primary use case for tidycensus users, however, will be similar to yours, where Census units are used for both from and to; in that case you don't want the weight points to fall outside the polygons.

I think I'll implement my second proposed option above, allowing users to choose a geometry function between st_point_on_surface() and st_centroid() for the weights, defaulting to st_point_on_surface(). This will ensure that blocks with odd shapes intersecting the input shapes are more likely to be included in the weighting process.

walkerke commented 2 years ago

@estebanlp this is now implemented; I've just swapped in st_point_on_surface() for st_centroid(). I think your example is significant enough that we want to try to avoid it as much as possible. Users can always convert weight polygons to centroids themselves if they prefer centroids.

walkerke commented 2 years ago

As an aside for anyone coming across this issue: the current CRAN release actually includes a weight_placement argument that can be set to either "surface" (the default) for point-on-surface placement or "centroid". I figured that, for posterity's sake and for consistency with other implementations (e.g., Esri's), it'd be good to have that option.
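A sketch of the call with the new argument, reusing the object names from the original script (and extensive = FALSE, since median home value is an intensive variable, per the discussion above):

```r
library(tidycensus)
library(sf)

# Sketch assuming the objects defined in the original script
bexar_interpolated <- interpolate_pw(
  from = bexar_homevalue_15,
  to = bexar_homevalue_20,
  to_id = "GEOID",
  extensive = FALSE,              # median home value is intensive
  weights = bexar_blocks,
  weight_column = "POP20",
  weight_placement = "centroid",  # or "surface" (the default)
  crs = 2847
)
```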