BethMattern closed this issue 2 years ago
I suggest that we impute by averaging / median for all neighbors based on contiguous boundaries, or when that is not available, county median / mean.
This is in line with spatial data imputation recommendations. Here are two examples from ArcGIS that I found particularly compelling:
Which fill method and which neighbors to use depend on how the filled data will ultimately be used. For example, a cartographer may want to fill polygons containing missing data to create an aesthetically pleasing map without holes. In this case, calculating the average of many spatial neighbors would be effective. A real estate analyst filling in missing data for the value of a house will use neighbors within a fixed distance and calculate their median value to avoid the influence of outliers.
When choosing the combination of type of neighborhood and fill method, consider which surrounding features would legitimately influence the features with the missing values and which fill method is least likely to bias the results of the analysis. For example, consider a local public health analyst who has childhood lead poisoning data at the census block group level, but a few of the block groups have missing data. The analyst might consider using neighboring block groups that share a border with the block group with missing data and use the maximum of the surrounding values to fill the missing data. Using contiguous block groups can be justified because they likely will contain houses of similar age, and housing age is a known risk factor for lead exposure. While using the maximum value of the surrounding block groups to fill missing values might overestimate the true level of lead poisoning, in this example, where children's health is concerned, it is better to overestimate rather than underestimate the risk.
To me, this suggests we want either the average, the median, or the minimum. The minimum would allow us to cast a wider net of eligibility, but it risks underestimating the true value, and it doesn't mirror the distribution of the overall data. The average seems logically consistent. The median more closely resembles the general distribution of the data, but tracts with NaN values could also have underlying differences in data.
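To make the trade-off between fill methods concrete, here is a minimal sketch. The neighbor values are made up for illustration (they are not from our data); the point is only that a single outlying neighbor pulls the mean much more than the median, while the minimum and maximum sit at the extremes.

```python
from statistics import mean, median

# Hypothetical values for the neighbors of one tract with missing data.
# One neighbor (0.60) is an outlier relative to the others.
neighbor_values = [0.20, 0.22, 0.25, 0.60]

# Candidate fill values, one per method discussed above.
fills = {
    "mean": mean(neighbor_values),      # pulled upward by the outlier
    "median": median(neighbor_values),  # midpoint of the two middle values
    "min": min(neighbor_values),        # lowest neighbor
    "max": max(neighbor_values),        # highest neighbor
}

for method, value in fills.items():
    print(f"{method}: {value:.4f}")
```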
ESRI also says the following:
Tobler’s law implies that the values of the missing data will be like the values of its neighbors in space and/or time. Therefore, we can use average, minimum, maximum, or median of the neighboring values to fill in the missing value. Statisticians call filling in missing values imputation or, in the case of spatial data, geoimputation.
They further state:
You also must decide how to define the set of neighbors that will be used to calculate missing values. Neighbors can be defined based on a variety of spatial relationships. You can define a fixed number of neighbors, choose all neighbors within a fixed distance, or choose neighbors that are contiguous (i.e., share a border or have corners that touch).
I examined two "types" of neighbors -- those that are contiguous and those that are within the same county (using the census geography hierarchy's "next step"). Note that for the former, if no contiguous neighbor has income data, the imputation falls back to the county level.
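The contiguous-then-county logic can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the analysis: the function name `impute_tract`, the dictionary inputs, and the tract and county IDs are all hypothetical, and the contiguity lists are assumed to be precomputed elsewhere. The sketch uses the median, and it only draws on observed values (never on other imputed values).

```python
from statistics import median

def impute_tract(tract_id, values, neighbors, county_of):
    """Return a fill value for tract_id, or None if no data is available.

    values:    tract ID -> observed value, or None when missing (hypothetical input)
    neighbors: tract ID -> list of contiguous tract IDs (assumed precomputed)
    county_of: tract ID -> county ID
    """
    # Step 1: contiguous neighbors that actually have data.
    contiguous = [
        values[n]
        for n in neighbors.get(tract_id, [])
        if values.get(n) is not None
    ]
    if contiguous:
        return median(contiguous)

    # Step 2: no usable contiguous neighbor, so fall back to the
    # other tracts in the same county that have data.
    county = county_of[tract_id]
    same_county = [
        v
        for t, v in values.items()
        if t != tract_id and v is not None and county_of.get(t) == county
    ]
    return median(same_county) if same_county else None

# Made-up example: tracts C and D are missing data.
values = {"A": 0.30, "B": 0.40, "C": None, "D": None, "E": 0.10, "F": 0.50}
county_of = {"A": "X", "B": "X", "C": "X", "D": "Y", "E": "Y", "F": "Y"}
neighbors = {"C": ["A", "B"], "D": ["C"]}

fill_c = impute_tract("C", values, neighbors, county_of)  # median of A and B
fill_d = impute_tract("D", values, neighbors, county_of)  # county Y fallback (E and F)
print(fill_c, fill_d)
```

Here tract D's only contiguous neighbor (C) is itself missing, so D takes the median of the other tracts in its county.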
I chose not to look at all tracts within X distance of the missing census tracts for two reasons. First, I thought it might have a disparate impact on urban areas, which have geographically smaller tracts, vs rural areas (so the rural incomes would be noisier than the urban ones). Second, and more importantly, we'd have to make assumptions about what it means to be "within X distance" -- is the centroid within X distance? is the entire tract within X distance? is any of the tract within X distance? -- and I didn't immediately see a benefit. But, happy to chat about this!
One note here is that I don't see much literature on whether we should calculate the proportional "boundary overlap" and construct a weighted mean. Similarly, I don't see much on population-weighting. This suggests to me that neither of these methods is preferred by people who do a lot of spatial analysis. I also don't see much on non-geo-based nearest neighbors.
Sources:
Some analysis:
Note that I imputed 'Percent of individuals below 200% Federal Poverty Line'
In addition -- loosely speaking (since this doesn't account for donut hole analysis, and the percentiles have not been re-run), imputing would add about 100 new tracts (taking imputed median).
Moved to review -- will discuss in meeting
Tracts that don't have income data in the census data set can't qualify as a DAC, so that data should be imputed using income data from the surrounding tracts.
We should only evaluate the impact of making this change on the DACs universe.