usds / justice40-tool

A tool to identify disadvantaged communities due to environmental, socioeconomic and health burdens
https://screeningtool.geoplatform.gov/
Creative Commons Zero v1.0 Universal

Research # of thresholds exceeded data to decide if it is useful/valid to visualize this on the map #1268

Closed BethMattern closed 2 years ago

BethMattern commented 2 years ago

We are going to consider this in a few ways:

  1. How do categories look, in terms of correlation within and between?
  2. How does each indicator look, in terms of correlation between any two and among the whole dataset?
  3. How do different measures of counting up thresholds prioritize / deprioritize certain areas? Are rural areas still represented effectively?

In addition, we might want to consider how the "next" group would be included. How would these correlations shift under other thresholds? How does Jaccard similarity change over different cuts of the data?
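For reference, a minimal sketch of how Jaccard similarity between two sets of flagged tracts could be compared across cuts. The tract IDs, percentile values, and the `flagged` helper are all hypothetical; only the Jaccard definition itself (intersection over union) is standard.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; treated as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def flagged(percentiles: dict, cutoff: float) -> set:
    """Return the tract IDs whose percentile is at or above the cutoff."""
    return {tract for tract, pct in percentiles.items() if pct >= cutoff}


# Hypothetical percentiles for two indicators, keyed by made-up tract ID.
diabetes = {"t1": 95, "t2": 88, "t3": 91, "t4": 70, "t5": 83}
pm25 = {"t1": 92, "t2": 93, "t3": 85, "t4": 60, "t5": 81}

# Compare the overlap of the two flagged sets at two different cuts.
for cutoff in (90, 80):
    sim = jaccard(flagged(diabetes, cutoff), flagged(pm25, cutoff))
    print(f"cutoff {cutoff}: Jaccard = {sim:.2f}")
```

In this toy data the two indicators overlap much more at the 80th-percentile cut than at the 90th, which is the kind of shift the question above is asking about.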

CEQ's intern Gianna has started to look a little bit at correlation between individual indicators. I am interested in continuing this work.

BethMattern commented 2 years ago

Note from @emma-nechamkin: if you are in the 99th percentile for all categories of our environmental indicators, but 64th percentile for low income, you WON'T be a DAC... but if you're in the 90th percentile for just one and 65th percentile for low income, you WILL be a DAC. I think we can look into this a bit more when we analyze how many thresholds a tract exceeds.
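To make the edge case concrete, here is a rough sketch of the rule as the note describes it: at least one indicator at or above the 90th percentile AND low income at or above the 65th. The two cutoffs come from the note; the indicator names, the inclusive comparisons, and the function names are assumptions for illustration, not the tool's actual code.

```python
INDICATOR_CUTOFF = 90   # from the note above
LOW_INCOME_CUTOFF = 65  # from the note above


def thresholds_exceeded(indicators: dict) -> int:
    """Count the indicators at or above the indicator cutoff."""
    return sum(1 for pct in indicators.values() if pct >= INDICATOR_CUTOFF)


def is_dac(indicators: dict, low_income_pct: float) -> bool:
    """Assumed rule: at least one indicator threshold exceeded AND low income."""
    return thresholds_exceeded(indicators) >= 1 and low_income_pct >= LOW_INCOME_CUTOFF


# Hypothetical indicator names and percentiles.
tract_a = {"pm25": 99, "diesel": 99, "traffic": 99}
tract_b = {"pm25": 90, "diesel": 40, "traffic": 35}

print(thresholds_exceeded(tract_a), is_dac(tract_a, low_income_pct=64))  # 3 False
print(thresholds_exceeded(tract_b), is_dac(tract_b, low_income_pct=65))  # 1 True
```

The threshold count is what separates the two tracts here: tract A exceeds three thresholds and is excluded, tract B exceeds one and is included.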

emma-nechamkin commented 2 years ago

A few high-level notes here

  1. Categories track some, but not all, metrics of disadvantage. We should discuss in a meeting. Screen Shot 2022-05-03 at 3.42.50 PM.png

  2. Categories are not strongly correlated (as 1/0 flags), but certain indicators ARE (e.g., income and housing burden are quite correlated). --> might want to look at PCA / LDA
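A small sketch of the kind of check behind point 2, using a hand-rolled Pearson correlation (on 0/1 flags this is the phi coefficient). All values and column names below are invented; this is not the analysis that produced the screenshot.

```python
from math import sqrt


def pearson(x: list, y: list) -> float:
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


# Hypothetical data, one entry per tract.
climate_flag = [1, 0, 1, 0, 0, 1, 0, 0]        # 1/0 category flags
health_flag = [0, 0, 1, 1, 0, 0, 1, 0]
low_income_pct = [80, 35, 72, 55, 20, 90, 61, 44]      # indicator percentiles
housing_burden_pct = [77, 30, 70, 60, 25, 85, 58, 50]

print(f"category flags: {pearson(climate_flag, health_flag):.2f}")
print(f"indicators:     {pearson(low_income_pct, housing_burden_pct):.2f}")
```

With this made-up data the category flags are nearly uncorrelated while the two indicators are strongly correlated, mirroring the pattern described above.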

Note that none of the analysis below includes donut hole DACs.

emma-nechamkin commented 2 years ago

Another question that we had pertained to "just below the threshold" tracts. I think (preliminarily) that this issue is overblown in our collective imagination.

There are always going to be tracts that fall just outside of whatever boundary we set, by nature of setting any boundary. With that in mind, we can look at three counts: (1) the number of tracts that fall between the 80th and 90th percentile for our indicators, (2) the number of those tracts that are also low income, and (3) the number of those tracts that are low income AND are not already identified by the tool (narwhal). TL;DR: most of these boundary tracts are already included.
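As a rough illustration of those counts, here is a sketch over made-up tracts. The field names, the single-indicator simplification, and the choice to treat the band as 80 <= pct < 90 are assumptions; the real numbers will be in the notebook.

```python
# Hypothetical tracts; every value here is invented for illustration.
tracts = [
    {"id": "t1", "pct": 85, "low_income": True, "narwhal": True},
    {"id": "t2", "pct": 82, "low_income": True, "narwhal": True},
    {"id": "t3", "pct": 88, "low_income": True, "narwhal": False},
    {"id": "t4", "pct": 84, "low_income": False, "narwhal": False},
    {"id": "t5", "pct": 95, "low_income": True, "narwhal": True},
    {"id": "t6", "pct": 60, "low_income": True, "narwhal": False},
    {"id": "t7", "pct": 81, "low_income": True, "narwhal": True},
]


def in_band(pct: float, low: float = 80, high: float = 90) -> bool:
    """Assumed band: at or above `low` and strictly below `high`."""
    return low <= pct < high


band = [t for t in tracts if in_band(t["pct"])]
band_low_income = [t for t in band if t["low_income"]]
band_not_flagged = [t for t in band_low_income if not t["narwhal"]]

# Share of low-income boundary tracts the tool already identifies.
# (Assumes band_low_income is non-empty, which holds for this sample.)
share_flagged = 1 - len(band_not_flagged) / len(band_low_income)

print(len(band), len(band_low_income), len(band_not_flagged))  # 5 4 1
print(f"share already flagged: {share_flagged:.0%}")  # 75%
```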

We can also look at a few distributions, which add a wrinkle / complication here. For the "would be" inclusions, the distribution of the number of thresholds exceeded is shifted ever so slightly to the right. However, at least preliminarily, I'm not sure how big of an impact this has.

These distributions, listed here, will be in a research notebook posted soon.

  1. The number of thresholds exceeded under score narwhal
  2. The number of thresholds exceeded at 80th+ (ignoring income and narwhal)
  3. The number of thresholds exceeded at 80th+ by tracts that would be eligible by income and are not included in narwhal
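Until the notebook is up, a sketch of how these three distributions could be tallied. Everything in it is hypothetical: the tract data, the field names, the 90th-percentile cutoff used for the narwhal count, and the reading of distribution 3 as "low income, not in narwhal, and at least one indicator at 80th+".

```python
from collections import Counter


def count_exceeded(indicators: dict, cutoff: float) -> int:
    """Number of indicators at or above the cutoff."""
    return sum(1 for pct in indicators.values() if pct >= cutoff)


def distribution(tracts: list, cutoff: float) -> Counter:
    """Map 'number of thresholds exceeded' -> number of tracts."""
    return Counter(count_exceeded(t["indicators"], cutoff) for t in tracts)


# Hypothetical tracts with three made-up indicators each.
tracts = [
    {"indicators": {"a": 95, "b": 92, "c": 50}, "low_income": True, "narwhal": True},
    {"indicators": {"a": 91, "b": 60, "c": 40}, "low_income": True, "narwhal": True},
    {"indicators": {"a": 85, "b": 88, "c": 82}, "low_income": True, "narwhal": False},
    {"indicators": {"a": 83, "b": 70, "c": 30}, "low_income": True, "narwhal": False},
    {"indicators": {"a": 96, "b": 97, "c": 93}, "low_income": False, "narwhal": False},
    {"indicators": {"a": 40, "b": 50, "c": 20}, "low_income": True, "narwhal": False},
]

# 1. Thresholds exceeded by tracts narwhal already identifies (assumed 90th cutoff).
dist_narwhal = distribution([t for t in tracts if t["narwhal"]], cutoff=90)

# 2. Thresholds exceeded at 80th+, ignoring income and narwhal.
dist_80 = distribution(tracts, cutoff=80)

# 3. Thresholds exceeded at 80th+ by low-income tracts not in narwhal
#    that exceed at least one threshold.
would_be = [
    t for t in tracts
    if t["low_income"] and not t["narwhal"] and count_exceeded(t["indicators"], 80) >= 1
]
dist_would_be = distribution(would_be, cutoff=80)

for name, dist in [("narwhal", dist_narwhal), ("80th+", dist_80), ("would-be", dist_would_be)]:
    print(name, sorted(dist.items()))
```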

In re "the next 10%" -- there are a few indicators for which there appear to be disadvantaged tracts in the next 10% (e.g., diabetes, pm2.5), but the point remains that MOST tracts are already included.

We can also look at the "share already flagged", which is universally quite high.

emma-nechamkin commented 2 years ago

Even the most rudimentary scales for totaling categories reflect metrics of underlying disadvantage better, on average, than a straight sum. Consider the graph below: the blue line is adjusted as sum(category in territory / count positive for category) * max(new sum) / max(straight sum).

Screen Shot 2022-05-04 at 2.29.59 PM.png

This regression suggests that even this basic new sum may better represent underlying burden.
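One possible reading of that adjusted sum, sketched on made-up flags: weight each category flag by the inverse of the number of tracts flagged for that category, then rescale so the weighted sum shares a maximum with the straight sum. The direction of the rescaling is an assumption (the comment writes the factor as max(new sum) / max(straight sum)), and the category names and data are hypothetical.

```python
# Hypothetical 1/0 category flags per tract.
flags = {
    "t1": {"climate": 1, "health": 1, "housing": 0},
    "t2": {"climate": 1, "health": 0, "housing": 0},
    "t3": {"climate": 1, "health": 0, "housing": 1},
    "t4": {"climate": 0, "health": 0, "housing": 0},
}
categories = ["climate", "health", "housing"]

# Number of tracts flagged for each category.
positives = {c: sum(f[c] for f in flags.values()) for c in categories}

# Straight sum: number of categories flagged per tract.
straight = {t: sum(f[c] for c in categories) for t, f in flags.items()}

# Weighted ("new") sum: rarer categories contribute more per flag.
weighted = {
    t: sum(f[c] / positives[c] for c in categories if positives[c] > 0)
    for t, f in flags.items()
}

# Assumed rescaling: put the weighted sum on the straight sum's scale.
max_weighted = max(weighted.values())
scale = max(straight.values()) / max_weighted if max_weighted > 0 else 0.0
adjusted = {t: w * scale for t, w in weighted.items()}

for t in flags:
    print(t, straight[t], round(adjusted[t], 2))
```

In this toy data a tract flagged only for the common category (t2) drops from 1 to 0.5, while tracts that also carry a rare flag keep the top score.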