usds / justice40-tool

A tool to identify disadvantaged communities due to environmental, socioeconomic and health burdens
https://screeningtool.geoplatform.gov/
Creative Commons Zero v1.0 Universal

Task: Examining correlation between existing CEJST thresholds #1478

Open giannatparisi opened 2 years ago

giannatparisi commented 2 years ago

Describe the task

So far I have made a correlation matrix for the CEJST thresholds and have begun the process of collapsing thresholds by category.

The NRI indicators (agloss, poploss, buildloss) seem fairly disjoint from the other indicators in terms of which tracts they “pick up”. The same holds, to a lesser extent, for the pollution indicators, such as RMP sites: they are not strongly correlated with anything outside the other environmental indicators. This is not the case for the health, T&WD, etc. indicators.

Additionally, there are no moderate or strong negative correlations. This is as expected: we are working with environmental and economic indicators here, which are known to be positively correlated with one another.
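For anyone who wants to reproduce this kind of check, here is a minimal sketch in base R of building a correlation matrix and scanning it for negative correlations. The data is simulated and the column names and the -0.3 cutoff are placeholders chosen for illustration, not the real CEJST fields or any threshold used in the analysis.

```r
# Hypothetical toy data: column names are placeholders, not the real CEJST field names.
set.seed(1)
n <- 200
burden <- rnorm(n)                      # shared latent factor for the "health" columns
df <- data.frame(
  asthma   = burden + rnorm(n, sd = 0.5),
  diabetes = burden + rnorm(n, sd = 0.5),
  agloss   = rnorm(n)                   # independent of the others by construction
)
df$asthma[c(3, 17)] <- NA               # some tracts are missing some indicators

# Pairwise-complete Pearson correlations, so one missing value
# does not remove a tract from every other pair.
corr <- cor(df, use = "pairwise.complete.obs")

# Find indicator pairs whose correlation falls below a hypothetical negative cutoff.
cutoff <- -0.3
neg <- which(corr < cutoff & upper.tri(corr), arr.ind = TRUE)

print(round(corr, 2))
print(nrow(neg))                        # number of pairs below the cutoff
```

Restricting the scan to `upper.tri(corr)` avoids counting each pair twice and skips the diagonal.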

Acceptance Criteria

emma-nechamkin commented 2 years ago

Thanks so much, Gianna! I'll take a look at this tomorrow.

emma-nechamkin commented 2 years ago

Gianna, really nice work to start!

For the next task, I think you want to construct a NEW field rather than mapping the max. Using dplyr, I think you will want something like:

library(dplyr)
df %>%
    rowwise() %>% # you could also group_by census tract, which makes less intuitive sense but is faster
    mutate(max_var_in_this_group = max(var1, var2, var3, ..., var_n)) %>%
    ungroup() # drop the row-wise grouping afterwards

You might have to handle rows where every value is NaN (or you can drop them).

This will produce a new variable that is the maximum value across all included variables.

The code you wrote above does not create a new variable.
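On the all-NaN rows mentioned above, one possible alternative is base R's `pmax()`, sketched here on made-up data. The data frame and the `var1`..`var3` columns are placeholders for the thresholds in one category, and dropping the all-missing rows is just one option, not necessarily the right choice for this analysis.

```r
# Hypothetical toy data; var1..var3 stand in for the thresholds in one category.
df <- data.frame(
  var1 = c(0.2, NA, NA),
  var2 = c(0.9, 0.4, NA),
  var3 = c(0.5, 0.1, NA)
)

# pmax() is the vectorised row-wise maximum. With na.rm = TRUE it ignores
# missing values and yields NA only where every input is missing.
df$max_var_in_this_group <- pmax(df$var1, df$var2, df$var3, na.rm = TRUE)

# Optionally drop the rows where all inputs were missing.
df_complete <- df[!is.na(df$max_var_in_this_group), ]

print(df)
```

Because `pmax()` is vectorised it avoids the per-row overhead of `rowwise()`, and it still creates a new column rather than overwriting an existing one.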

adwel81 commented 2 years ago

This is great! Curious if it's possible to include the two "screening" indicators in here too: Low Income (% below 200% FPL) and Higher Educational Enrollment. Also, is your analysis at the national or state level? I wonder if correlation results may be different at the state level. I was looking at Educational Enrollment, Educational Attainment and Age for New York State this morning and found interesting relationships between the age distribution and educational enrollment, and between educational enrollment and high school attainment, posted in https://github.com/usds/justice40-tool/issues/1509. Seems super valuable to keep looking. Thanks!
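A rough sketch of the national versus state-level comparison, in base R on simulated data. The `state`, `enrollment` and `attainment` columns are hypothetical placeholders, not the actual CEJST field names, and the numbers carry no meaning beyond illustrating the mechanics.

```r
# Hypothetical toy data: `state`, `enrollment` and `attainment` are placeholder names.
set.seed(2)
df <- data.frame(
  state      = rep(c("NY", "CA", "TX"), each = 50),
  enrollment = runif(150)
)
df$attainment <- 0.5 * df$enrollment + rnorm(150, sd = 0.1)

# National correlation: one number across all tracts.
national <- cor(df$enrollment, df$attainment, use = "complete.obs")

# State-level correlations: split the rows by state, then correlate within each group.
by_state <- sapply(split(df, df$state), function(g) {
  cor(g$enrollment, g$attainment, use = "complete.obs")
})

print(national)
print(round(by_state, 2))
```

Comparing `by_state` against `national` shows whether any state departs noticeably from the pooled relationship.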