Aggregation - Githubissues

Joshdpaul commented 4 months ago

Apologies in advance for all the black reformatting changes! You can safely ignore a lot of the diff.

The real heart of this PR (and the only major change from the previously reviewed work) is the functions.aggregate_results() function. This function takes the places with multiple geographic units (currently only JBER and Eagle River) and aggregates the values according to the guidance from the PI. The operation requires converting the values from percentages back to population counts, then summing the counts and re-computing the percentage. GEOID strings and place names of aggregated rows are also concatenated.
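
A minimal sketch of the aggregation described above, not the actual functions.aggregate_results() code. The row layout, the column names "GEOID", "name", and "pct_example", the ", " separator, and the sample values are all assumptions made for illustration; only total_population appears in this thread.

```python
def aggregate_rows(rows, pct_columns, sep=", "):
    """Combine several geographic units into one aggregated row (illustrative only)."""
    total_pop = sum(r["total_population"] for r in rows)
    out = {
        # GEOID strings and place names are concatenated (separator is an assumption)
        "GEOID": sep.join(r["GEOID"] for r in rows),
        "name": sep.join(r["name"] for r in rows),
        "total_population": total_pop,
    }
    for col in pct_columns:
        # percentage -> population count per unit, then sum the counts
        count = sum(r["total_population"] * r[col] / 100 for r in rows)
        # re-compute the percentage against the combined population
        out[col] = count / total_pop * 100
    return out


# Hypothetical input rows, not real data from data_to_export.csv
rows = [
    {"GEOID": "A", "name": "Unit A", "total_population": 1000, "pct_example": 10.0},
    {"GEOID": "B", "name": "Unit B", "total_population": 3000, "pct_example": 2.0},
]
result = aggregate_rows(rows, ["pct_example"])
print(result)
```

Because the larger unit carries more weight, the aggregated value here lands closer to Unit B's percentage than a plain average of the two percentages would.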

I also incorporated guidance from the PI about dealing with NA values. The resulting data_to_export.csv now has way more actual data values and fewer NAs. Some new race / ethnicity variables were added as well.

Same as the last PR, I called a few of the individual data fetching functions in fetch_data_and_export.ipynb and included some URL request printing to allow you to see the API requests, the JSON returned, and compare them with the reformatted function outputs.

TO TEST:

- Read the updated README.md, check for clarity / typos / etc.
- Run fetch_data_and_export.ipynb, paying particular attention to the testing cells in the first part of the notebook. Investigate some of the URLs, and make sure the values you see in the printed dataframes match those returned by the API. (I recently ran into some timeout issues with the API, but they seemed to resolve themselves and it was working normally within a few hours. If you time out, or receive any other errors when running the run_fetch_and_merge() function in the notebook, maybe wait an hour and try again.)
- Review the code for the functions.aggregate_results() function, and confirm that the aggregation math makes sense.
- Review data_to_export.csv. The result should be identical to the version committed in this branch.

Joshdpaul commented 4 months ago

Hey @cstephen, thank you once again for the in-depth review!

I stumbled on the math here a bit too. Long story short, we can't average percentages to get the aggregated value. Instead we need to convert each percentage to an actual count, then sum those counts and compute the percentage of the grand total population. The math comes out slightly different, which surprised me. It seems natural to just average the percentages, and it's always pretty close, but apparently it's not allowed.

The guidance in list item 2 here spells it out and that's what I followed.

Here is the math for the pct_no_hsdiploma variable for Eagle River:

Get the grand total population from total_population: 4318 + 6384 + 3582 + 7421 + 3413 = 25118
Convert pct_no_hsdiploma to an actual count of persons with no high school diploma: (4318*(7.1/100)) + (6384*(3.9/100)) + (3582*(1.9/100)) + (7421*(0.8/100)) + (3413*(0.9/100)) = 713.697
Calculate the aggregated pct_no_hsdiploma from the count and the grand total population: (713.697/25118)*100 = 2.84137670197
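
The Eagle River numbers above can be checked with a few lines of plain Python. This is only a verification sketch: the variable names are made up here, and the simple average is computed purely for comparison, not because the code in the PR does it.

```python
# Values copied from the Eagle River example above
populations = [4318, 6384, 3582, 7421, 3413]
pct_no_hsdiploma = [7.1, 3.9, 1.9, 0.8, 0.9]

# Step 1: grand total population
grand_total = sum(populations)

# Step 2: convert each percentage to a count and sum the counts
count = sum(pop * pct / 100 for pop, pct in zip(populations, pct_no_hsdiploma))

# Step 3: re-compute the percentage from the summed count
aggregated_pct = count / grand_total * 100

# For comparison only: the naive, unweighted average of the percentages
simple_average = sum(pct_no_hsdiploma) / len(pct_no_hsdiploma)

print(grand_total)                # 25118
print(round(count, 3))            # 713.697
print(round(aggregated_pct, 4))   # 2.8414
print(round(simple_average, 2))   # 2.92
```

The population-weighted result (about 2.84) and the simple average (2.92) are close but not equal, which is the small discrepancy mentioned above.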

cstephen commented 4 months ago

@Joshdpaul, that makes perfect sense and yes, now I realize why you can't simply average together percentages like that without taking into account the counts that make up the percentages. Really basic stuff actually 🤦 Thanks for thinking this through carefully. Approved!

ua-snap / epa-justice

Aggregation #7