Closed: Joshdpaul closed this pull request 3 months ago.
Hey @cstephen, thank you once again for the in-depth review!
I stumbled on the math here a bit too. Long story short, we can't average percentages to get the aggregated value. Instead we need to convert each percentage to an actual count, sum those counts, and then take that sum as a percentage of the grand total population. The math comes out slightly different, which surprised me. It seems natural to just average the percentages, and it's always pretty close, but that isn't correct.
The guidance in list item 2 here spells it out and that's what I followed.
Here is the math for the `pct_no_hsdiploma` variable for Eagle River:

1. Get the grand total population from `total_population`:
   4318 + 6384 + 3582 + 7421 + 3413 = 25118
2. Convert `pct_no_hsdiploma` to an actual count of no-HS-diploma persons:
   (4318 * (7.1/100)) + (6384 * (3.9/100)) + (3582 * (1.9/100)) + (7421 * (0.8/100)) + (3413 * (0.9/100)) = 713.697
3. Calculate the aggregated `pct_no_hsdiploma` using the count and the grand total population:
   (713.697 / 25118) * 100 = 2.84137670197
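The steps above can be checked with a few lines of Python, using the same numbers from this comment. It also shows why the naive approach fails: the plain mean of the five percentages (2.92) is close to, but not equal to, the population-weighted result (2.8414...).

```python
# Eagle River example from the comment above.
populations = [4318, 6384, 3582, 7421, 3413]  # total_population per geographic unit
pcts = [7.1, 3.9, 1.9, 0.8, 0.9]              # pct_no_hsdiploma per geographic unit

# Step 1: grand total population
grand_total = sum(populations)                # 25118

# Step 2: percentage -> actual count, then sum the counts
count = sum(p * (pct / 100) for p, pct in zip(populations, pcts))  # 713.697

# Step 3: aggregated percentage from the summed count
aggregated_pct = count / grand_total * 100    # 2.84137670197...

# The naive mean of the percentages ignores the population weights,
# so it lands nearby but is not the same number.
naive_mean = sum(pcts) / len(pcts)            # 2.92
```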
@Joshdpaul, that makes perfect sense and yes, now I realize why you can't simply average together percentages like that without taking into account the counts that make up the percentages. Really basic stuff actually 🤦 Thanks for thinking this through carefully. Approved!
Apologies in advance for all the `black` reformatting changes! You can safely ignore a lot of the diff. The real heart of this PR (and the only major change from the previously reviewed work) is the `functions.aggregate_results()` function. This function takes the places with multiple geographic units (currently only JBER and Eagle River) and aggregates the values according to the guidance from the PI. The operation requires converting the values from percentages back to population counts, then summing the counts and re-computing the percentage. GEOID strings and place names of aggregated rows are also concatenated.

I also incorporated guidance from the PI about dealing with NA values. The resulting `data_to_export.csv` now has many more actual data values and fewer NAs. Some new race / ethnicity variables were added as well.

Same as the last PR, I called a few of the individual data-fetching functions in `fetch_data_and_export.ipynb` and included some URL request printing to let you see the API requests and the JSON returned, and compare them with the reformatted function outputs.

TO TEST:

- Read `README.md` and check for clarity / typos / etc.
- Run `fetch_data_and_export.ipynb`, paying particular attention to the testing cells in the first part of the notebook. Investigate some of the URLs, and make sure the values you see in the printed dataframes match those returned by the API. (I recently ran into some timeout issues with the API, but it seemed to resolve itself and was working normally within a few hours. If you time out, or receive any other errors when running the `run_fetch_and_merge()` function in the notebook, maybe wait an hour and try again.)
- Review the `functions.aggregate_results()` function and confirm that the aggregation math makes sense.
- Regenerate `data_to_export.csv`. The result should be identical to the version committed in this branch.
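For reviewers skimming the aggregation logic, the idea behind `functions.aggregate_results()` can be sketched as below. This is a hypothetical minimal version, not the repo's actual implementation: the column names (`geoid`, `name`, `total_population`) and the `;` separator are assumptions for illustration.

```python
import pandas as pd

def aggregate_place(df: pd.DataFrame, pct_cols: list[str]) -> pd.Series:
    """Collapse the rows of one multi-unit place (e.g. Eagle River) into one row.

    Sketch only: column names `geoid`, `name`, and `total_population` are
    assumed, not taken from the actual repo schema.
    """
    grand_total = df["total_population"].sum()
    out = {
        # Concatenate GEOID strings and place names of the aggregated rows.
        "geoid": ";".join(df["geoid"]),
        "name": ";".join(df["name"]),
        "total_population": grand_total,
    }
    for col in pct_cols:
        # percentage -> count per unit, sum the counts, then back to a percentage
        counts = df["total_population"] * df[col] / 100
        out[col] = counts.sum() / grand_total * 100
    return pd.Series(out)
```

Called on a frame holding Eagle River's five units, this reproduces the 2.8414 figure worked out earlier in the thread.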