ua-snap / epa-justice

US Census and CDC data access via API
MIT License

Aggregation #7

Closed: Joshdpaul closed this 3 months ago

Joshdpaul commented 4 months ago

Apologies in advance for all the black reformatting changes! You can safely ignore a lot of the diff.

The real heart of this PR (and the only major change from the previously reviewed work) is the functions.aggregate_results() function. This function takes the places with multiple geographic units (currently only JBER and Eagle River) and aggregates the values according to the guidance from the PI. The operation requires converting the values from percentages back to population counts, then summing the counts and re-computing the percentage. GEOID strings and place names of aggregated rows are also concatenated.
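For reviewers who want the gist without reading the diff, here is a minimal sketch of the percentage-to-count-to-percentage step described above. It is not the actual body of functions.aggregate_results(): the helper name aggregate_rows, the column names GEOID, name, total_population and pct_example, and the sample values are all hypothetical, and NA handling is left out.

```python
def aggregate_rows(rows, pct_columns, pop_column="total_population"):
    """Combine several geographic units into one row (illustrative only).

    `rows` is a list of dicts; every column name used here is hypothetical.
    """
    total_pop = sum(row[pop_column] for row in rows)
    combined = {
        # Concatenate identifiers so the aggregated row stays traceable.
        "GEOID": ",".join(row["GEOID"] for row in rows),
        "name": ", ".join(row["name"] for row in rows),
        pop_column: total_pop,
    }
    for col in pct_columns:
        # Convert each unit's percentage back to a count, then sum the counts.
        count = sum(row[col] / 100 * row[pop_column] for row in rows)
        # Re-compute the percentage against the combined population.
        combined[col] = round(count / total_pop * 100, 2)
    return combined


# Made-up values, not real Census or CDC data.
units = [
    {"GEOID": "A1", "name": "Unit A", "total_population": 1000, "pct_example": 10.0},
    {"GEOID": "B2", "name": "Unit B", "total_population": 3000, "pct_example": 20.0},
]
result = aggregate_rows(units, ["pct_example"])
print(result)
```

With these made-up inputs, the combined percentage is weighted toward the larger unit, so it does not land on the midpoint of the two input percentages.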

I also incorporated guidance from the PI about dealing with NA values. The resulting data_to_export.csv now has way more actual data values and fewer NAs. Some new race / ethnicity variables were added as well.

Same as the last PR, I called a few of the individual data fetching functions in fetch_data_and_export.ipynb and included some URL request printing to allow you to see the API requests, the JSON returned, and compare them with the reformatted function outputs.

TO TEST:

Joshdpaul commented 4 months ago

Hey @cstephen, thank you once again for the in-depth review!

I stumbled on the math here a bit too. Long story short, we can't average percentages to get the aggregated value. Instead, we need to convert each percentage to an actual count, sum those counts, and then compute the percentage of the grand total population. The math comes out slightly different, which surprised me. It seems natural to just average the percentages, and it's always pretty close, but apparently that's not allowed.
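A short derivation may help show why the two approaches differ and why they usually land close together. The symbols are illustrative assumptions, not taken from the PR: two units with populations $n_1, n_2$ and percentages $p_1, p_2$.

```latex
% Count-based aggregate: convert to counts, sum, divide by the total population.
p_{\text{agg}} = \frac{p_1 n_1 + p_2 n_2}{n_1 + n_2}

% Simple average of the two percentages.
p_{\text{avg}} = \frac{p_1 + p_2}{2}

% Difference between the two.
p_{\text{agg}} - p_{\text{avg}} = \frac{(p_1 - p_2)(n_1 - n_2)}{2\,(n_1 + n_2)}
```

The difference vanishes only when the populations are equal or the percentages are equal, and it stays small when either pair is close. With made-up values $n_1 = 1000$, $p_1 = 10$, $n_2 = 3000$, $p_2 = 20$, the count-based result is 17.5 while the simple average is 15.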

The guidance in list item 2 here spells it out and that's what I followed.

Here is the math for the pct_no_hsdiploma variable for Eagle River:

cstephen commented 4 months ago

@Joshdpaul, that makes perfect sense and yes, now I realize why you can't simply average together percentages like that without taking into account the counts that make up the percentages. Really basic stuff actually 🤦 Thanks for thinking this through carefully. Approved!