theonaunheim / surgeo

Open Source Proxy Demographic module written in Python
MIT License
32 stars 16 forks

Fill NA data with population level statistics? #19

Open follperson opened 2 years ago

follperson commented 2 years ago

Hey - great idea and implementation, thanks for putting this together!

Noticing that when I'm getting probabilities with the BIFSG model, I get null probabilities if any one of my input features is either null or doesn't show up in the census data. It would be great if there were an option to override the null probabilities that get introduced in these intermediate steps.

ex: if the ZCTA is absent but the first and last names are present in the census data, then before we combine the probabilities, we fill the null ZIP code data with the population-level statistics and calculate the combined probability from that (alternatively, we could just leave it out of the calculation; not sure which is preferable). Perhaps a 'backfill with aggregate statistics' flag parameter for each of the components would be good.

Currently looks like this, but we could definitely eke out some information here instead of leaving it null:

       zcta5 first_name surname     white     black       api    native  multiple  hispanic
    0  90210    RANDALL  ZZZZZZ       NaN       NaN       NaN       NaN       NaN       NaN
    1  90210     QQQQQQ   AARON       NaN       NaN       NaN       NaN       NaN       NaN
    2  99999    RANDALL   AARON       NaN       NaN       NaN       NaN       NaN       NaN
    3  90210    RANDALL   AARON  0.972583  0.004928  0.000934  0.000053  0.020869  0.000633
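Here's a rough sketch of the kind of fallback I mean, for a single record. It assumes the combination step is a naive-Bayes style product of a surname prior P(race | surname) and per-race likelihoods for first name and ZCTA, which may not match how surgeo structures it internally. The `combine` function, its arguments, and all of the numbers are hypothetical and only there for illustration.

```python
RACES = ("white", "black", "api", "native", "multiple", "hispanic")


def combine(prior, likelihoods, population_prior):
    """Return normalized P(race | inputs) for one record (hypothetical helper).

    prior            -- dict race -> P(race | surname), or None if lookup failed
    likelihoods      -- list of dicts race -> P(feature | race); None = failed lookup
    population_prior -- dict race -> population-level P(race), used as the backfill
    """
    # Failed surname lookup: start from the population-level distribution.
    base = population_prior if prior is None else prior
    scores = {race: base[race] for race in RACES}
    for lik in likelihoods:
        if lik is None:
            # Failed lookup: skip this factor, which is the same as
            # multiplying every race by one shared constant.
            continue
        for race in RACES:
            scores[race] *= lik[race]
    total = sum(scores.values())
    return {race: score / total for race, score in scores.items()}


# Toy numbers for illustration only; these are NOT census values.
POPULATION = {"white": 0.60, "black": 0.12, "api": 0.06,
              "native": 0.01, "multiple": 0.03, "hispanic": 0.18}
SURNAME_PRIOR = {"white": 0.70, "black": 0.20, "api": 0.02,
                 "native": 0.01, "multiple": 0.02, "hispanic": 0.05}
FIRST_NAME_LIK = {"white": 0.004, "black": 0.002, "api": 0.001,
                  "native": 0.001, "multiple": 0.002, "hispanic": 0.001}

# ZCTA lookup failed (None); surname and first name were found.
result = combine(SURNAME_PRIOR, [FIRST_NAME_LIK, None], POPULATION)
print(result)
```

Under that product assumption, filling a missing likelihood with a value that is constant across races gives the same normalized answer as leaving it out, so the two options above only really differ for the prior term, where the population-level distribution is the natural fallback.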