Generated population doesn't match input controls

gregmacfarlane commented 6 years ago

Running the example notebook with the included marginals control file, I noticed a discrepancy between the num_people_count field and the resulting number of synthetic households.

The total number of households in the tracts contained in the marginals file, controls.data['num_people_count'].sum(), is 46,945. (The name of this field is also somewhat misleading, because ACS table B11016 is a table of households by number of people, not the number of people, but that's not the issue here.) When I generate the population for the PUMA in the example, the resulting household count, population.generated_households['household_id'].count(), is 73,644. BTW, the total weighted households in the PUMS data is 97,841.

I wanted to see if this error was sensitive to the marginals file. So I deleted all but the first nine tracts in the file, whittling the number of households in the included tracts to 16,889. In this case, doppelganger returned a population with 54,421 households.

Is there an additional step where I need to downsample the synthetic population to match the marginal targets? Is there something that I don't understand? I've included my script in this gist; I used the most recent commit on master, running in Python 3.

python3 accuracy.py
INFO:__main__:Loading configuration and data
INFO:__main__:Loading model
INFO:__main__:File       PUMS    Controls    Generated
INFO:__main__:sample_data/marginals_00106.csv        97841   46945   73644
INFO:__main__:sample_data/marginals_00106_modified.csv       97841   16889   54421

gregmacfarlane commented 6 years ago

ccing my colleague @josiekre

alexeisw commented 6 years ago

Hi Greg, Hi Josie,

The data you are feeding into the algorithm is inconsistent, in the sense that the household weights do not sum (not even approximately) to the same value as the marginals data for the total number of households. As a result, the doppelganger allocation algorithm returns a population whose total number of households falls in between the two. When you decreased the total in the controls, it returned a population with a correspondingly lower total.
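A quick way to spot this kind of mismatch before running the allocation is to compare the two totals directly. The snippet below is only a sketch in plain Python: the function name `totals_mismatch` and the 5% tolerance are assumptions made for illustration and are not part of doppelganger.

```python
def totals_mismatch(household_weights, control_counts, tolerance=0.05):
    """Compare the weighted PUMS household total with the control total.

    Hypothetical helper, not a doppelganger function. Returns the two
    totals, their relative gap, and whether the gap is within tolerance.
    """
    pums_total = sum(household_weights)
    control_total = sum(control_counts)
    relative_gap = abs(pums_total - control_total) / control_total
    return pums_total, control_total, relative_gap, relative_gap <= tolerance


# Totals reported in this thread, passed as one-element lists for brevity.
pums_total, control_total, gap, consistent = totals_mismatch([97841], [46945])
print(f"PUMS={pums_total} controls={control_total} "
      f"gap={gap:.0%} consistent={consistent}")
```

In practice the two arguments would be the household weight column of the PUMS table and the household count column of the marginals file.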

One option is to fix the input data so that the weights match the marginals (approximately is fine). The other option is to tune the internal parameter https://github.com/sidewalklabs/doppelganger/blob/master/doppelganger/allocation.py#L170 to make it converge to either the initial PUMS weights or the marginals. Hope this helps,

Alexei

gregmacfarlane commented 6 years ago

Interesting; so, if I understand correctly, doppelganger considers both the PUMA-weighted population in PUMS and the ACS tract-level population. It seems as though there will be times when we want to synthesize a population for only part of a PUMA, meaning that the controls should (for lack of a better word) control. Or, if we are looking at an alternative population scenario, changing both the weights and the controls is duplicative.
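As a toy illustration of that reading (this is not doppelganger's actual objective, and `penalty` is a made-up name that should not be equated with the gamma parameter in allocation.py), consider a single total t chosen to minimize (t - controls)^2 + penalty * (t - weights)^2. The minimizer is a weighted average of the two inputs, so it always lands between them:

```python
def compromise_total(weight_total, control_total, penalty):
    """Closed-form minimizer of
    (t - control_total)**2 + penalty * (t - weight_total)**2.

    Toy model only: a hypothetical stand-in for any objective that trades
    off fitting the controls against staying near the seed weights.
    """
    return (control_total + penalty * weight_total) / (1.0 + penalty)


# PUMS weighted total and control total from the first run in this thread.
for penalty in (0.0, 1.0, 100.0):
    total = compromise_total(97841, 46945, penalty)
    print(f"penalty={penalty:>5}: total={total:.0f}")
```

With a penalty of zero the controls fully determine the total; as the penalty grows, the total drifts toward the PUMS weights.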

I just cranked up the gamma parameter you identified (I will send you a PR exposing that parameter), but it of course has the consequence of substantially increasing the run time. Is there a specific reason why I would want to respect or consider the PUMA weights in this process, other than as a seed?

alexeisw commented 6 years ago

Sounds good. Re your question of why one would use the household weights as anything other than a seed: PUMS is a solid data source by itself, and in most applications it would be enough to just use the household expansion weights it provides to get a reasonable estimate of a population statistic that matches the totals at the PUMA level.

BTW, if you really need to run synthesis for tracts within a lower/higher level of allocation geography that does not match PUMA, you could possibly subsample the table, or scale household weights in the table proportionally to the expected totals for that area.
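The proportional scaling could look roughly like the sketch below. It is plain Python rather than anything from doppelganger, and `scale_weights` is a hypothetical helper; the idea is simply to multiply every household weight by the same factor so the weights sum to the expected total for the area.

```python
def scale_weights(household_weights, target_total):
    """Rescale weights proportionally so they sum to target_total.

    Hypothetical helper for illustration. Relative weights between
    households are preserved; only the overall total changes.
    """
    current_total = sum(household_weights)
    if current_total <= 0:
        raise ValueError("household weights must sum to a positive number")
    factor = target_total / current_total
    return [weight * factor for weight in household_weights]


# Made-up weights summing to 50, rescaled to an expected total of 25.
weights = [10.0, 25.0, 15.0]
scaled = scale_weights(weights, 25.0)
print(scaled, sum(scaled))
```

Applied to the example in this issue, the target would be the control total for the tracts being synthesized, and the input would be the PUMS household weight column.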

replicahq / doppelganger

Generated population doesn't match input controls #66