Open gregmacfarlane opened 6 years ago
ccing my colleague @josiekre
Hi Greg, Hi Josie,
the data you are feeding into the algorithm is inconsistent, in the sense that the household weights do not sum (not even approximately) to the same value as the total number of households in the marginals data. As a result, the doppelganger allocation algorithm returns a population whose total number of households falls in between the two. When you decreased the total in the controls, it returned a population with a correspondingly lower total.
One option is to fix the input data so that the weights match the marginals (an approximate match is fine). The other option is to tune the internal parameter https://github.com/sidewalklabs/doppelganger/blob/master/doppelganger/allocation.py#L170 to make it converge to either the initial PUMS weights or the marginals. Hope this helps,
Alexei
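To make the "in between" behaviour concrete, here is a toy one-variable sketch. It is not doppelganger's actual objective: the quadratic penalty, the function name `blended_total`, and the way `gamma` enters are all assumptions chosen for illustration. It picks a single household total that balances staying near the PUMS weighted total against matching the marginal total, with `gamma` as the weight placed on the marginals.

```python
def blended_total(pums_total, marginal_total, gamma):
    """Minimize (x - pums_total)**2 + gamma * (x - marginal_total)**2 over x.

    Toy model only (an assumed objective, not the library's). The
    closed-form minimizer is a weighted average of the two totals.
    """
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    return (pums_total + gamma * marginal_total) / (1.0 + gamma)


# Totals quoted in this thread: PUMS weighted households and tract marginals.
pums_total = 97841
marginal_total = 46945

# A larger gamma pulls the result toward the marginals, a smaller one
# toward the PUMS weights.
for gamma in (0.0, 1.0, 10.0, 1000.0):
    x = blended_total(pums_total, marginal_total, gamma)
    print(f"gamma={gamma}: total ~ {x:,.0f}")
```

In this toy model any finite `gamma` gives a total strictly between the two inputs, which is the qualitative pattern described above. The real solver works on per-household weights and per-tract controls, so the actual numbers will differ.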
Interesting; so, if I understand correctly, doppelganger considers both the PUMA-weighted population in PUMS and the ACS tract-level population. It seems as though there will be times when we want to synthesize a population for only part of a PUMA, meaning that the controls should (for lack of a better word) control. Or, if we are looking at an alternative population scenario, changing both the weights and the controls is duplicative.
I just cranked up the gamma parameter you identified (I will send you a PR exposing that parameter), but it of course has the consequence of substantially increasing the run time. Is there a specific reason why I would want to respect or consider the PUMA weights in this process, other than as a seed?
Sounds good. Re your question about why one would use the household weights as anything other than a seed: PUMS is a solid data source by itself, and in most applications it is enough to use the household expansion weights it provides to get a reasonable estimate of a population statistic that matches the totals at the PUMA level.
BTW, if you really need to run synthesis for tracts within a lower/higher level of allocation geography that does not match the PUMA, you could possibly subsample the table, or scale the household weights in the table proportionally to the expected totals for that area.
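The proportional scaling suggested above could look roughly like the sketch below. The helper name `scale_weights` and the sample numbers are hypothetical; in real PUMS data the household weights would come from the household weight column (`WGTP`), and the target would be the expected household total for the area being synthesized.

```python
def scale_weights(weights, target_total):
    """Rescale household weights proportionally so they sum to target_total."""
    current_total = sum(weights)
    if current_total <= 0:
        raise ValueError("weights must sum to a positive value")
    factor = target_total / current_total
    return [w * factor for w in weights]


# Hypothetical household weights for a handful of PUMS records.
weights = [12.0, 30.0, 8.0, 50.0]
# Hypothetical expected household total for the sub-area of interest.
target_total = 40.0

scaled = scale_weights(weights, target_total)
print(scaled)
print(sum(scaled))
```

Scaling keeps the relative weight of each household record unchanged and only adjusts the overall total, so it assumes the sub-area has roughly the same household mix as the full PUMA.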
Running the example notebook with the included marginals control file, I noticed a discrepancy between the `num_people_count` field and the resulting number of synthetic households. The total number of households in tracts contained in the marginals file, `controls.data['num_people_count'].sum()`, is 46,945. (The name of this field is also somewhat misleading, because ACS table B11016 is a table of households by number of people, not the number of people, but that's not the issue here.) When I generate the population for the PUMA in the example, the resulting population, `population.generated_households['household_id'].count()`, is 73,644. BTW, the total weighted households in the PUMS data is 97,841.
I wanted to see if this error was sensitive to the marginals file. So I deleted all but the first nine tracts in the file, whittling the number of households in the included tracts to 16,889. In this case, doppelganger returned a population with 54,421 households.
Is there an additional step where I need to downsample the synthetic population to match the marginal targets? Is there something that I don't understand? I've included my script in this gist; I used the most recent commit on master, running in Python 3.
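A quick pre-flight check on the inputs would flag this kind of mismatch before running the synthesis. The sketch below is only an illustration: the helper names and the 5% tolerance are assumptions, and the totals are the ones quoted in this issue.

```python
def relative_gap(a, b):
    """Relative difference between two totals, measured against the larger one."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 0.0
    return abs(a - b) / largest


def totals_consistent(weighted_total, marginal_total, tolerance=0.05):
    """True if the two totals agree within an (assumed) relative tolerance."""
    return relative_gap(weighted_total, marginal_total) <= tolerance


# Totals reported in this issue.
pums_weighted_households = 97841
marginal_households = 46945
synthetic_households = 73644

gap = relative_gap(pums_weighted_households, marginal_households)
print(f"relative gap between PUMS weights and marginals: {gap:.1%}")
print("consistent:", totals_consistent(pums_weighted_households, marginal_households))
```

With these numbers the PUMS weights and the marginals differ by roughly half, and the synthetic total sits between them.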