replicahq / doppelganger

A Python package of tools to support population synthesizers
Apache License 2.0

Modifying inputs.py #28

Closed: HLC-CUUATS closed 6 years ago

HLC-CUUATS commented 7 years ago

Hi everyone,

We are currently using doppelganger on our own data for our region. The example works with our data, and the generated household table is exactly what we need. The only problem is that we need discrete numbers for some of the categories, in our case household_income and num_people (some of the values are categorical, but we need specific numbers).

We have downloaded the most recent version of doppelganger and have been using it via Jupyter Notebook. The doppelganger full example mentions editing inputs.py to adjust output variables. After modifying the inputs.py file and running the example, we noticed the outputs do not change at all. Are we supposed to modify the inputs.py file within the download we have, or is there another inputs.py that we should be working with?

To clarify, we have our doppelganger location at 'C:\Users\someUser\doppelgangerCU' and we've been modifying the inputs at 'C:\Users\someUser\doppelgangerCU\doppelganger\inputs.py'.

We'd appreciate any help, thanks!

HLC-CUUATS commented 7 years ago

Update: We have figured out our issue with editing the inputs file (we had to delete our current install and reinstall from the specific location we were working with).
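For anyone hitting the same symptom, a quick way to see which copy of a module Python actually loads is to ask the import system for its file path. This is a minimal sketch using only the standard library; `module_origin` is a hypothetical helper name, and `json` stands in for `doppelganger.inputs` so the snippet runs without the package installed.

```python
import importlib.util


def module_origin(module_name):
    """Return the file path Python would load for module_name, or None if not found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is not installed.
        return None
    return spec.origin if spec is not None else None


# Swap "json" for "doppelganger.inputs" in an environment where the package is installed.
print(module_origin("json"))
```

If the printed path points into site-packages rather than the working copy being edited, changes to the working copy will not take effect until the package is reinstalled from that location.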

We could still use some help on how to modify inputs.py to get more discrete values for household_income and num_people.

katbusch commented 7 years ago

Glad to hear you're using doppelganger!

To modify the discretization of num_people you can modify this function: https://github.com/sidewalklabs/doppelganger/blob/master/doppelganger/inputs.py#L84

You should be able to modify the discretization of individual_income or household_income just by changing the bins in your config: https://github.com/sidewalklabs/doppelganger/blob/master/examples/sample_data/config.json

Let me know if this helps!

HLC-CUUATS commented 7 years ago

@katbusch Thanks for the reply!

We've been modifying our income range by changing the bins in our config; we set our income bins to increment by 10000. The issue we've been having is that a good amount of our generated household output data is <=0 (~37k/78k). Have you seen examples of this problem before?

alexeisw commented 6 years ago

It is likely that your input data does not have samples of household incomes within the ranges of all the bins you requested, i.e. with 10k increments a good number of the bins may simply have no input data, eventually causing this kind of output.
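One way to check this is to count how many input samples land in each requested bin. The sketch below is standalone Python and does not use doppelganger itself; the half-open bin convention, the `count_per_bin` helper, and the sample incomes are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter


def count_per_bin(values, edges):
    """Count values in each half-open bin [edges[i], edges[i+1]).

    Values outside [edges[0], edges[-1]) are skipped.
    """
    counts = Counter()
    for v in values:
        if edges[0] <= v < edges[-1]:
            counts[bisect_right(edges, v) - 1] += 1
    return [counts[i] for i in range(len(edges) - 1)]


# Made-up incomes, not real survey data.
incomes = [0, 0, 12000, 15500, 48000, 52000, 91000]
edges = list(range(0, 100001, 10000))  # 10k increments from 0 to 100k

counts = count_per_bin(incomes, edges)
empty = [(edges[i], edges[i + 1]) for i, c in enumerate(counts) if c == 0]
print(counts)
print("bins with no input samples:", empty)
```

Bins that show up in `empty` have no training samples, so widening or merging them may be worth trying.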

katbusch commented 6 years ago

@alexeisw I believe they're referring to too many 0-income households.

@HLC-CUUATS, is the % of zero-income households generated different from the % in the training data? I believe one current issue with Doppelganger is that if the training data is missing income information for a household, that household will be counted as a zero-income household. So I would expect the % of zeros in the output to be the % of zeros in your training data + the % of rows with no income data in your training data.
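That expectation can be checked directly against the training data. Here is a minimal sketch, assuming missing incomes are represented as `None`; the `expected_zero_share` helper and the sample values are hypothetical.

```python
def expected_zero_share(incomes):
    """Return (zero share, missing share, combined share) for a list of incomes.

    Missing incomes are represented as None. The combined share is what to
    expect if missing values end up being treated as zero income.
    """
    total = len(incomes)
    zeros = sum(1 for x in incomes if x == 0)
    missing = sum(1 for x in incomes if x is None)
    return zeros / total, missing / total, (zeros + missing) / total


# Made-up training incomes: 2 true zeros and 4 missing values out of 10 rows.
training = [0, None, 25000, None, 40000, 0, None, 61000, 18000, None]
zero_share, missing_share, combined = expected_zero_share(training)
print(f"zeros: {zero_share:.0%}, missing: {missing_share:.0%}, expected in output: {combined:.0%}")
```

If the combined share is close to the share of zero-income households in the generated output, missing data is the likely cause.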

HLC-CUUATS commented 6 years ago

@katbusch I double-checked the input data we were using and that seems to be the problem. We have a good number of missing incomes, proportional to the number of zeros we are getting from doppelganger. We will most likely use ACS median values based on tract and household size for the missing numbers. Thanks for your help!
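A rough sketch of that kind of fill-in, assuming the median values are available as a lookup keyed by (tract, household size). The field names `tract`, `num_people`, and `household_income`, the `fill_missing_income` helper, and all values shown are hypothetical placeholders rather than real ACS figures.

```python
def fill_missing_income(rows, medians):
    """Return copies of rows with missing household_income filled from medians.

    medians maps (tract, num_people) to a median income. Rows whose key is
    not in the lookup keep household_income as None.
    """
    filled = []
    for row in rows:
        if row["household_income"] is None:
            key = (row["tract"], row["num_people"])
            row = {**row, "household_income": medians.get(key)}
        filled.append(row)
    return filled


# Placeholder lookup and records for illustration only.
medians = {("tract_a", 2): 52000, ("tract_b", 4): 71000}
rows = [
    {"tract": "tract_a", "num_people": 2, "household_income": None},
    {"tract": "tract_b", "num_people": 4, "household_income": 64000},
    {"tract": "tract_c", "num_people": 1, "household_income": None},
]
result = fill_missing_income(rows, medians)
print(result)
```

Rows that remain `None` after the fill would still need separate handling before they are passed on as training data.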