replicahq / doppelganger

A Python package of tools to support population synthesizers
Apache License 2.0

Add work-status and educational-attainment nodes to person model #39

Closed nikisix closed 6 years ago

nikisix commented 6 years ago

Side note: I temporarily removed segmentation because the model was growing in size. In general, do we want to use segmentation for first-round pop-synths? If so, what should we segment on, if not the default age and household size?

Do we need to update populationgen.Population.generate to use other fields, or are the person.age, person.sex and household.num_people fields sufficient?

Example model output from running on PUMA 29-00901: models.zip

Here's my first guess at the expanded model person_bn.json:

{
  "type": "person",
  "nodes": [
    "age",
    "sex",
    "income",
    "working",
    "education"
  ],
  "edges": {
    "age": [
      "income"
    ],
    "sex": [
      "income",
      "working"
    ],
    "working": [
      "income"
    ],
    "education": [
      "income"
    ]
  }
}

Can also see an argument for an arrow from education (attainment) to working (work status)
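
A minimal sketch of what adding that arrow would look like, assuming each key under "edges" is a parent and the listed nodes are its children (the thread does not state the direction). The `is_acyclic` helper is hypothetical and only checks that the extra edge keeps the graph a valid DAG:

```python
import json

# Structure copied from the proposal above.
# Assumption: "edges" maps a parent node to the list of its children.
structure = {
    "type": "person",
    "nodes": ["age", "sex", "income", "working", "education"],
    "edges": {
        "age": ["income"],
        "sex": ["income", "working"],
        "working": ["income"],
        "education": ["income"],
    },
}


def is_acyclic(nodes, edges):
    """Kahn's algorithm: return True if the directed graph has no cycle."""
    indegree = {node: 0 for node in nodes}
    for children in edges.values():
        for child in children:
            indegree[child] += 1
    ready = [node for node in nodes if indegree[node] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for child in edges.get(node, []):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return visited == len(nodes)


# Add the proposed education -> working arrow and confirm the graph is still a DAG.
structure["edges"]["education"].append("working")
assert is_acyclic(structure["nodes"], structure["edges"])
print(json.dumps(structure["edges"], indent=2))
```

Under this reading, working would then have two parents (sex and education), so its conditional table grows accordingly.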



katbusch commented 6 years ago

Thanks for adding these!

removed segmentation due to growing model size

Can you explain exactly what you mean by this?


Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


doppelganger/inputs.py, line 103 at r1 (raw file):

        return 'under-16'
    if code == '1' or code == '2' or code == '4' or code == '5':
        return 'Y'

How about something a little more descriptive than 'Y' and 'N'? 'employed' and 'unemployed'? And would be great if those were defined like:

class EmploymentStatus(object):
  UNDER_16 = 'under16'
  UNEMPLOYED = 'unemployed'
...

doppelganger/inputs.py, line 153 at r1 (raw file):

        return 'bachelors-degree'
    if code == '22' or code == '23' or code == '24':
        return 'advanced-degree'

Same here for constant definitions


Comments from Reviewable

nikisix commented 6 years ago

Definitely. I'm passing None as my segmentation functions (person_segmenter below) to create_bayes_net. I.e.:

    person_training_data = SegmentedData.from_data(
        cleaned_data=persons_data,
        fields=list(configuration.person_fields),
        weight_field=inputs.PERSON_WEIGHT.name,
        segmenter=person_segmenter
    )
    person_model = BayesianNetworkModel.train(
        input_data=person_training_data,
        structure=configuration.person_structure,
        fields=configuration.person_fields
    )

I was wondering which segmenters, if any, make the most sense for our KC synth.


Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


doppelganger/inputs.py, line 103 at r1 (raw file):

Previously, katbusch (Kat Busch) wrote…
How about something a little more descriptive than 'Y' and 'N'? 'employed' and 'unemployed'? And would be great if those were defined like:

class EmploymentStatus(object):
  UNDER_16 = 'under16'
  UNEMPLOYED = 'unemployed'
...

This was prescient as I'm in the process of doing that for some other variables right now as well :) . Good suggestion!



nikisix commented 6 years ago

Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


doppelganger/inputs.py, line 103 at r1 (raw file):

Previously, nikisix (niki six) wrote…
This was prescient as I'm in the process of doing that for some other variables right now as well :) . Good suggestion!

Also, I'm going with '(not)working', because '(un)employed' has a strict census definition and is technically a subset of what's in working.
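
A minimal sketch of the constants pattern with the working / not-working naming, not the code that was merged. The class name `WorkStatus`, the values 'working' and 'not-working', and the set of codes treated as under 16 are assumptions for illustration; only the codes '1', '2', '4', '5' and the 'under-16' value come from the snippet under review:

```python
class WorkStatus(object):
    """Hypothetical constant names for the work-status categories."""
    UNDER_16 = 'under-16'        # value taken from the reviewed snippet
    WORKING = 'working'          # assumed spelling
    NOT_WORKING = 'not-working'  # assumed spelling


# Codes the reviewed snippet maps to the "working" bucket.
WORKING_CODES = frozenset(['1', '2', '4', '5'])

# Assumption: a blank or 'b' code marks people under 16; the thread elides this condition.
UNDER_16_CODES = frozenset(['', 'b'])


def work_status(code):
    """Map a raw work-status code (a string) to a WorkStatus constant."""
    if code in UNDER_16_CODES:
        return WorkStatus.UNDER_16
    if code in WORKING_CODES:
        return WorkStatus.WORKING
    return WorkStatus.NOT_WORKING


print(work_status('1'), work_status('3'), work_status('b'))
```

Keeping the code sets next to the constants means the category strings appear in exactly one place, which is the point of the review suggestion.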



katbusch commented 6 years ago

Ask @alexeisw which segmentations he'd like to see.

Still not sure what you mean by "growing model size". Is segmentation making it slow?

It's okay to send None by just leaving the segmenter argument out.
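
To illustrate why omitting the segmenter and passing one that always returns None can be equivalent, here is a toy model of segmentation. `segment_rows` is a hypothetical helper written for this example and is not the library's implementation:

```python
from collections import defaultdict


def segment_rows(rows, segmenter=None):
    """Group rows by segmenter(row); with no segmenter, all rows share one segment."""
    segments = defaultdict(list)
    for row in rows:
        key = segmenter(row) if segmenter is not None else None
        segments[key].append(row)
    return dict(segments)


rows = [{'age': '0-17'}, {'age': '18-34'}, {'age': '18-34'}]

# No segmenter at all: a single segment keyed by None.
unsegmented = segment_rows(rows)

# A segmenter that always returns None produces the same single segment.
constant_none = segment_rows(rows, segmenter=lambda row: None)

# Segmenting on age splits the data into one group per age bucket.
by_age = segment_rows(rows, segmenter=lambda row: row['age'])

print(len(unsegmented), len(constant_none), len(by_age))
```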


Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.



nikisix commented 6 years ago
  1. Not slow, but the Bayesian model output grows because it has to keep track of many more conditional probabilities when you segment. I was just keeping it more interpretable so I can analyze the outputs more easily in the meantime.

  2. I'll be sure to ask @alexeisw what we're going with for the production run though.

  3. Yep, ultimately, I'd like to allow users to pass a function in or at least define their own in the runner script. Still thinking it through; ideas welcome.
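
A rough back-of-the-envelope sketch of the growth described in point 1. The category counts are made up, the edges are read as parent to children, and the model is assumed to keep one full set of conditional tables per segment; none of these are confirmed by the thread:

```python
from functools import reduce
from operator import mul

# Hypothetical number of categories per node.
cardinality = {'age': 4, 'sex': 2, 'income': 5, 'working': 3, 'education': 4}

# Edges from the proposed person_bn.json, read as parent -> children (an assumption).
edges = {
    'age': ['income'],
    'sex': ['income', 'working'],
    'working': ['income'],
    'education': ['income'],
}


def parents_of(node):
    """Return the nodes that list `node` as a child."""
    return [parent for parent, children in edges.items() if node in children]


def cpt_entries(node):
    """Entries in one conditional table: parent combinations times the node's categories."""
    parent_combos = reduce(mul, (cardinality[p] for p in parents_of(node)), 1)
    return parent_combos * cardinality[node]


def model_entries(num_segments=1):
    """Total entries, assuming one full set of tables per segment."""
    return num_segments * sum(cpt_entries(node) for node in cardinality)


print(model_entries(), model_entries(num_segments=4))
```

Under these assumptions the income table dominates, and every additional segment multiplies the whole output, which is why the unsegmented model is easier to inspect by hand.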


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.



coveralls commented 6 years ago

Coverage Status

Coverage decreased (-1.6%) to 78.928% when pulling 5efec2ca646b26f13a9dd93be1fec3468b4f9a29 on work_status into cdbbc8ebedcf9f444b1ff764ddcf1965a6a2f508 on master.

katbusch commented 6 years ago

I see! That makes sense. A couple more comments


Review status: 0 of 2 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


doppelganger/inputs.py, line 126 at r2 (raw file):

def educational_attainment(code):
    ''' Educational attainment (SCHL)
        bb .N/A (less than 3 years old)

Can you explain that these are the PUMS codes (and maybe note the year, if they're allowed to change)?


doppelganger/scripts/download_allocate_generate.py, line 24 at r2 (raw file):


def person_segmenter(x): return None  # x[inputs.AGE.name]

So instead of changing these to return None, you can just remove them completely and you'll get the same effect



katbusch commented 6 years ago

Oh and please add tests for the new code!
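
A minimal sketch of what such a test could look like, using the standard `unittest` module. The `educational_attainment` function here is a stand-in that reproduces only the branch quoted earlier in this review; the real function has more branches that are not shown in the thread, and the test class name is hypothetical:

```python
import unittest


def educational_attainment(code):
    """Stand-in for the function under review; only the quoted branch is reproduced."""
    if code == '22' or code == '23' or code == '24':
        return 'advanced-degree'
    return None  # the remaining branches are elided in the thread


class EducationalAttainmentTest(unittest.TestCase):
    def test_advanced_degree_codes(self):
        for code in ('22', '23', '24'):
            self.assertEqual(educational_attainment(code), 'advanced-degree')

    def test_other_code_is_not_advanced_degree(self):
        self.assertNotEqual(educational_attainment('21'), 'advanced-degree')


# Run the suite explicitly so the sketch works both as a script and when imported.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(EducationalAttainmentTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In the repository itself the tests would import the real functions from doppelganger/inputs.py rather than redefining them.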


Review status: 0 of 2 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.



coveralls commented 6 years ago

Coverage Status

Coverage decreased (-1.6%) to 78.928% when pulling f16d731a85cf225d0a397d28cc23c9c26e65a689 on work_status into cdbbc8ebedcf9f444b1ff764ddcf1965a6a2f508 on master.

nikisix commented 6 years ago

Review status: 0 of 2 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


doppelganger/inputs.py, line 153 at r1 (raw file):

Previously, katbusch (Kat Busch) wrote…
Same here for constant definitions

Done.


doppelganger/inputs.py, line 126 at r2 (raw file):

Previously, katbusch (Kat Busch) wrote…
Can you explain that these are the PUMS codes (and maybe note the year, if they're allowed to change)?

Good call! Done.


doppelganger/scripts/download_allocate_generate.py, line 24 at r2 (raw file):

Previously, katbusch (Kat Busch) wrote…
So instead of changing these to return None, you can just remove them completely and you'll get the same effect

Got feedback from Alexei, we're keeping them in. Cleaned up the None vals and comments



katbusch commented 6 years ago

Ping on tests :)


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


doppelganger/scripts/download_allocate_generate.py, line 24 at r2 (raw file):

Previously, nikisix (niki six) wrote…
Got feedback from Alexei, we're keeping them in. Cleaned up the None vals and comments

Why are you keeping them in? The functionality will remain the same if you remove these functions and the code will be cleaner



nikisix commented 6 years ago

Yep, just finished writing the allocator's test for the marginal-vehicles branch. Diving into this guy today.


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


doppelganger/scripts/download_allocate_generate.py, line 24 at r2 (raw file):

Previously, katbusch (Kat Busch) wrote…
Why are you keeping them in? The functionality will remain the same if you remove these functions and the code will be cleaner

I should clarify, Alexei said we will want to employ segmentation for now at least. So I'm back to the original: def person_segmenter(x): return x[inputs.AGE.name]



nikisix commented 6 years ago

Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


doppelganger/scripts/download_allocate_generate.py, line 24 at r2 (raw file):

Previously, nikisix (niki six) wrote…
I should clarify, Alexei said we will want to employ segmentation for now at least. So I'm back to the original: `def person_segmenter(x): return x[inputs.AGE.name]`

You can find this change on the marginal-vehicles branch



coveralls commented 6 years ago

Coverage Status

Coverage increased (+1.6%) to 82.192% when pulling 2d38e86ab63076a568cb0306166aaefdbfbd57e9 on work_status into cdbbc8ebedcf9f444b1ff764ddcf1965a6a2f508 on master.

katbusch commented 6 years ago
LGTM

Review status: 0 of 4 files reviewed at latest revision, all discussions resolved.

