replicahq / doppelganger

A Python package of tools to support population synthesizers
Apache License 2.0
165 stars 32 forks source link

Accuracy #44

Closed nikisix closed 6 years ago

nikisix commented 7 years ago

Tests the difference between the pums-to-marginal data and the generated-to-marginal data. A few default metrics have been implemented and the class is extensible to additional metrics. Variables compared must be shared amongst the three sources and run through the Allocator.


This change is Reviewable

coveralls commented 7 years ago

Coverage Status

Coverage remained the same at 83.723% when pulling e10c225d90205e037dbcb67f168b049cc17ed647 on accuracy into 5b8d476810de0c41e7abe7493fe31a183fc991aa on master.

katbusch commented 7 years ago

Thanks Nick! Since this a big chunk of new lib code, I have first commented on the API itself and we can perfect that before moving on to reviewing the implementation.

Next time you're writing a big chunk of new lib code, I'd recommend just sending out the function signatures first, without the code actually written, to get feedback on that before you delve into the implementation


Review status: 0 of 6 files reviewed at latest revision, 5 unresolved discussions.


doppelganger/accuracy.py, line 77 at r1 (raw file):

        '''
        try:
            df_person = pd.read_csv(person_pums_filepath)

Instead of raw dataframes, can we use PumsData, Marginal, and Population classes? This will make the documentation easier to parse & make it easier (although not TONS easier) if we ever want to change the storage format or implementation in the future


doppelganger/accuracy.py, line 99 at r1 (raw file):


        variables = dict()
        if marginal_variables == 'all' or marginal_variables == ['all']:

Two things: 1) I'd prefer use_all_marginals as a separate boolean. Otherwise 'all' is a bit of a magic incantation that static analyzers can't recognize & we can't easily modify. 2) For python default params that are non-primitive types, like [] and {}, you have to set the default to None and then have marginal_variablies = marginal_variables or [] in your code, otherwise you'll get this behavior:

In [1]: def test_default_param(param=[]):
   ...:     param.append('test')
   ...:     return param
   ...: 

In [2]: test_default_param()
Out[2]: ['test']

In [3]: test_default_param()
Out[3]: ['test', 'test']

doppelganger/accuracy.py, line 177 at r1 (raw file):


    @staticmethod
    def error_report(state_puma, data_dir,

I recommend you separate the report logic from the reading logic. What if you want an error report straight from CSVs?


doppelganger/accuracy.py, line 188 at r1 (raw file):

                state_puma['29'] = ['00901', '00902']
            variables (list(str)): vars to run error on. must be defined in marginals.py
            statistic (str): must be an implemented error statistic:

Instead of having this as a param, what if we just give all three in the report?


doppelganger/accuracy.py, line 204 at r1 (raw file):

                accuracy = Accuracy.from_data_dir(state, puma, data_dir)
                df = accuracy._comparison_dataframe(variables)
                if statistic == 'mean_absolute_pct_error':

Should this code be using the mean_absolute_pct_error function?


Comments from Reviewable

katbusch commented 7 years ago

Review status: 0 of 6 files reviewed at latest revision, 6 unresolved discussions.


doppelganger/accuracy.py, line 31 at r1 (raw file):


class Accuracy(object):
    def __init__(self, person_pums, household_pums, marginals,

I like the design of storing this data and having error reports as non-mutating functions on the data!


Comments from Reviewable

nikisix commented 7 years ago

Good to know! I'll do that moving forward :)


Comments from Reviewable

nikisix commented 7 years ago

Review status: 0 of 6 files reviewed at latest revision, 6 unresolved discussions.


doppelganger/accuracy.py, line 99 at r1 (raw file):

Previously, katbusch (Kat Busch) wrote…
Two things: 1) I'd prefer use_all_marginals as a separate boolean. Otherwise 'all' is a bit of a magic incantation that static analyzers can't recognize & we can't easily modify. 2) For python default params that are non-primitive types, like `[]` and `{}`, you have to set the default to `None` and then have `marginal_variablies = marginal_variables or []` in your code, otherwise you'll get this behavior: ``` In [1]: def test_default_param(param=[]): ...: param.append('test') ...: return param ...: In [2]: test_default_param() Out[2]: ['test'] In [3]: test_default_param() Out[3]: ['test', 'test'] ```

I hear you on the magic-all. I'll add the additional boolean. Nothing in the code modifies marginal_variables. Is there no good way to supply a default list in the signature?


doppelganger/accuracy.py, line 177 at r1 (raw file):

Previously, katbusch (Kat Busch) wrote…
I recommend you separate the report logic from the reading logic. What if you want an error report straight from CSVs?

They're already separate in terms of the core class. The error_report is just a helper function that must do both to avoid loading more dataframes into memory simultaneously than necessary. If you want a report strait from a puma's worth of csv's, then use the (single puma) error metric calls. I hope the examples below will clarify things a bit..

Ex 1. Error Metric on an individual PUMA In this example, the doppelganger user is iterating through a set of pumas to generate synths, and then running validation right away in the inner loop.

six@labs ~/code/doppelganger [accuracy *]
± % python run/kc-run.py                                        !3308
Generating population for state and puma: 29 00901

Generation accuracy for state and puma: 29 00901
(0.28142118520924531, 0.1826337156575254)

Generating population for state and puma: 29 00902

Generation accuracy for state and puma: 29 00902
(0.25311483219466768, 0.18961370703110594)

Ex 2. Error Report on a group of PUMAs By contrast, in this example the user has already created a city's worth of synths, and wants a cohesive report on the entire city's generation efforts

Error by PUMA
                                    marginal-pums    marginal-generated
(state-id, puma-id)       0.248649            0.227985
(state-id, puma-id)       0.253973            0.197437
(state-id, puma-id)       0.282765            0.240902
(state-id, puma-id)       0.291534            0.238600
(state-id, puma-id)       0.255064            0.234489
(state-id, puma-id)       0.294397            0.216129
(state-id, puma-id)       0.242665            0.188342
(state-id, puma-id)       0.256504            0.197786
(state-id, puma-id)       0.281421            0.177945
(state-id, puma-id)       0.253115            0.188211
(state-id, puma-id)       0.212845            0.138446
(state-id, puma-id)       0.277927            0.300982
(state-id, puma-id)       0.301997            0.248907
(state-id, puma-id)       0.267541            0.191033
(state-id, puma-id)       0.284612            0.215475
(state-id, puma-id)       0.246920            0.192935
(state-id, puma-id)       0.246494            0.164353

Error by Variable
                    marginal-pums  marginal-generated
(num_people, 1)          0.076596            0.064411
(num_people, 3)          0.128732            0.109631
(num_people, 2)          0.049208            0.037398
(num_people, 4+)         0.097949            0.084882
(num_vehicles, 1)        0.281352            0.221551
(num_vehicles, 0)        1.436884            0.718955
(num_vehicles, 2)        0.305796            0.322032
(num_vehicles, 3+)       0.599720            0.611557
(age, 0-17)              0.068063            0.103417
(age, 18-34)             0.031325            0.131706
(age, 65+)               0.072755            0.059312
(age, 35-64)             0.026979            0.048060

Average: by PUMA, by Variable
marginal-pums         0.264613
marginal-generated    0.209409

doppelganger/accuracy.py, line 188 at r1 (raw file):

Previously, katbusch (Kat Busch) wrote…
Instead of having this as a param, what if we just give all three in the report?

Eventually, there could be many more than three error metrics implemented (the idea was easy extensibility), and may clutter the report with all possible error metrics.


doppelganger/accuracy.py, line 204 at r1 (raw file):

Previously, katbusch (Kat Busch) wrote…
Should this code be using the `mean_absolute_pct_error` function?

I definitely get where you're coming from. Can't right now b/c the "mean" is left off while building the pums and generated report dataframes. Then this mean is actually run on both axis, summarizing mean-by-variable and mean-by-puma error (Ex. 2 above). But I could make some small calc functions (pre-mean) and route both functions through them to make things a bit more cohesive.


Comments from Reviewable

nikisix commented 6 years ago

Review status: 0 of 7 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


doppelganger/accuracy.py, line 77 at r1 (raw file):

Previously, katbusch (Kat Busch) wrote…
Instead of raw dataframes, can we use PumsData, Marginal, and Population classes? This will make the documentation easier to parse & make it easier (although not TONS easier) if we ever want to change the storage format or implementation in the future

Done. see from_doppelganger


doppelganger/accuracy.py, line 99 at r1 (raw file):

Previously, nikisix (niki six) wrote…
I hear you on the magic-all. I'll add the additional boolean. Nothing in the code modifies marginal_variables. Is there no good way to supply a default list in the signature?

Done.


Comments from Reviewable

katbusch commented 6 years ago

Sorry for the delay here! Coming along nicely A few more comments.


Review status: 0 of 7 files reviewed at latest revision, 12 unresolved discussions.


doppelganger/accuracy.py, line 77 at r1 (raw file):

Previously, nikisix (niki six) wrote…
Done. see from_doppelganger

Lovely


doppelganger/accuracy.py, line 99 at r1 (raw file):

Nothing in the code modifies marginal_variables. Is there no good way to supply a default list in the signature?

Nope. But setting it to None and checking is a very standard python. Even if you don't modify it, somebody else might modify it in a future change and not realize that it's set in the parameters and will affect future calls. You could set it to () since that's immutable


doppelganger/accuracy.py, line 188 at r1 (raw file):

Previously, nikisix (niki six) wrote…
Eventually, there could be many more than three error metrics implemented (the idea was easy extensibility), and may clutter the report with all possible error metrics.

Okay, given that it's a param, let's make it an enum or constants defined as class variables in a separate class so that it's clear to the user which possible values they can pass to the function. For instance, it looks like you already changed the possible values & the documentation is out of date. If you just point to a class that lists options that won't happen


doppelganger/accuracy.py, line 17 at r2 (raw file):


FILE_PATTERN = 'state_{}_puma_{}_{}'
log = logging.getLogger('doppelganger.accuracy')

We don't usually set loggers in lib code. Each script running this might want its own logging configuration. Remove this.


doppelganger/accuracy.py, line 151 at r2 (raw file):

            variables['num_people'].remove('count')

        d = OrderedDict()

Avoid single-letter variable names. What is this dict supposed to hold? Name it for that


doppelganger/accuracy.py, line 178 at r2 (raw file):

        return pd.DataFrame(list(d.values()), index=d.keys(), columns=['pums', 'gen', 'marginal'])

    def root_mean_squared_error(self):

Since these are public functions, you should document their return values


doppelganger/accuracy.py, line 190 at r2 (raw file):


    def root_squared_error(self, verbose=False):
        '''Similar to rmse, but taking the mean at the end so that the error of individual variables

rmse: abbrev


doppelganger/accuracy.py, line 231 at r2 (raw file):

            verbose (boolean): display per-variable error (inert for rmse b/c of the inner mean)
        Returns:
            error by puma (dataframe): marginal-pums, marginal-generated

As a user of this API, I'm not sure how to use this dataframe. What columns are important?


doppelganger/accuracy.py, line 235 at r2 (raw file):

            mean error (dataframe): marginal-pums, marginal-generated
        '''
        d_mp = OrderedDict()  # dictionary of marginal to pums differences

avoid abbrevs whenever possible--what does d_mp mean?


Comments from Reviewable

nikisix commented 6 years ago

Review status: 0 of 7 files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.


doppelganger/accuracy.py, line 31 at r1 (raw file):

Previously, katbusch (Kat Busch) wrote…
I like the design of storing this data and having error reports as non-mutating functions on the data!

:)


doppelganger/accuracy.py, line 188 at r1 (raw file):

Previously, katbusch (Kat Busch) wrote…
Okay, given that it's a param, let's make it an enum or constants defined as class variables in a separate class so that it's clear to the user which possible values they can pass to the function. For instance, it looks like you already changed the possible values & the documentation is out of date. If you just point to a class that lists options that won't happen

Good call. Done.


doppelganger/accuracy.py, line 204 at r1 (raw file):

Previously, nikisix (niki six) wrote…
I definitely get where you're coming from. Can't right now b/c the "mean" is left off while building the pums and generated report dataframes. Then this mean is actually run on both axis, summarizing mean-by-variable and mean-by-puma error (Ex. 2 above). But I could make some small calc functions (pre-mean) and route both functions through them to make things a bit more cohesive.

It is now :)


doppelganger/accuracy.py, line 151 at r2 (raw file):

Previously, katbusch (Kat Busch) wrote…
Avoid single-letter variable names. What is this dict supposed to hold? Name it for that

Done.


doppelganger/accuracy.py, line 178 at r2 (raw file):

Previously, katbusch (Kat Busch) wrote…
Since these are public functions, you should document their return values

Good call. Done.


doppelganger/accuracy.py, line 190 at r2 (raw file):

Previously, katbusch (Kat Busch) wrote…
rmse: abbrev

Spelled it out.


doppelganger/accuracy.py, line 231 at r2 (raw file):

Previously, katbusch (Kat Busch) wrote…
As a user of this API, I'm not sure how to use this dataframe. What columns are important?

Added additional clarification about the columns of the returned data frames


doppelganger/accuracy.py, line 235 at r2 (raw file):

dictionary of marginal to pums differences

Renamed. Retaining d* vs df* to clarify dictionaries vs data frames


Comments from Reviewable

nikisix commented 6 years ago

Review status: 0 of 7 files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.


doppelganger/accuracy.py, line 17 at r2 (raw file):

Previously, katbusch (Kat Busch) wrote…
We don't usually set loggers in lib code. Each script running this might want its own logging configuration. Remove this.

Done.


Comments from Reviewable

katbusch commented 6 years ago
:lgtm:

Awesome, thanks for bearing with me through those rounds. Okay to merge as is (assuming tests pass) but I do have a few more suggestions. Please take a look at them and either change them now or keep them in mind for the future.


Review status: 0 of 6 files reviewed at latest revision, 7 unresolved discussions.


doppelganger/accuracy.py, line 204 at r1 (raw file):

Previously, nikisix (niki six) wrote…
It is now :)

Yay :)


doppelganger/accuracy.py, line 151 at r2 (raw file):

Previously, nikisix (niki six) wrote…
Done.

Hmm... I still don't really know what comparison is. What are the keys & values of the map?


doppelganger/accuracy.py, line 235 at r2 (raw file):

Previously, nikisix (niki six) wrote…
> # dictionary of marginal to pums differences Renamed. Retaining d_* vs df_* to clarify dictionaries vs data frames

pums_diffs_by_marginals?


doppelganger/accuracy.py, line 165 at r3 (raw file):

                        generated_households[generated_households[variable] == bin].count()[0]
                    )
                elif variable == 'num_vehicles':

I just noticed. Isn't the code in all 3 of these if statements the same? Can it just be if variable in ('age', 'num_people', 'num_vehicles'):?


doppelganger/scripts/download_allocate_generate.py, line 215 at r3 (raw file):


        controls = Marginals.from_csv(marginal_path)
    except Exception:  # Download marginal data from the Census API

May want to catch a more specific exception here. Is this for when the file is missing? If the file is, say, badly formatted we might want to raise that issue to the user. In fact we probably want to log the error either way. Not needed in this diff


Comments from Reviewable

nikisix commented 6 years ago

My pleasure. Will do.


Review status: 0 of 6 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


doppelganger/accuracy.py, line 151 at r2 (raw file):

Previously, katbusch (Kat Busch) wrote…
Hmm... I still don't really know what comparison is. What are the keys & values of the map?

This is the comparison data frame. a column for pums, doppelganger-generated synths, and a column for the marginal-controls. all of the statistics operate on this data frame.

(num_people, 1)      8584   7877      8132
(num_people, 3)      7142   6911      6121
(num_people, 2)     12218  12481     13753
(num_people, 4+)    11863  11683     10775
(num_vehicles, 1)    9310   8816      7262
(num_vehicles, 0)    3591    765       457
(num_vehicles, 2)   19130  18934     25777
(num_vehicles, 3+)  10584  10437     20119
(age, 0-17)         32863  31368     30130
(age, 18-34)        22476  18070     21490
(age, 65+)          13997  12662     12568
(age, 35-64)        45626  44232     45501

doppelganger/accuracy.py, line 235 at r2 (raw file):

Previously, katbusch (Kat Busch) wrote…
`pums_diffs_by_marginals`?

i like diff_marginal_pums and diff_marginal_doppelganger if that's amenable.

Sorry about the large heading font, that was a total markdown character surprise, def not shouting or anything like that :peace_symbol:


doppelganger/accuracy.py, line 165 at r3 (raw file):

Previously, katbusch (Kat Busch) wrote…
I just noticed. Isn't the code in all 3 of these if statements the same? Can it just be `if variable in ('age', 'num_people', 'num_vehicles'):`?

It should actually be condensed down to an if and elif, but it's a little more complicated than that b/c you have know which variables correspond to households and which to individuals. Which can be done from inputs.py, but it was just as short to write this way. However, this could def be upgraded going forward, especially when we start adding more controls.


doppelganger/scripts/download_allocate_generate.py, line 215 at r3 (raw file):

Previously, katbusch (Kat Busch) wrote…
May want to catch a more specific exception here. Is this for when the file is missing? If the file is, say, badly formatted we might want to raise that issue to the user. In fact we probably want to log the error either way. Not needed in this diff

It's true. I may have changed this up a bit in some other branches.. I'll take a look though


Comments from Reviewable

katbusch commented 6 years ago

Reviewed 1 of 6 files at r1. Review status: 1 of 6 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


doppelganger/accuracy.py, line 151 at r2 (raw file):

Previously, nikisix (niki six) wrote…
This is the comparison data frame. a column for pums, doppelganger-generated synths, and a column for the marginal-controls. all of the statistics operate on this data frame. ``` pums gen marginal (num_people, 1) 8584 7877 8132 (num_people, 3) 7142 6911 6121 (num_people, 2) 12218 12481 13753 (num_people, 4+) 11863 11683 10775 (num_vehicles, 1) 9310 8816 7262 (num_vehicles, 0) 3591 765 457 (num_vehicles, 2) 19130 18934 25777 (num_vehicles, 3+) 10584 10437 20119 (age, 0-17) 32863 31368 30130 (age, 18-34) 22476 18070 21490 (age, 65+) 13997 12662 12568 (age, 35-64) 45626 44232 45501 ```

got it, I'd suggest a comment explaining that's what the variable holds


doppelganger/accuracy.py, line 165 at r3 (raw file):

Previously, nikisix (niki six) wrote…
It should actually be condensed down to an if and elif, but it's a little more complicated than that b/c you have know which variables correspond to households and which to individuals. Which can be done from inputs.py, but it was just as short to write this way. However, this could def be upgraded going forward, especially when we start adding more controls.

Since the code is all 3 is the same afaict you should be able to replace with a single if


Comments from Reviewable

nikisix commented 6 years ago

Review status: 1 of 6 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


doppelganger/accuracy.py, line 151 at r2 (raw file):

Previously, katbusch (Kat Busch) wrote…
got it, I'd suggest a comment explaining that's what the variable holds

Done.


doppelganger/accuracy.py, line 165 at r3 (raw file):

Previously, katbusch (Kat Busch) wrote…
Since the code is all 3 is the same afaict you should be able to replace with a single if

Did you notice that we're pulling from person_pums and generated_persons vs household_pums vs generated_households depending on the variable?

                    comparison[(variable, bin)].append(
                        person_pums[person_pums[variable] == bin].person_weight.sum())
                    comparison[(variable, bin)].append(
                        generated_persons[generated_persons[variable] == bin].count()[0]
                    )
                elif variable == 'num_people':
                    comparison[(variable, bin)].append(household_pums[
                        household_pums[variable] == bin].household_weight.sum())
                    comparison[(variable, bin)].append(
                        generated_households[generated_households[variable] == bin].count()[0]
                    )

Comments from Reviewable

katbusch commented 6 years ago

Review status: 1 of 6 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


doppelganger/accuracy.py, line 165 at r3 (raw file):

Previously, nikisix (niki six) wrote…
Did you notice that we're pulling from person_pums and generated_persons vs household_pums vs generated_households depending on the variable? ``` if variable == 'age': comparison[(variable, bin)].append( person_pums[person_pums[variable] == bin].person_weight.sum()) comparison[(variable, bin)].append( generated_persons[generated_persons[variable] == bin].count()[0] ) elif variable == 'num_people': comparison[(variable, bin)].append(household_pums[ household_pums[variable] == bin].household_weight.sum()) comparison[(variable, bin)].append( generated_households[generated_households[variable] == bin].count()[0] ) ```

Nope. Definitely didn't notice that.


Comments from Reviewable

nikisix commented 6 years ago

Review status: 1 of 6 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


doppelganger/accuracy.py, line 151 at r2 (raw file):

Previously, nikisix (niki six) wrote…
Done.

In the method description now.


doppelganger/accuracy.py, line 165 at r3 (raw file):

Previously, katbusch (Kat Busch) wrote…
Nope. Definitely didn't notice that.

:smile: No doubt it can be spruced up though.


Comments from Reviewable