replicahq / doppelganger

A Python package of tools to support population synthesizers
Apache License 2.0
165 stars 32 forks source link

explicit dtypes to preserve leading zeros for PUMA and state codes #61

Closed martibosch closed 6 years ago

martibosch commented 6 years ago

explicit handling of DataFrame dtypes - PUMA and state codes as strings to preserve leading zeros allow filtering by PUMA and/or state separately Solves #59, #60


This change is Reviewable

katbusch commented 6 years ago

Thank you so much for doing this! A few small comments but looking good to go very soon! I believe between this and #62 both example notebooks should work completely


Review status: 0 of 6 files reviewed at latest revision, 5 unresolved discussions.


doppelganger/datasource.py, line 31 at r1 (raw file):

        cleaned_data = preprocessor.process_dataframe(self.data, field_names, self.name_map)
        if state is not None or puma is not None:
            query_idx = ~cleaned_data.index.isnull()

What's the purpose of this line?


doppelganger/datasource.py, line 35 at r1 (raw file):

                query_idx = query_idx & (cleaned_data[inputs.STATE.name].astype(str) == str(state))
            if puma is not None:
                query_idx = query_idx & (cleaned_data[inputs.PUMA.name].astype(str) == str(puma))

Nit: we don't use abbrevs. query_idx --> query_index


doppelganger/datasource.py, line 67 at r1 (raw file):


    @staticmethod
    def from_csv(infile, dtype=None):

Can you add documentation for what dtype is as you do below?


doppelganger/datasource.py, line 71 at r1 (raw file):

            from pandas.compat import text_type
            # Code fields as str to keep leading zeros
            dtype = {

Nit: should be on three lines like so

dtype =  {
    column: text_type for column in [inputs.PUMA.pums_name, inputs.STATE.pums_name]
}

doppelganger/marginals.py, line 187 at r1 (raw file):

            state (unicode): state fips code (2-digit)
            puma (unicode): puma code (5-digit)
            dtype (dict): pandas dtype dict

What types are the keys and values in the dict? See here for an example of dict typing: https://github.com/sidewalklabs/doppelganger/blob/master/doppelganger/bayesnets.py#L128


Comments from Reviewable

martibosch commented 6 years ago

Review status: 0 of 6 files reviewed at latest revision, 5 unresolved discussions.


doppelganger/datasource.py, line 31 at r1 (raw file):

Previously, katbusch (Kat Busch) wrote…
What's the purpose of this line?

the purpose is to get an array of True values with the same lenght of cleaned_data in order to apply further filters, since at that point we do not know if onlystate, onlypumaor both are provided. I must say now that we're reviewing it, that it's simplier (and has better performance) to usenp.ones(len(cleaned_data), dtype=bool)`. Sorry about that.


doppelganger/datasource.py, line 35 at r1 (raw file):

Previously, katbusch (Kat Busch) wrote…
Nit: we don't use abbrevs. query_idx --> query_index

Noted.


doppelganger/datasource.py, line 67 at r1 (raw file):

Previously, katbusch (Kat Busch) wrote…
Can you add documentation for what dtype is as you do below?

Noted


doppelganger/datasource.py, line 71 at r1 (raw file):

Previously, katbusch (Kat Busch) wrote…
Nit: should be on three lines like so ``` dtype = { column: text_type for column in [inputs.PUMA.pums_name, inputs.STATE.pums_name] } ```

Noted


doppelganger/marginals.py, line 187 at r1 (raw file):

Previously, katbusch (Kat Busch) wrote…
What types are the keys and values in the dict? See here for an example of dict typing: https://github.com/sidewalklabs/doppelganger/blob/master/doppelganger/bayesnets.py#L128

Noted


Comments from Reviewable

katbusch commented 6 years ago
:lgtm:

Great, thanks so much for the fixes @martibosch! Merging this now


Review status: 0 of 6 files reviewed at latest revision, 5 unresolved discussions.


doppelganger/datasource.py, line 31 at r1 (raw file):

Previously, martibosch (Martí Bosch) wrote…
the purpose is to get an array of True values with the same lenght of `cleaned_data in order to apply further filters, since at that point we do not know if only `state`, only `puma` or both are provided. I must say now that we're reviewing it, that it's simplier (and has better performance) to use `np.ones(len(cleaned_data), dtype=bool)`. Sorry about that.

Ah I see! Makes sense.


Comments from Reviewable