explicit dtypes to preserve leading zeros for PUMA and state codes

martibosch commented 6 years ago

explicit handling of DataFrame dtypes - PUMA and state codes as strings to preserve leading zeros allow filtering by PUMA and/or state separately Solves #59, #60

This change is

katbusch commented 6 years ago

Thank you so much for doing this! A few small comments but looking good to go very soon! I believe between this and #62 both example notebooks should work completely

Review status: 0 of 6 files reviewed at latest revision, 5 unresolved discussions.

doppelganger/datasource.py, line 31 at r1 (raw file):

        cleaned_data = preprocessor.process_dataframe(self.data, field_names, self.name_map)
        if state is not None or puma is not None:
            query_idx = ~cleaned_data.index.isnull()

What's the purpose of this line?

doppelganger/datasource.py, line 35 at r1 (raw file):

                query_idx = query_idx & (cleaned_data[inputs.STATE.name].astype(str) == str(state))
            if puma is not None:
                query_idx = query_idx & (cleaned_data[inputs.PUMA.name].astype(str) == str(puma))

Nit: we don't use abbrevs. query_idx --> query_index

doppelganger/datasource.py, line 67 at r1 (raw file):


    @staticmethod
    def from_csv(infile, dtype=None):

Can you add documentation for what dtype is as you do below?

doppelganger/datasource.py, line 71 at r1 (raw file):

            from pandas.compat import text_type
            # Code fields as str to keep leading zeros
            dtype = {

Nit: should be on three lines like so

dtype =  {
    column: text_type for column in [inputs.PUMA.pums_name, inputs.STATE.pums_name]
}

doppelganger/marginals.py, line 187 at r1 (raw file):

            state (unicode): state fips code (2-digit)
            puma (unicode): puma code (5-digit)
            dtype (dict): pandas dtype dict

What types are the keys and values in the dict? See here for an example of dict typing: https://github.com/sidewalklabs/doppelganger/blob/master/doppelganger/bayesnets.py#L128

Comments from Reviewable

martibosch commented 6 years ago

Review status: 0 of 6 files reviewed at latest revision, 5 unresolved discussions.

doppelganger/datasource.py, line 31 at r1 (raw file):