Closed martibosch closed 6 years ago
Thank you so much for doing this! A few small comments but looking good to go very soon! I believe between this and #62 both example notebooks should work completely
Review status: 0 of 6 files reviewed at latest revision, 5 unresolved discussions.
doppelganger/datasource.py, line 31 at r1 (raw file):
cleaned_data = preprocessor.process_dataframe(self.data, field_names, self.name_map) if state is not None or puma is not None: query_idx = ~cleaned_data.index.isnull()
What's the purpose of this line?
doppelganger/datasource.py, line 35 at r1 (raw file):
query_idx = query_idx & (cleaned_data[inputs.STATE.name].astype(str) == str(state)) if puma is not None: query_idx = query_idx & (cleaned_data[inputs.PUMA.name].astype(str) == str(puma))
Nit: we don't use abbrevs. query_idx --> query_index
doppelganger/datasource.py, line 67 at r1 (raw file):
@staticmethod def from_csv(infile, dtype=None):
Can you add documentation for what dtype is as you do below?
doppelganger/datasource.py, line 71 at r1 (raw file):
from pandas.compat import text_type # Code fields as str to keep leading zeros dtype = {
Nit: should be on three lines like so
dtype = {
column: text_type for column in [inputs.PUMA.pums_name, inputs.STATE.pums_name]
}
doppelganger/marginals.py, line 187 at r1 (raw file):
state (unicode): state fips code (2-digit) puma (unicode): puma code (5-digit) dtype (dict): pandas dtype dict
What types are the keys and values in the dict? See here for an example of dict typing: https://github.com/sidewalklabs/doppelganger/blob/master/doppelganger/bayesnets.py#L128
Comments from Reviewable
Review status: 0 of 6 files reviewed at latest revision, 5 unresolved discussions.
doppelganger/datasource.py, line 31 at r1 (raw file):
What's the purpose of this line?
the purpose is to get an array of True values with the same lenght of cleaned_data in order to apply further filters, since at that point we do not know if only
state, only
pumaor both are provided. I must say now that we're reviewing it, that it's simplier (and has better performance) to use
np.ones(len(cleaned_data), dtype=bool)`. Sorry about that.
doppelganger/datasource.py, line 35 at r1 (raw file):
Nit: we don't use abbrevs. query_idx --> query_index
Noted.
doppelganger/datasource.py, line 67 at r1 (raw file):
Can you add documentation for what dtype is as you do below?
Noted
doppelganger/datasource.py, line 71 at r1 (raw file):
Nit: should be on three lines like so ``` dtype = { column: text_type for column in [inputs.PUMA.pums_name, inputs.STATE.pums_name] } ```
Noted
doppelganger/marginals.py, line 187 at r1 (raw file):
What types are the keys and values in the dict? See here for an example of dict typing: https://github.com/sidewalklabs/doppelganger/blob/master/doppelganger/bayesnets.py#L128
Noted
Comments from Reviewable
Great, thanks so much for the fixes @martibosch! Merging this now
Review status: 0 of 6 files reviewed at latest revision, 5 unresolved discussions.
doppelganger/datasource.py, line 31 at r1 (raw file):
the purpose is to get an array of True values with the same lenght of `cleaned_data in order to apply further filters, since at that point we do not know if only `state`, only `puma` or both are provided. I must say now that we're reviewing it, that it's simplier (and has better performance) to use `np.ones(len(cleaned_data), dtype=bool)`. Sorry about that.
Ah I see! Makes sense.
Comments from Reviewable
explicit handling of DataFrame dtypes - PUMA and state codes as strings to preserve leading zeros allow filtering by PUMA and/or state separately Solves #59, #60
This change is![Reviewable](https://reviewable.io/review_button.svg)