replicahq / doppelganger

A Python package of tools to support population synthesizers
Apache License 2.0

keep leading zeros in code columns - dtypes #60

Closed martibosch closed 6 years ago

martibosch commented 6 years ago

In dataframes, columns holding state and PUMA codes should preserve their leading zeros. Currently, a code such as 00106 becomes 106 when pandas automatically infers a numeric type. As encountered with doppelganger_example_simple.ipynb, this can lead to wrong filter results, e.g. in lines 31-34 of doppelganger/datasource.py:

cleaned_data = cleaned_data[
    (cleaned_data[inputs.STATE.name].astype(str) == str(state)) &
    (cleaned_data[inputs.PUMA.name].astype(str) == str(puma))
]

The left-hand side can become '106' (due to pandas' automatic dtype inference), whereas the right-hand side (passed by the user) is '00106', so the comparison fails and the filter returns no rows. As remarked in #59, there should be a general strategy to control the column dtypes.
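To make the mismatch concrete, here is a minimal sketch of one possible dtype strategy: declaring the code columns as strings when the file is read. The CSV contents, the column names `state` and `puma`, and the helper `filter_by_codes` are made up for illustration and are not doppelganger's actual schema or API; only `pd.read_csv` and its `dtype` argument come from pandas itself.

```python
import io

import pandas as pd

# Hypothetical two-row extract; column names are illustrative only.
CSV_TEXT = "state,puma,age\n06,00106,34\n06,00107,51\n"

# Default inference parses the code columns as integers and drops the zeros.
inferred = pd.read_csv(io.StringIO(CSV_TEXT))

# Declaring the code columns as strings keeps them exactly as written.
CODE_DTYPES = {"state": str, "puma": str}
preserved = pd.read_csv(io.StringIO(CSV_TEXT), dtype=CODE_DTYPES)


def filter_by_codes(df, state, puma):
    """Apply the same string comparison as the filter quoted in this issue."""
    return df[
        (df["state"].astype(str) == str(state))
        & (df["puma"].astype(str) == str(puma))
    ]


# With inferred dtypes the user-supplied '00106' matches nothing;
# with string dtypes it matches the intended row.
inferred_matches = filter_by_codes(inferred, "06", "00106")
preserved_matches = filter_by_codes(preserved, "06", "00106")
```

With this approach the filter itself would not need to change, but every place that loads the data would have to pass the same dtype mapping.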

martibosch commented 6 years ago

Another option is to change the comparison logic of the filter so that, e.g., the string '00106' also matches the string '106' or the integer 106, and vice versa.
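A rough sketch of that alternative, assuming codes are purely numeric and have a known fixed width (5 digits for PUMA codes, 2 for state FIPS codes). The helpers `normalize_code` and `codes_match` are hypothetical names, not part of doppelganger:

```python
def normalize_code(value, width):
    """Return value as a zero-padded digit string of the given width."""
    text = str(value).strip()
    if not text.isdigit():
        # Floats such as 106.0 or missing values are rejected, not guessed at.
        raise ValueError("not a numeric code: {!r}".format(value))
    return text.zfill(width)


def codes_match(left, right, width=5):
    """Compare two codes regardless of int/str type or missing leading zeros."""
    return normalize_code(left, width) == normalize_code(right, width)
```

For example, `codes_match('00106', 106)` and `codes_match('106', '00106')` both hold, while `codes_match('00106', '00107')` does not. The trade-off is that the filter tolerates either representation, but the dataframe columns themselves still lose their zeros.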

katbusch commented 6 years ago

Thanks for the reports @martibosch

We welcome contributions!

katbusch commented 6 years ago

Fixed by #61