replicahq / doppelganger

A Python package of tools to support population synthesizers
Apache License 2.0

marginals dtypes #59

Closed martibosch closed 6 years ago

martibosch commented 6 years ago

I came across this issue in doppelganger_example_full.ipynb when creating the marginals from the census data:

new_marginal_filename = os.path.join(output_dir, 'new_marginals.csv')

with open('sample_data/2010_puma_tract_mapping.txt') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    marginals = Marginals.from_census_data(
        csv_reader, CENSUS_KEY, state=STATE, pumas=PUMA
    )
    marginals.write(new_marginal_filename)

and passing marginals directly to the HouseholdAllocator as in:

allocator = HouseholdAllocator.from_cleaned_data(marginals, households_data, persons_data)

yields:

Truncated Traceback (Use C-c C-x to view full TB):
/home/martibosch/activitysim/src/doppelganger/doppelganger/allocation.pyc in _allocate_households(households, persons, tract_controls)
    163         w_extend = np.tile(w, (n_tracts, 1))
    164         mu_extend = np.mat(np.tile(mu, (n_tracts, 1)))
--> 165         B = np.mat(np.dot(np.ones((1, n_tracts)), A)[0])
    166 
    167         # Our trade-off coefficient gamma

TypeError: can't multiply sequence by non-int of type 'float'

This does not happen when reading the marginals from a CSV, i.e. marginals = Marginals.from_csv(new_marginal_filename), since the types are then correctly inferred.
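For illustration, the error can be reproduced outside doppelganger with a minimal sketch. It assumes that the matrix `A` in `_allocate_households` ends up holding the control values as Python strings (object dtype), which is consistent with the traceback but has not been checked against the library internals. The array contents below are illustrative values.

```python
import numpy as np

# Control totals as they would look when built from string rows:
# an object-dtype array whose elements are Python str.
A = np.array([['909', '1124'], ['713', '2334']], dtype=object)
n_tracts = A.shape[0]

try:
    # Same shape of expression as line 165 of allocation.py.
    np.dot(np.ones((1, n_tracts)), A)
    error_message = None
except TypeError as exc:
    # float * str is not defined, so the object-dtype dot product fails.
    error_message = str(exc)

# Once the same values are cast to integers, the product is well defined.
column_sums = np.dot(np.ones((1, n_tracts)), A.astype(int))
```

Here `error_message` captures the failure from the string-valued array, while `column_sums` holds the per-column totals after the cast.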

So I guess this could be fixed by explicitly controlling the dtypes as in:

modified   doppelganger/marginals.py
@@ -165,8 +165,12 @@ class Marginals(object):
                     output.append(str(controls_dict[control_name]))
                 data.append(output)

-        columns = ['STATEFP', 'COUNTYFP', 'PUMA5CE', 'TRACTCE'] + list(CONTROL_NAMES)
-        return Marginals(pandas.DataFrame(data, columns=columns))
+        code_columns = ['STATEFP', 'COUNTYFP', 'PUMA5CE', 'TRACTCE']
+        control_columns = list(CONTROL_NAMES)
+        marginals_df = pandas.DataFrame(data, columns=code_columns + control_columns)
+        marginals_df[code_columns] = marginals_df[code_columns].astype(str)
+        marginals_df[control_columns] = marginals_df[control_columns].astype(int)
+        return Marginals(marginals_df)

This conflicts with the test MarginalsTest.test_fetch_marginals, but the test could be easily fixed, since the only difference is that it expects the control values as Python strings rather than integers:

_________________________________________________________ MarginalsTest.test_fetch_marginals _________________________________________________________

self = <test_marginals.MarginalsTest testMethod=test_fetch_marginals>

    def test_fetch_marginals(self):
        state = self._mock_marginals_file()[0]['STATEFP']
        puma = self._mock_marginals_file()[0]['PUMA5CE']
        with patch('doppelganger.marginals.Marginals._fetch_from_census',
                   return_value=self._mock_response()):
            marg = Marginals.from_census_data(
                    puma_tract_mappings=self._mock_marginals_file(), census_key=None,
                    state=state, pumas=set([puma])
                )
        expected = {
            'STATEFP': '06',
            'COUNTYFP': '075',
            'PUMA5CE': '07507',
            'TRACTCE': '023001',
            'age_0-17': '909',
            'age_18-34': '1124',
            'age_65+': '713',
            'age_35-64': '2334',
            'num_people_count': '1335',
            'num_people_1': '168',
            'num_people_3': '304',
            'num_people_2': '341',
            'num_people_4+': '522',
            'num_vehicles_0': '0',
            'num_vehicles_1': '1',
            'num_vehicles_2': '2',
            'num_vehicles_3+': '3'
        }
        result = marg.data.loc[0].to_dict()
>       self.assertDictEqual(result, expected)
E       AssertionError: {u'num_people_4+': 522, u'num_people_3': 304, u'num_people_2': 341, u'num_people [truncated]... != {u'num_people_4+': u'522', u'age_18-34': u'1124', u'num_people_1': u'168', u'age [truncated]...
E       Diff is 1298 characters long. Set self.maxDiff to None to see it.

test/test_marginals.py:86: AssertionError
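One way the expected values could be adapted is sketched below on a subset of the dictionary from the test. The `CODE_COLUMNS` set and the `expected_as_strings` name are introduced here for illustration only and are not part of the test suite.

```python
# Geographic code columns, which should stay strings so leading zeros survive.
CODE_COLUMNS = {'STATEFP', 'COUNTYFP', 'PUMA5CE', 'TRACTCE'}

# Subset of the expected values from the test, still written as strings.
expected_as_strings = {
    'STATEFP': '06',
    'TRACTCE': '023001',
    'age_0-17': '909',
    'num_vehicles_3+': '3',
}

# Keep the codes as strings and cast every control count to int.
expected = {
    key: value if key in CODE_COLUMNS else int(value)
    for key, value in expected_as_strings.items()
}
```

With this change the control entries compare equal to integer-typed values, while '06' and '023001' keep their leading zeros.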

I guess more issues of this type could be encountered, so perhaps there should be an overall strategy for dealing with column dtypes. If you agree, I could spend some time on it and submit a PR :)
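As a rough sketch of what such a strategy might look like, a single dtype mapping could be shared by every code path that builds the marginals table. The helper `marginal_dtypes` and the sample record are hypothetical and only illustrate the idea. They rely on standard pandas behaviour (`DataFrame.astype` and the `dtype` argument of `read_csv`), not on anything in doppelganger.

```python
import io

import pandas as pd

CODE_COLUMNS = ['STATEFP', 'COUNTYFP', 'PUMA5CE', 'TRACTCE']


def marginal_dtypes(control_columns):
    """Hypothetical helper: one dtype per column, shared by every loader."""
    dtypes = {name: str for name in CODE_COLUMNS}
    dtypes.update({name: int for name in control_columns})
    return dtypes


controls = ['age_0-17', 'age_18-34']  # illustrative subset of the controls
schema = marginal_dtypes(controls)

# Path 1: rows assembled in memory as strings, then cast with the schema.
rows = [['06', '075', '07507', '023001', '909', '1124']]
from_rows = pd.DataFrame(rows, columns=CODE_COLUMNS + controls).astype(schema)

# Path 2: the same record read from CSV text with the same schema,
# so the codes keep their leading zeros instead of being parsed as numbers.
csv_text = (
    'STATEFP,COUNTYFP,PUMA5CE,TRACTCE,age_0-17,age_18-34\n'
    '06,075,07507,023001,909,1124\n'
)
from_csv = pd.read_csv(io.StringIO(csv_text), dtype=schema)
```

Both frames then carry string codes and integer controls, regardless of whether the data came from the census response or from a saved CSV.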

katbusch commented 6 years ago

Fixed by #61