psrc / urbansim2

3 stars 0 forks source link

households_transition step results in records in persons table that do not have related records in households #123

Closed stefancoe closed 6 years ago

stefancoe commented 6 years ago

Edited for clarity

After this step, the Persons table has 6033 records with invalid household_id (no corresponding records in the households table).

stefancoe commented 6 years ago

I think I see why this is happening.

The function/method 'transition' on the TabularGrowthRateTransition class in transition.py returns updated, added_indexes, copied_indexes, & removed_indexes. Updated is the modified households table that represents existing households & additional households due to growth for a given year, as specified by a control totals file. This file also provides distribution parameters.

I think the problem starts on lines 305 at the start of the loop that goes through the control totals and makes adjustments to the households: Line 305 creates the variable 'subset'- a subset of 'data', which is the households table:

subset = util.filter_table(data, row, ignore={self._config_column})

Line 319 returns an updated (households) table along with objects for rows that are copied, added & deleted:

updated, added, copied, removed = \
                add_or_remove_rows(subset, nrows, starting_index, self.accounting_column)

Later on, the persons table ('table' in code below) is updated on line 519:

for table_name, (table, col) in linked_tables.items():
            logger.debug('updating linked table {}'.format(table_name))
            updated_links[table_name] = \
                _update_linked_table(table, col, added, copied, removed)

The problem, I think, is that it is possible that an existing household, say household_id 878, never gets picked as part of the variable 'subset'. Since 'added', 'copied' and 'removed' are all based on the variable subset, household 878 is not dealt with (removed from persons) because it was never picked on line 305. This household is no longer in the households table because the households table is set to 'updated', which was created from subset on line 319.

The actual updating occurs in utils

On line 580:

model = transition.TransitionModel(tran)
    new, added_hh_idx, new_linked = model.transition(hh, year, linked_tables=linked_tables)
    new.loc[added_hh_idx, location_fname] = -1
    print "Total agents after transition: {}".format(len(new))
    orca.add_table(agents.name, new)
    for table_name, table in new_linked.iteritems():
        print "Total %s after transition: %s" % (table_name, len(table))
        orca.add_table(table_name, table)

where agents.name is 'households' and new is 'updated'. And orca.add_table(table_name, table) is adding the modified persons table, which still has household_id 878, which is no longer in households.

stefancoe commented 6 years ago

It turns out the missing households have an income of 49,999 or 74,999 so this is a problem with our control totals.

msimonson-psrc commented 6 years ago

We have our control totals as 0 to 49,999 ... 50,000 to 74,999 .... 75,000 to 99,999 ... and 100,000 up to Infinity. Assuming all income values are integers, that would mean that the max ends of the ranges are being interpreted exclusively? IE we should list them as 0 to 50,000 ... 50,000 to 75,000 ... etc.? And probably a fluke that no HHs with income of $99,999 also in the missing?

stefancoe commented 6 years ago

Thanks Mark. And no fluke, just me not looking far enough down the table. Households with an income of 99,999 are in there as well.

stefancoe commented 6 years ago

@msimonson-psrc- Here is an example of one of the queries from the code. So it looks like >= is used for the min and < is used for the max. 'income < 99999.0 and income >= 75000.0 and persons < 1.5 and persons >= 1.0 and workers < 0.5 and workers >= 0.0'

msimonson-psrc commented 6 years ago

What's best fix, code change or input change? Feels more intuitive to me to leave the ranges in input table inclusive, and make it "income <= 99999.0 and income >= 75000.0 and ... " But might just be me, I'm open. Hana, this in our current/opus version of code too?

hanase commented 6 years ago

In Opus, I think it's done the way you say, but that has the disadvantage that you could not use continuous values (it presumes integers). I think in order to be able to use the urbansim2 code without modifications, we need to make changes in the control totals table. We could change the Opus code to work the same way, so that we don't need two versions of the table.

hanase commented 6 years ago

I realized I already made changes to the control totals used in urbansim2, namely for workers and persons, as it would not work when both min and max are the same number: https://github.com/psrc/urbansim2/issues/26 So for example for households with 1 persons, we need to have min 1 and max e.g. 1.5. I did not do this change for income. Stefan, I'm wondering if we could have a little script that would do these changes on the fly during the conversion into the h5 file, so that we don't need to do data and code changes in Opus.

stefancoe commented 6 years ago

Hana- that sounds good to me.

stefancoe commented 6 years ago

Fixed with this commit: https://github.com/psrc/urbansim2/commit/3246899996c0777d02e3c6bc60969840c08712bd

hanase commented 6 years ago

@stefancoe - this change has been made into cache_to_hdf5.py, but not into estimation_cache_to_hdf5.py, which is going to be the main conversion script, for both, simulation and estimation, right?

It also needs the changes described in #26 for persons_max and workers_max. Previously I made those changes in mysql, but it's easier to put it into the conversion script. So the script needs these three lines in the HH control totals section:

df.income_max = np.where(df.income_max<>-1, df.income_max + 1, -1)
df.workers_max = np.where(df.workers_max<>-1, df.workers_max + 0.5, -1)
df.persons_max = np.where(df.persons_max<>-1, df.persons_max + 0.5, -1)
stefancoe commented 6 years ago

This is all covered by one conversion script