Add estimation datasets into the input file

hanase commented 6 years ago

Include persons_for_estimation, households_for_estimation and jobs_for_estimation into the conversion script, including setting the right unique identifier in the function convert_dirs and re-create the input data file using the latest cache.

Peter reminded me that there are some differences even in the full datasets for the case when we estimate. For example, persons_for_estimation are connected to jobs that were non-existent in our jobs table, so we had to add them and now have a new jobs table used only for estimation. Thus, we now have two base year databases, one for estimation and one for simulation. Please talk to Peter if you need to find out what and where they are.

stefancoe commented 6 years ago

For the current UrbanSim implementation, the data is organized into 3 years: 2000, 2009 and 2014. From what I understand, it is important to maintain this structure since some of the data we need for estimation live in either the 2000 or 2009 folder. For example, the full parcel table lives in the 2000 folder while the parcel folder in 2014 contains the predicted and residual land value columns, which come from the REPM. For the script, I am thinking that we should re-implement this structure in h5, either by producing 3 H5 files or 3 groups within one file.
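As a rough sketch of the "3 groups within one file" option, the tables could be written under keys of the form `/<year>/<table_name>`. The helper name `write_year_groups` and the column names are made up for illustration, and a plain dict stands in for the store so the snippet runs without an HDF5 backend.

```python
import pandas as pd


def write_year_groups(store, tables_by_year):
    """Write each table under a '/<year>/<table_name>' key and return the keys.

    `store` is anything that supports item assignment, e.g. a dict or an
    open pandas.HDFStore.
    """
    keys = []
    for year, tables in sorted(tables_by_year.items()):
        for name, frame in sorted(tables.items()):
            key = f"/{year}/{name}"
            store[key] = frame
            keys.append(key)
    return keys


# Placeholder tables; the real columns come from the cache.
tables_by_year = {
    2000: {"parcels": pd.DataFrame({"parcel_id": [1, 2]})},
    2014: {
        "parcels": pd.DataFrame(
            {"parcel_id": [1, 2], "predicted_land_value": [1.0, 2.0]}
        )
    },
}

demo_store = {}  # stand-in for an HDF5 store
written = write_year_groups(demo_store, tables_by_year)
```

With a real file, an open `pd.HDFStore` (which requires PyTables) could be passed in place of `demo_store`, since it also accepts `store[key] = frame`.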

hanase commented 6 years ago

If we are going to organize the data by year, you might have a look at this indicator data file which extracts urbansim datasets from an output file that is structured by years. But before we do that, we need to figure out how to use lag variables in urbansim2 (in both estimation and simulation). Here is Eddie's take on this.

stefancoe commented 6 years ago

Thanks Hana! Bear with me while I talk myself through this and propose a possible solution. The HLCM is implemented using the lcm functions in utils.py. The parameters for the lcm functions include, among others, a choosers and a buildings data frame. For HLCM, the choosers data frame is the household table. To incorporate the lag variables, we need to know the previous household location, which could be stored on the household table. What about adding a field to the households table called previous_building_id? For estimation data, this would be populated with the building ids of in-migrant households' prior locations. For simulation, this field would be re-populated after each simulation year by setting it to the current building_id.
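A minimal sketch of that end-of-year update, assuming plain pandas data frames rather than the actual orca tables. The function name `roll_previous_building_id` is hypothetical.

```python
import pandas as pd


def roll_previous_building_id(households: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where previous_building_id holds the current building_id."""
    updated = households.copy()
    updated["previous_building_id"] = updated["building_id"]
    return updated


households = pd.DataFrame(
    {"building_id": [10, 11, 12], "previous_building_id": [7, 11, -1]},
    index=pd.Index([1, 2, 3], name="household_id"),
)

# Run once at the end of each simulation year.
households = roll_previous_building_id(households)
```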

Because we need attributes of households' previous building to generate the lag variables, a lot depends on what happens to the building table during simulation. Can individual building records be modified (e.g. building type)? And what happens when a building is destroyed? Does its record remain in the table? I think what I have described will work if building records are immutable and never deleted. However, if that is not the case, we could make a copy of the buildings table after each simulation year, like Eddie describes, and include this as part of the join_tbls (additional dfs keyed on building_id) parameter in the lcm functions (estimate & simulate). Here is the description of the join_tbls parameter:

join_tbls : list of strings
    A list of land use dataframes to give neighborhood info around the buildings - will be joined to the buildings using existing broadcasts

What do you think?

hanase commented 6 years ago

Your thinking on this is spot on, Stefan. Regarding your questions:

The original attributes of the buildings dataset (e.g. building_type) are not going to change, but of course building variables (e.g. number of HHs or jobs) will change.
Building records are deleted when a building is destroyed. If we wanted to keep them, I think that would complicate things in all kinds of other places.
Doing it in a way similar to what Eddie suggests may lead to a more generic solution, i.e. something that would work for any dataset and any attribute (just like it works in Opus).
Not sure if passing it via join_tbls would solve the issue (nice idea though), as the merging is done via an inner join, so destroyed buildings would not go through (if I understand it correctly).
I wonder if there would be a way to implement it within a variable (in the HLCM example it would be a household variable), so that the variable code would be responsible for getting the corresponding dataset from the previous year (e.g. buildings) and extracting from there whatever it needs.
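The inner-join concern can be illustrated with a small pandas example. This is not the lcm code itself, only a sketch of how an inner merge drops households whose previous building no longer has a record; the ids and type codes are invented.

```python
import pandas as pd

households = pd.DataFrame(
    {"household_id": [1, 2, 3], "previous_building_id": [10, 11, 12]}
)
# Building 12 was destroyed, so it has no record in the current table.
buildings = pd.DataFrame({"building_id": [10, 11], "building_type_id": [4, 19]})

# An inner join silently drops household 3.
inner = households.merge(
    buildings, left_on="previous_building_id", right_on="building_id", how="inner"
)

# A left join keeps household 3, with missing building attributes.
left = households.merge(
    buildings, left_on="previous_building_id", right_on="building_id", how="left"
)
```

Here `inner` keeps only households 1 and 2, while `left` keeps all three and leaves the building attributes missing for household 3.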

stefancoe commented 6 years ago

Yeah, it makes sense that the join_tbls parameter would not work. I like your idea of implementing them as variables. I saw how you implemented the interaction variables and that seems pretty straightforward. I'll give it a try. Thanks!

stefancoe commented 6 years ago

Hana- I have been working off a branch called stefan_dev and have created a new conversion script to create the input file from the estimation data. I think this script can replace the cache_to_h5 script as it uses an is_estimation flag to differentiate between the two, but I have not fully tested it so I am leaving both for now.

I have also added new orca tables for all the estimation data. I thought about just bringing each one as their simulation analog, but decided to create new tables (e.g. households_for_estimation_data instead of renaming to households) because (as you note) there are differences between these tables and it's possible both the estimation data and synthetic population data are needed during estimation.

I also added a way to generate previous household location lag variables by using the existing orca table called buildings_lag1 and an existing field on the households_for_estimation data called previous_building_id. For simulation, I think the previous_building_id would be set to the current building_id and buildings_lag1 would be populated with the current buildings table after each simulation year. Since buildings can be deleted, we need to represent lag buildings as a separate table, but I think we can get away with a field on the households table (previous_building_id) because households never get deleted. If this does not work we can just use the existing orca table called households_lag1. Scroll to def prev_residence_is_mf for an example of a lag variable.
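For readers without the branch at hand, a lag variable of this kind could look roughly like the following. This is a plain pandas sketch, not the actual prev_residence_is_mf in the repository; the multi-family type codes are assumed, and buildings_lag1 is taken to be indexed by building_id.

```python
import pandas as pd

# Assumed building_type_id codes for multi-family buildings (illustration only).
MULTI_FAMILY_TYPES = {4, 12}


def prev_residence_is_mf(households, buildings_lag1):
    """True where a household's previous building was multi-family.

    Households whose previous_building_id is not found in buildings_lag1
    (e.g. a placeholder such as -1) evaluate to False.
    """
    prev_type = buildings_lag1["building_type_id"].reindex(
        households["previous_building_id"]
    )
    is_mf = prev_type.isin(MULTI_FAMILY_TYPES)
    return pd.Series(
        is_mf.to_numpy(), index=households.index, name="prev_residence_is_mf"
    )


buildings_lag1 = pd.DataFrame(
    {"building_type_id": [4, 19, 12]},
    index=pd.Index([10, 11, 12], name="building_id"),
)
households = pd.DataFrame(
    {"previous_building_id": [10, 11, 12, -1]},
    index=pd.Index([1, 2, 3, 4], name="household_id"),
)

lag_values = prev_residence_is_mf(households, buildings_lag1)
```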

I am using the hlcm model to test all of this, but have hit a few bugs. I am going to work on those today and will post them as issues if I cannot figure them out.

Thanks!

hanase commented 6 years ago

Thanks so much Stefan - I'll have a look! Just a note, every simulation year households do get deleted and others added. It simulates moving in and out of the area.

stefancoe commented 6 years ago

Oh ok- I thought that might be the case, but Peter thought otherwise. It should be easy enough to implement it as a separate orca table.

stefancoe commented 6 years ago

Although I think it still works because the deleted households (obviously) don't get simulated, so we don't need to know anything about them.

hanase commented 6 years ago

That's right. And new households should get NA or a negative number.
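A sketch of how the field could be filled when households enter and leave between years, assuming pandas data frames indexed by household_id. The function name `set_previous_building_id` is hypothetical, and -1 is used as the "negative number" placeholder for new households.

```python
import pandas as pd


def set_previous_building_id(households_now, households_last_year):
    """Attach last year's building_id to this year's households.

    Households that did not exist last year get -1; households deleted
    since last year are simply absent from the result.
    """
    prev = households_last_year["building_id"].reindex(households_now.index)
    out = households_now.copy()
    out["previous_building_id"] = prev.fillna(-1).astype("int64")
    return out


last_year = pd.DataFrame(
    {"building_id": [10, 11, 12]},
    index=pd.Index([1, 2, 3], name="household_id"),
)
# Household 1 left the region, household 4 is new, household 3 moved.
this_year = pd.DataFrame(
    {"building_id": [11, 15, 20]},
    index=pd.Index([2, 3, 4], name="household_id"),
)

result = set_previous_building_id(this_year, last_year)
```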

psrc / urbansim2

Add estimation datasets into the input file #102