urban-displacement / displacement-typologies

The Urban Displacement Project's Displacement Typology Map code
GNU General Public License v3.0

Trouble replicating Imperial county typologies #115

Open xing-gao-phd opened 2 years ago

xing-gao-phd commented 2 years ago

Describe the bug: I'm running the code in the "SCAG-DT" folder, using CA census-based statistical areas to define "city", so the only cities in my dataset that are comparable to the UDP outputs are Ventura and Imperial. After running through 4_typology.py, my Ventura tracts match the UDP Ventura typology file well (off by only 1 census tract), but 11 of 27 Imperial tracts do not match, mostly in the displacement and gentrification categories. My dataset has 31 tracts in Imperial County, but there are only 27 in UDP's imperial_typology_output.csv and scag.csv. I suspect the medians are off because the total n differs, which changes the categorical variables derived from them; those discrepancies then accumulate and result in different typology categories.

To Reproduce: I think the discrepancy starts around line 384 of 2_data_curation.py, where the medians are calculated, for example rm_hinc_18 = np.nanmedian(census['hinc_18']). At this point there should still be 31 tracts in Imperial County (based on the code from the beginning of 2_data_curation.py to line 384). If I filter out the tracts that are not in Imperial_database_2018.csv, the medians match. For example, median(mydataset_31tracts["hinc_18"]) = 41767, median(Imperial_database_2018.csv["hinc_18"]) = 43651, and median(mydataset_27tracts["hinc_18"]) = 43651.
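A minimal sketch of the suspected mechanism, using made-up tract IDs and income values rather than the real Imperial data. The helper `nanmedian` is a hypothetical standard-library stand-in for `np.nanmedian` on a non-empty 1-D input, and the "below regional median" flag is an assumed example of a categorical variable, not necessarily how 2_data_curation.py defines its flags.

```python
import math
from statistics import median


def nanmedian(values):
    """Median of the non-NaN values (stand-in for np.nanmedian on 1-D data)."""
    clean = [v for v in values if not math.isnan(v)]
    return median(clean)


# Hypothetical tract -> household income values (NOT real data).
hinc_18 = {
    "t01": 30000.0,
    "t02": 35000.0,
    "t03": 40000.0,
    "t04": 45000.0,
    "t05": 50000.0,
    "t06": float("nan"),
    "t07": 20000.0,
}

# Hypothetical subset of tracts that survives some upstream filter.
kept = {"t02", "t03", "t04", "t05", "t06"}

# Regional median computed over all tracts vs. only the kept tracts.
full_median = nanmedian(hinc_18.values())
subset_median = nanmedian(v for t, v in hinc_18.items() if t in kept)

# Kept tracts whose "below regional median" flag depends on which median is used.
flipped = [
    t
    for t in sorted(kept)
    if not math.isnan(hinc_18[t])
    and (hinc_18[t] < full_median) != (hinc_18[t] < subset_median)
]

print(full_median, subset_median, flipped)
```

In this toy case, dropping two tracts moves the median enough to change the flag for one of the remaining tracts, which is the kind of shift that could then carry through to the typology categories.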

I think the same thing happens when the PUMS and Zillow data are used to create categorical variables.

The four missing tracts are 6025010102, 6025010900, 6025012302, and 6025940000. These tracts are in Imperialcensus_summ_2018.csv, the input at the beginning of 2_data_curation.py. Do you know why these tracts are not included? Where in the code should I be excluding them, and based on what criteria? Thanks!
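For anyone comparing their own run, a rough sketch of how the missing tracts can be listed by taking the set difference between the input and output tract IDs. The helper names `normalize_geoid` and `missing_tracts` are hypothetical, and the ID 06025010101 is only a placeholder for "a tract present in both files". The normalization step is there because California GEOIDs start with 0, which is lost when the column is read as a number.

```python
def normalize_geoid(value):
    """Return a tract ID as a digit string without leading zeros or a '.0' suffix."""
    return str(int(float(value)))


def missing_tracts(input_ids, output_ids):
    """Tract IDs present in the input but absent from the output."""
    have = {normalize_geoid(v) for v in output_ids}
    return sorted({normalize_geoid(v) for v in input_ids} - have)


# IDs as they might appear in the input file (strings with a leading zero).
input_ids = [
    "06025010101",  # placeholder for a tract present in both files
    "06025010102",
    "06025010900",
    "06025012302",
    "06025940000",
]

# IDs as they might appear in the output file (parsed as numbers).
output_ids = [6025010101]

missing = missing_tracts(input_ids, output_ids)
print(missing)
```

With the real files, `input_ids` and `output_ids` would be the tract ID columns read from the input and output CSVs; the column names are not assumed here.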