Describe the bug
I'm running the code in the "SCAG-DT" folder, using CA census-based statistical areas to define "city", so the only cities in my dataset that are comparable to the UDP outputs are Ventura and Imperial. After running through 4_typology.py, my Ventura tracts match the UDP Ventura typology file well (off by only 1 census tract), but 11 of the 27 Imperial tracts do not match, mostly in the displacement and gentrification categories. My dataset has 31 tracts in Imperial County, but the UDP imperial_typology_output.csv and scag.csv contain only 27. I suspect the medians are being computed over a different number of tracts, which shifts the cutoffs used to create the categorical variables; those differences accumulate and result in different typology categories.
To Reproduce
I think the discrepancy starts around line 384 in 2_data_curation.py, where the medians are calculated, for example rm_hinc_18 = np.nanmedian(census['hinc_18']).
At this point there should still be 31 tracts in Imperial County (based on the code from the beginning of 2_data_curation.py to line 384). If I filter out the tracts that are not in Imperial_database_2018.csv, the medians match.
For example, median(mydataset_31tracts["hinc_18"])=41767, median(Imperial_database_2018.csv["hinc_18"])=43651, and median(mydataset_27tracts["hinc_18"])=43651.
I think the same thing is happening when the PUMS and Zillow data are used to create categorical variables.
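The median comparison described above can be sketched roughly as follows. This is only an illustration under assumptions: the helper name compare_medians and the tract ID column name "FIPS" are hypothetical (the actual column in the UDP files may be named differently), and the toy values are made up rather than taken from the Imperial data.

```python
import numpy as np
import pandas as pd


def compare_medians(full, reference_ids, column, id_column="FIPS"):
    """Return (median over all tracts, median over tracts found in reference_ids).

    id_column is an assumed name for the tract identifier column.
    """
    keep = full[id_column].isin(set(reference_ids))
    all_median = float(np.nanmedian(full[column].to_numpy(dtype=float)))
    subset_median = float(np.nanmedian(full.loc[keep, column].to_numpy(dtype=float)))
    return all_median, subset_median


# Toy data (made up): five tracts, of which only three appear in the reference file.
toy = pd.DataFrame(
    {
        "FIPS": [1, 2, 3, 4, 5],
        "hinc_18": [30000.0, 40000.0, 50000.0, 60000.0, np.nan],
    }
)
all_median, subset_median = compare_medians(toy, [2, 3, 4], "hinc_18")
print(all_median, subset_median)
```

With real data, full would be the census frame at around line 384 and reference_ids the tract IDs read from Imperial_database_2018.csv; if the two medians differ, every threshold derived from them will differ as well.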
The four missing tracts are: 6025010102 6025010900 6025012302 6025940000. These tracts are present in Imperialcensus_summ_2018.csv, which is the input at the beginning of 2_data_curation.py. Do you know why these tracts are not included in the outputs? Where in the code should I be excluding them, and based on what criteria? Thanks!
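For anyone trying to reproduce the list of missing tracts, one possible approach is a left merge with an indicator column. This is a sketch under assumptions: the function name tracts_missing_from_output and the column names "FIPS" and "pop_18" are hypothetical, and the toy IDs are placeholders rather than real tract numbers.

```python
import pandas as pd


def tracts_missing_from_output(inputs, outputs, id_column="FIPS"):
    """Return the rows of `inputs` whose tract ID does not appear in `outputs`.

    id_column is an assumed name for the tract identifier column.
    """
    output_ids = outputs[[id_column]].drop_duplicates()
    merged = inputs.merge(output_ids, on=id_column, how="left", indicator=True)
    # Rows marked "left_only" exist in the input but not in the output.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


# Toy frames with placeholder IDs and made-up values.
toy_in = pd.DataFrame({"FIPS": [101, 102, 103], "pop_18": [10, 20, 30]})
toy_out = pd.DataFrame({"FIPS": [102]})
missing = tracts_missing_from_output(toy_in, toy_out)
print(missing)
```

Inspecting the returned rows (for example, looking for missing values in the input columns) may hint at the exclusion criteria, although that is a guess and not something confirmed by the UDP code.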