jswsean / ppol564_final_project_group4


Granularity of by-race, by-day data #1

Open · jswsean opened this issue 1 year ago

jswsean commented 1 year ago

For the regression discontinuity method, we were planning to use days since State's Attorney Kim Foxx's entry into office as our running variable. So we centered the sentencing date variable on Foxx's entry date and first looked at the granularity of the data.
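
The centering step looked roughly like this (a sketch; sentence_date is a stand-in for the actual datetime column in our data):

import pandas as pd

# Kim Foxx was sworn in as Cook County State's Attorney on December 1, 2016
foxx_entry = pd.Timestamp('2016-12-01')

# days since entry: negative before Foxx took office, positive after
sentencing_analysis['sa_timedelta_days'] = (
    sentencing_analysis['sentence_date'] - foxx_entry
).dt.days

We then grouped by day and race to count observations per cell: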

sentencing_byday = (
    sentencing_analysis
    .groupby(['sa_timedelta_days', 'is_black'])['is_incarcerated']
    .agg([('n', 'size')])
    .reset_index()
)
sentencing_byday

and our data looks like this:

[screenshot: first rows of sentencing_byday, with per-day, per-race counts n]

We wanted to look at the distribution of n within a bandwidth around Foxx's entry. This is the distribution for the range of -90 to 90 days since her entry:

sentencing_byday[(sentencing_byday.sa_timedelta_days >= -90) & 
                 (sentencing_byday.sa_timedelta_days <= 90)].n.describe()

===================
count    258.000000
mean      33.759690
std       18.116474
min        1.000000
25%       22.000000
50%       30.000000
75%       47.000000
max       88.000000  

Question: Is this by-race, by-day distribution too sparse for an RD approach? Should we use weeks or months instead of days?
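
If we switched to weeks, I assume we would just coarsen the running variable, something like:

# floor-divide the day counter: days 0-6 map to week 0, days -7 to -1 to week -1
sentencing_analysis['sa_timedelta_weeks'] = (
    sentencing_analysis['sa_timedelta_days'] // 7
)

sentencing_byweek = (
    sentencing_analysis
    .groupby(['sa_timedelta_weeks', 'is_black'])['is_incarcerated']
    .agg([('n', 'size')])
    .reset_index()
)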

rebeccajohnson88 commented 1 year ago

thanks for sharing and sorry for the delayed reply!

how many total observations are there in the 90-day bandwidth? i think the total sample size within the bandwidth matters more than the N observations per day
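
e.g. summing the per-day counts you already computed (a quick sketch off your sentencing_byday frame):

# total N (summed across race groups) within +/- 90 days of the cutoff
in_bw = sentencing_byday.sa_timedelta_days.between(-90, 90)
print(sentencing_byday.loc[in_bw, 'n'].sum())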

for the bandwidth selection, i think the current literature favors automated, data-driven bandwidth selection; a popular implementation is https://github.com/rdpackages/rdrobust

so you may want to check the suggested bandwidth with that software, then see the N defendants in that window
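
a minimal sketch with the python port of that package (pip install rdrobust), run on the defendant-level data rather than the daily counts; variable names are taken from your snippet above:

from rdrobust import rdbwselect

# data-driven (MSE-optimal) bandwidth selection around the cutoff c=0,
# i.e. Foxx's entry date in days-since-entry units
bw = rdbwselect(
    y=sentencing_analysis['is_incarcerated'],
    x=sentencing_analysis['sa_timedelta_days'],
    c=0
)
print(bw)

then you can filter to the suggested window with the same between() check as above to get the N defendants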