Comments on the 1033 data / analysis

evanmiller commented 3 years ago

Not sure if this is the right place, but I wanted to offer a few brief comments on the 1033 data and analysis in this repo:

1. I see that population=0 in many of the data rows. This looks like non-normal measurement error that may wreck the regression estimates.
2. Population would be a good candidate for an exposure variable (i.e. estimate killings per population). It can be included as such by adding + offset(log(population)) to the glm specification.
3. Addressing 1 and 2 (i.e. dropping zero-population observations and using an offset variable), I get an estimated effect about 1/3 the size of that reported. I also see that the med_inc coefficient turns negative (and becomes statistically insignificant), which resolves the unintuitive result that was reported.
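
To illustrate the two adjustments, here is a minimal Python sketch (not the R code I actually ran). It assumes made-up rows and an intercept-only model, for which the Poisson MLE with a log-population offset has a closed form: total killings divided by total population. The data and function names are hypothetical.

```python
import math

# Hypothetical rows: (n_kill, population). population == 0 marks an unknown value.
rows = [(2, 50_000), (0, 0), (5, 200_000), (1, 0), (3, 150_000)]

def drop_zero_population(rows):
    """Keep only rows with a usable population; log(0) is undefined as an offset."""
    return [(k, p) for k, p in rows if p > 0]

def fit_intercept_with_offset(rows):
    """MLE of b0 in log E[n_kill] = b0 + log(population).

    Setting the Poisson score to zero gives exp(b0) = sum(n_kill) / sum(population).
    """
    total_kills = sum(k for k, _ in rows)
    total_pop = sum(p for _, p in rows)
    return math.log(total_kills / total_pop)

kept = drop_zero_population(rows)
b0 = fit_intercept_with_offset(kept)
rate_per_100k = math.exp(b0) * 100_000
print(f"rows kept: {len(kept)}, killings per 100k: {rate_per_100k:.2f}")
```

With covariates there is no closed form and the fit needs an iterative solver, but the offset plays the same role: the model describes killings per unit of population rather than raw counts.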

Model:

glm(n_kill ~ year + officers_total + v_crime + ln(1+value_v) + ln(1+value_nv) + drug_use + med_inc + black_perc + offset(log(population)), family=poisson)

Estimates:

outcome variable	explanatory variable	coefficient	standard error	z-score	p-value	95% confidence interval (lower)	95% confidence interval (upper)
n_kill	year	0.052272811187214	0.023946393761619	2.182909531496802	0.029042472979556	0.005338741854826	0.099206880519602
n_kill	officers_total	-0.000270208193352	0.000359283618236	-0.752074905830492	0.45200601386805	-0.00097439114533	0.000433974758627
n_kill	v_crime	0.000097563462557	0.000044771666347	2.179134048800151	0.02932171151276	0.00000981260899	0.000185314316124
n_kill	ln(1+value_v)	0.039398557099263	0.016692925981437	2.360194800065343	0.018265340920982	0.006681023379054	0.072116090819473
n_kill	ln(1+value_nv)	0.003347464878661	0.023585716666399	0.141927630438708	0.887137169300111	-0.042879690337047	0.049574620094368
n_kill	drug_use	-6.728686262466446	10.338661851849096	-0.650827578934986	0.515157795183758	-26.992091140428855	13.534718615495962
n_kill	med_inc	-0.000004027405887	0.000005268084493	-0.764491513378137	0.444574392165331	-0.000014352661761	0.000006297849987
n_kill	black_perc	-0.279186965700285	0.517662232830434	-0.539322647073881	0.58966424480124	-1.293786298204523	0.735412366803953
n_kill	constant	-117.42035536270915	48.133921516644904	-2.439451257303121	0.014709586925584	-211.76110797001073	-23.079602755407564
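
As a sanity check on how the columns relate, the z-score, p-value and 95% interval can be recomputed from a coefficient and its standard error. This sketch assumes the usual Wald (normal-approximation) formulas and uses the ln(1+value_v) row from the table; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Coefficient and standard error copied from the ln(1+value_v) row above.
coef = 0.039398557099263
se = 0.016692925981437

std_normal = NormalDist()              # mean 0, standard deviation 1
z = coef / se                          # Wald z-score
p_value = 2 * (1 - std_normal.cdf(abs(z)))   # two-sided p-value
crit = std_normal.inv_cdf(0.975)       # about 1.96 for a 95% interval
ci_lower = coef - crit * se
ci_upper = coef + crit * se

# In a Poisson model with a log link, exp(coef) is a multiplicative rate ratio
# per one-unit increase in the regressor.
rate_ratio = math.exp(coef)

print(z, p_value, ci_lower, ci_upper, rate_ratio)
```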

Happy to discuss in more detail (or not!).

nthieme commented 3 years ago

Thanks for commenting, I appreciate you spending some time with the data.

1) is something I actually went back and forth about. I'm aware of the measurement error, but, in general, I'd rather not throw away observations on the basis of a variable that isn't directly of interest if I don't have to. I'd looked at the model both ways, including vs. excluding the population = 0 rows, and it doesn't make a big difference in the inference, so I kept them.

2) I'd also considered this, but I don't think it makes sense here for a couple of reasons. More than anything, the research papers I replicated mention that population affects the number of killings in ways other than just being an "opportunity," which turned me off the idea. Lawson, for example, lists a couple of ways that population acts on the psychology of officers. It just doesn't seem like we can assume a fixed coefficient of 1.
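
To spell out the "fixed coefficient of 1" point, the two specifications can be written side by side. The symbols $x_i$, $\beta$ and $\gamma$ are notation introduced here for illustration, not taken from the replicated papers.

```latex
% Offset specification: the coefficient on log(population) is constrained to 1
\log \mathbb{E}[\mathrm{n\_kill}_i] = \log(\mathrm{population}_i) + x_i^\top \beta
\quad\Longleftrightarrow\quad
\frac{\mathbb{E}[\mathrm{n\_kill}_i]}{\mathrm{population}_i} = \exp\!\left(x_i^\top \beta\right)

% Unconstrained specification: log(population) enters as an ordinary regressor
\log \mathbb{E}[\mathrm{n\_kill}_i] = \gamma \, \log(\mathrm{population}_i) + x_i^\top \beta
```

The offset model is the special case $\gamma = 1$, so it assumes expected killings scale exactly in proportion to population, holding the other regressors fixed.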

In general, when I'm replicating a study like this, I'd like to stay as close to the work as possible, unless they're making a clear error, which doesn't seem to be the case here.

evanmiller commented 3 years ago

Thanks for the fast reply. I think econometrically you'd want to set unknown populations to the mean population (rather than 0) so that the measurement error would be centered around zero. Regarding (2), you could include population both as an offset and as a separate regressor with its own coefficient to account for the officer psychology. But some of the other variables would then likely need to be re-specified as per-capita figures.
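
A minimal sketch of the mean-replacement step, assuming unknown populations are coded as 0 and using made-up values (the function name is hypothetical):

```python
def impute_mean_population(populations):
    """Replace zero (unknown) populations with the mean of the known ones."""
    known = [p for p in populations if p > 0]
    if not known:
        raise ValueError("no known populations to average")
    mean_pop = sum(known) / len(known)
    return [p if p > 0 else mean_pop for p in populations]

# Hypothetical values: the middle entry is unknown and coded as 0.
imputed = impute_mean_population([100_000, 0, 300_000])
print(imputed)
```

This only shows the replacement itself; whether imputing or dropping is preferable depends on why the populations are missing.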

A couple more points, which you can take or leave:

I would think "year" should be included as a fixed effect rather than a linear regressor - but when I tried it I failed to get convergence.
Since military equipment is a durable good, it might make sense to use cumulative spending (possibly depreciated) as the regressor rather than annual spending.
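
A rough sketch of what a depreciated cumulative regressor could look like, assuming a constant annual depreciation rate. The 20% rate, the spending figures and the function name are placeholders chosen for illustration, not estimates.

```python
def depreciated_stock(annual_spending, depreciation_rate):
    """Running equipment stock: stock_t = (1 - rate) * stock_{t-1} + spending_t."""
    if not 0.0 <= depreciation_rate <= 1.0:
        raise ValueError("depreciation_rate must be between 0 and 1")
    stock = 0.0
    series = []
    for spending in annual_spending:
        stock = (1.0 - depreciation_rate) * stock + spending
        series.append(stock)
    return series

# Hypothetical spending for three consecutive years in one jurisdiction.
spending = [100.0, 0.0, 50.0]
stock_20pct = depreciated_stock(spending, 0.2)   # with depreciation
stock_plain = depreciated_stock(spending, 0.0)   # plain cumulative sum
print(stock_20pct, stock_plain)
```

A rate of 0 reduces to plain cumulative spending, and a rate of 1 reproduces the annual-spending regressor, so the choice of rate interpolates between the two specifications.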

In any event it is clear that you have put a ton of work into this project and I'm very glad you've made the sources and methods publicly available! I will take a look at your references to get a better sense of the previous work in this area.
