Closed misea closed 3 years ago
For comparison, in Diamondhead, MS (93% White, 2.9% Black), someone with the name SMITH has a predicted probability of being White of .95.
The assumption of the method is that conditional on race, surname and geolocation are independent of one another. Please see https://imai.fas.harvard.edu/research/files/race.pdf for details. Also note that the method is trying to get things correct on average, not for each specific name.
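To make the conditional independence assumption concrete, here is a minimal sketch of the resulting Bayes update, P(R | S, G) ∝ P(G | R) · P(R | S). The function name `posterior_race` and all numbers are hypothetical placeholders chosen for illustration; they are not real census values and this is not the package's code.

```python
def posterior_race(p_race_given_surname, p_geo_given_race):
    """Combine a surname-based prior with a geolocation likelihood.

    Assumes surname and geolocation are independent conditional on race,
    so P(R=r | S, G) is proportional to P(G | R=r) * P(R=r | S).
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical inputs (NOT real data).
surname_prior = {"white": 0.70, "black": 0.25, "other": 0.05}      # P(R | S)
geo_likelihood = {"white": 0.002, "black": 0.001, "other": 0.001}  # P(G | R)

posterior = posterior_race(surname_prior, geo_likelihood)
for race, prob in posterior.items():
    print(f"{race}: {prob:.3f}")
```

Note that only the relative sizes of P(G | R) across races matter: if the location is equally likely for every race, the posterior is just the surname prior.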
Yes, thank you - that's the paper I was looking at. I was wondering whether, given the very different underlying race distributions by state, the concrete implementation of the algorithm (as opposed to the description in the paper) may be introducing avoidable bias. If the set G included all census divisions, not just the ones in a particular state, wouldn't results likely be more comparable across states?
I didn't see anything in the paper that indicated that G should be per-state, though I realize that with certain assumptions the results would match.
You are correct that bias is introduced to the extent that the assumption is violated. In our experience, a finer census area can give you a better prediction. Unfortunately, the statistics on names and race are only available at the national level, although you could collect additional data from southern states, for example, where voter files ask people about their race. You can incorporate these and other additional pieces of information to improve the accuracy of the predictions. In the paper, for example, we use partisanship to improve the prediction.
After looking at the code for a bit, I got somewhat confused about predictions for results across states.
If I estimate race across states with very different overall racial makeup, I get results I was not expecting. For example, if I compare Auburn, ME (~95% White, 2.5% Black) with Batesville, MS (52% White, 46% Black), I find that someone named SMITH is more likely to be White in MS.
yields
What I don't get: the values for r_whi, etc., are calculated on a per-state basis, ignoring national demographics, but the census surname distributions are national. Are these reconciled somewhere, or is this just based on an assumption that name distributions are independent of states (something to do with note 4 in your paper)? I admit to being a Bayesian newbie, so I'm likely just missing something or my expectations may be wrong.
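The question can be illustrated with a small sketch that computes P(G | R) for one area using two different reference populations (the area's state versus the whole country) and then applies the same national surname prior. All counts, the two-category simplification, and the helper names are hypothetical; this is only meant to show why the choice of denominator can matter, not how the package actually computes r_whi.

```python
def geo_given_race(count_in_area, count_in_reference):
    """P(G=g | R=r): share of each race's reference population living in area g."""
    return {r: count_in_area[r] / count_in_reference[r] for r in count_in_area}


def posterior_race(prior, likelihood):
    """Bayes update assuming surname and location are independent given race."""
    unnorm = {r: prior[r] * likelihood[r] for r in prior}
    total = sum(unnorm.values())
    return {r: v / total for r, v in unnorm.items()}


# Hypothetical counts (NOT real census figures), two categories for brevity.
area = {"white": 5_200, "black": 4_600}
state = {"white": 1_700_000, "black": 1_100_000}
nation = {"white": 200_000_000, "black": 40_000_000}

# Hypothetical national surname prior P(R | S).
surname_prior = {"white": 0.70, "black": 0.30}

# Same area, same prior, different reference population for P(G | R).
post_state = posterior_race(surname_prior, geo_given_race(area, state))
post_nation = posterior_race(surname_prior, geo_given_race(area, nation))

print("per-state denominator:", round(post_state["white"], 3))
print("national denominator: ", round(post_nation["white"], 3))
```

With these made-up counts the two posteriors differ, because the state's racial makeup differs from the national one while the surname prior stays national. They would coincide only if the state's race shares were proportional to the nation's.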