kosukeimai / wru

Who Are You? Bayesian Prediction of Racial Category Using Surname and Geolocation

Census variable selection? #145

Closed ericmanning closed 8 months ago

ericmanning commented 8 months ago

Why are race totals assigned by the following variables

r_whi = "P12I_001N"
r_bla = "P12B_001N"
r_his = "P12H_001N"
r_asi = c("P12D_001N", "P12E_001N")
r_oth = c("P12C_001N", "P12F_001N", "P12G_001N")

which correspond to the following Census tables

| 2020 DHC table | 2010 DHC table | Title |
| --- | --- | --- |
| P12B | P12B | SEX BY AGE FOR SELECTED AGE CATEGORIES (BLACK OR AFRICAN AMERICAN ALONE) |
| P12C | P12C | SEX BY AGE FOR SELECTED AGE CATEGORIES (AMERICAN INDIAN AND ALASKA NATIVE ALONE) |
| P12D | P12D | SEX BY AGE FOR SELECTED AGE CATEGORIES (ASIAN ALONE) |
| P12E | P12E | SEX BY AGE FOR SELECTED AGE CATEGORIES (NATIVE HAWAIIAN AND OTHER PACIFIC ISLANDER ALONE) |
| P12F | P12F | SEX BY AGE FOR SELECTED AGE CATEGORIES (SOME OTHER RACE ALONE) |
| P12G | P12G | SEX BY AGE FOR SELECTED AGE CATEGORIES (TWO OR MORE RACES) |
| P12H | P12H | SEX BY AGE FOR SELECTED AGE CATEGORIES (HISPANIC OR LATINO) |
| P12I | P12I | SEX BY AGE FOR SELECTED AGE CATEGORIES (WHITE ALONE, NOT HISPANIC OR LATINO) |

and not the following tables' variables instead?

| 2020 DHC table | 2010 DHC table | Title |
| --- | --- | --- |
| P12H | P12H | SEX BY AGE FOR SELECTED AGE CATEGORIES (HISPANIC OR LATINO) |
| P12I | P12I | SEX BY AGE FOR SELECTED AGE CATEGORIES (WHITE ALONE, NOT HISPANIC OR LATINO) |
| P12J | N/A | SEX BY AGE FOR SELECTED AGE CATEGORIES (BLACK OR AFRICAN AMERICAN ALONE, NOT HISPANIC OR LATINO) |
| P12K | N/A | SEX BY AGE FOR SELECTED AGE CATEGORIES (AMERICAN INDIAN AND ALASKA NATIVE ALONE, NOT HISPANIC OR LATINO) |
| P12L | N/A | SEX BY AGE FOR SELECTED AGE CATEGORIES (ASIAN ALONE, NOT HISPANIC OR LATINO) |
| P12M | N/A | SEX BY AGE FOR SELECTED AGE CATEGORIES (NATIVE HAWAIIAN AND OTHER PACIFIC ISLANDER ALONE, NOT HISPANIC OR LATINO) |
| P12N | N/A | SEX BY AGE FOR SELECTED AGE CATEGORIES (SOME OTHER RACE ALONE, NOT HISPANIC OR LATINO) |
| P12O | N/A | SEX BY AGE FOR SELECTED AGE CATEGORIES (TWO OR MORE RACES, NOT HISPANIC OR LATINO) |

Using the former yields aggregate population counts that exceed the total population of each geography, because it double-counts Hispanic or Latino individuals of any race other than White alone: once in P12H and again in their race table. The latter yields matching counts.
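To make the double-counting concrete, here is a small arithmetic sketch. All counts are invented for a single hypothetical geography of 1,000 people; they are not Census data, and only the `_001N` (total) cell of each table is represented.

```r
# Hypothetical counts for one geography, invented for illustration only.
total_pop <- 1000

# Current selection: race-alone tables include Hispanic members of each race,
# and P12H counts all Hispanic or Latino people regardless of race.
current <- c(
  P12I = 500,  # White alone, not Hispanic or Latino
  P12B = 150,  # Black alone (any ethnicity)
  P12H = 200,  # Hispanic or Latino (any race)
  P12D = 80,   # Asian alone (any ethnicity)
  P12E = 10,   # NHPI alone (any ethnicity)
  P12C = 20,   # AIAN alone (any ethnicity)
  P12F = 90,   # Some other race alone (any ethnicity)
  P12G = 70    # Two or more races (any ethnicity)
)

# Proposed selection: every non-White category is restricted to
# "not Hispanic or Latino", so the categories are mutually exclusive.
proposed <- c(
  P12I = 500, P12H = 200,
  P12J = 140, P12K = 15, P12L = 78,
  P12M = 9,   P12N = 8,  P12O = 50
)

sum(current)              # 1120: exceeds the total
sum(proposed)             # 1000: matches the total
sum(current) - total_pop  # 120: Hispanic people who are not White alone
```

In this made-up example the excess of 120 is exactly the number of Hispanic or Latino people who also appear in one of the P12B through P12G tables.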

ericmanning commented 8 months ago

Might be related to #138

1beb commented 8 months ago

Thank you, getting these tables right is a challenge sometimes. I'm checking in with the team on this one.

ericmanning commented 8 months ago

The Census Bureau did not publish the P12J through P12O information in any summary file for the 2010 census. So if my suggestion is correct, you can't actually tabulate age and sex by block, Hispanic/Latino origin, and race for 2010.

1beb commented 8 months ago

Correct, we need to set up warnings so that people use an older version of the package, as it's not backwards compatible with pre-2020 data.

ericmanning commented 8 months ago

FWIW, the package has always used the current set of variables for sex and age, which are incorrect -- so (correct me if I'm wrong) any version will produce inaccurate estimates for 2010 if age OR sex is TRUE

From wru-0.1-12/R/census_geo_api.R,

  if (age == F & sex == F) {
    num <- ifelse(3:10 != 10, paste("0", 3:10, sep = ""), "10")
    vars <- paste("P0050", num, sep = "")
  }

  if (age == F & sex == T) {
    eth.let <- c("I", "B", "H", "D", "E", "F", "C")
    num <- as.character(c("01", "02", "26"))
    vars <- NULL
    for (e in 1:length(eth.let)) {
      vars <- c(vars, paste("P012", eth.let[e], "0", num, sep = ""))
    }
  }

  if (age == T & sex == F) {
    eth.let <- c("I", "B", "H", "D", "E", "F", "C")
    num <- as.character(c(c("01", "03", "04", "05", "06", "07", "08", "09"), seq(10, 25), seq(27, 49)))
    vars <- NULL
    for (e in 1:length(eth.let)) {
      vars <- c(vars, paste("P012", eth.let[e], "0", num, sep = ""))
    }
  }

  if (age == T & sex == T) {
    eth.let <- c("I", "B", "H", "D", "E", "F", "C")
    num <- as.character(c(c("01", "03", "04", "05", "06", "07", "08", "09"), seq(10, 25), seq(27, 49)))
    vars <- NULL
    for (e in 1:length(eth.let)) {
      vars <- c(vars, paste("P012", eth.let[e], "0", num, sep = ""))
    }
  }
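
For comparison, a minimal sketch of how 2020 DHC variable names could be built from the mutually exclusive tables instead. This is not the package's implementation: `build_p12_vars` is a hypothetical helper, the name pattern is inferred from variables like `P12I_001N` quoted at the top of this issue, and the item numbers 1, 2, and 26 (total, male, female) are assumed to carry over from the `sex == T` branch above.

```r
# Hypothetical helper, not part of wru: build variable names of the form
# "P12<letter>_<3-digit item>N" for every table/item combination.
build_p12_vars <- function(tables, items) {
  # expand.grid varies its first argument fastest, so names are grouped by table.
  grid <- expand.grid(item = items, table = tables, stringsAsFactors = FALSE)
  sprintf("P12%s_%03dN", grid$table, grid$item)
}

# Mutually exclusive tables: I (White alone, non-Hispanic), H (Hispanic),
# and J through O (remaining categories, non-Hispanic).
exclusive_tables <- c("I", "J", "H", "L", "M", "N", "K", "O")

# Assumed item numbers: 1 = total, 2 = male, 26 = female.
sex_items <- c(1L, 2L, 26L)

vars <- build_p12_vars(exclusive_tables, sex_items)
head(vars, 4)
```

Because these tables do not exist for 2010, a sketch like this would only apply to 2020 requests.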