kuriwaki / ccesMRPprep

Functions to Clean and Prepare CCES data for MRP
https://www.shirokuriwaki.com/ccesMRPprep/

missing education codes #12

Open ylelkes opened 1 year ago

ylelkes commented 1 year ago

Hi there, any idea why acscodes_sex_educ_race only returns rows for "HS or Less" and "4-Year"?

acs_tab <- get_acs_cces(
  varlist = acscodes_sex_educ_race,
  varlab_df = acscodes_df,
  year = 2021,
  dataset = "acs5"
)

[Screenshot, 2023-05-08 3:02 PM]

thanks!

kuriwaki commented 1 year ago

The recoding of education to factors was incorrect. For this partition, the mapping should have been:

table of codes ending in: associated education level

Below is a demo without the get_acs_cces wrapper.

I will try to fix it soon in dev, but I might need to be creative, since the way education is grouped here is different from the other partition, acscodes_age_sex_educ, that we were using in the paper. Apologies.
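
As a rough illustration of what a suffix-based recoding could look like, here is a minimal base R sketch. The mapping is inferred from the variable labels in the demo and from the output posted later in this thread, not copied from the package source. The names `educ3_by_suffix` and `educ_3` are only illustrative here, and the `_003` and `_008` rows assume the "Less than high school diploma" codes are included in the partition.

```r
# Hypothetical lookup: last three digits of the ACS code -> 3-way education level.
# Pairs are inferred from the tidycensus labels and the thread's later output.
educ3_by_suffix <- c(
  "003" = "HS or Less",          # Male, less than high school diploma
  "004" = "HS or Less",          # Male, high school graduate
  "005" = "Some College",        # Male, some college or associate's degree
  "006" = "4-Year or Post-Grad", # Male, bachelor's degree or higher
  "008" = "HS or Less",          # Female, less than high school diploma
  "009" = "HS or Less",          # Female, high school graduate
  "010" = "Some College",        # Female, some college or associate's degree
  "011" = "4-Year or Post-Grad"  # Female, bachelor's degree or higher
)

acscodes <- c("C15002B_004", "C15002B_005", "C15002B_006", "C15002C_009")

# Strip everything up to the underscore, then look up the education level.
suffix <- sub(".*_", "", acscodes)
educ_3 <- factor(
  unname(educ3_by_suffix[suffix]),
  levels = c("HS or Less", "Some College", "4-Year or Post-Grad")
)

data.frame(acscode = acscodes, educ_3 = educ_3)
```

A named vector keeps the mapping in one place, so a code that is missing from the lookup shows up as NA rather than being silently assigned to a level.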

Variables

library(ccesMRPprep)
library(tidycensus)
library(dplyr)

acs5_vars <- load_variables("acs5", year = 2021)

acs5_vars |> 
  filter(name %in% acscodes_sex_educ_race)
#> # A tibble: 48 × 4
#>    name        label                                           concept geography
#>    <chr>       <chr>                                           <chr>   <chr>    
#>  1 C15002B_004 Estimate!!Total:!!Male:!!High school graduate … SEX BY… tract    
#>  2 C15002B_005 Estimate!!Total:!!Male:!!Some college or assoc… SEX BY… tract    
#>  3 C15002B_006 Estimate!!Total:!!Male:!!Bachelor's degree or … SEX BY… tract    
#>  4 C15002B_009 Estimate!!Total:!!Female:!!High school graduat… SEX BY… tract    
#>  5 C15002B_010 Estimate!!Total:!!Female:!!Some college or ass… SEX BY… tract    
#>  6 C15002B_011 Estimate!!Total:!!Female:!!Bachelor's degree o… SEX BY… tract    
#>  7 C15002C_004 Estimate!!Total:!!Male:!!High school graduate … SEX BY… tract    
#>  8 C15002C_005 Estimate!!Total:!!Male:!!Some college or assoc… SEX BY… tract    
#>  9 C15002C_006 Estimate!!Total:!!Male:!!Bachelor's degree or … SEX BY… tract    
#> 10 C15002C_009 Estimate!!Total:!!Female:!!High school graduat… SEX BY… tract    
#> # ℹ 38 more rows

Created on 2023-05-08 with reprex v2.0.2

ylelkes commented 1 year ago

Thank you!

ylelkes commented 1 year ago

in case it's helpful, here is what i did, combining your code with some of my own:

library(ccesMRPprep)
library(tidycensus)
library(tidyverse)
library(glue)
# categories in ACS
ages  <- c("18 to 24 years",
           "25 to 34 years",
           "35 to 44 years",
           "45 to 64 years",
           "65 years and over",
           "18 and 19 years",
           "20 to 24 years",
           "25 to 29 years",
           "30 to 34 years",
           "35 to 44 years",
           "45 to 54 years",
           "55 to 64 years",
           "65 to 74 years",
           "75 to 84 years",
           "85 years and over")
education <- c("Less than high school diploma",
               "High school graduate \\(includes equivalency\\)",
               "Some college or associate's degree",
               "Bachelor's degree or higher")
races <- c("White alone, not Hispanic or Latino",
           "Hispanic or Latino",
           "Black or African American alone",
           "American Indian and Alaska Native alone",
           "Asian alone",
           "Native Hawaiian and Other Pacific Islander alone",
           "Some other race alone",
           "Two or more races" #,
           # "Two or more races!!Two races including Some other race",
           # "Two or more races!!Two races excluding Some other race, and three or more races"
)

ages_regex  <- as.character(glue("({str_c(ages, collapse = '|')})"))
edu_regex   <- as.character(glue("({str_c(education, collapse = '|')})"))
races_regex <- as.character(glue("({str_c(races, collapse = '|')})"))

# get vars ----
acs5_vars <- load_variables("acs5", year = 2021)

vars <- acs5_vars |> 
  filter(name %in% c(acscodes_sex_educ_race,
                     # add "Less than high school diploma": _003 (Male) and _008 (Female)
                     paste0("C15002", rep(LETTERS[2:9], each = 2), c("_003", "_008"))))

# format these and recode
# to strings ----

vars <- vars %>%
  mutate(variable = name) %>%
  separate(name, sep = "_", into = c("table", "num")) %>%
  select(variable, table, concept, num, label, everything()) %>%
  filter(str_detect(label, "Total")) %>%
  mutate(label = str_remove(label, "Estimate!!Total")) %>%
  mutate(gender = str_extract(label, "(Male|Female)"),
         age = str_extract(label, ages_regex),
         educ = str_extract(label, edu_regex),
         race = coalesce(str_extract(label, regex(races_regex, ignore_case = TRUE)),
                         str_extract(concept, regex(races_regex, ignore_case = TRUE))))

acs_tab <- tidycensus::get_acs(
  geography = "congressional district",
  variables = vars$variable,
  year = 2021
)
out <- acs_tab %>% left_join(vars, by = "variable")

out <- out %>%
  mutate(
    race_new = case_when(
      race == "WHITE ALONE, NOT HISPANIC OR LATINO" ~ "White",
      race == "BLACK OR AFRICAN AMERICAN ALONE" ~ "Black",
      race == "ASIAN ALONE" ~ "Asian",
      race == "AMERICAN INDIAN AND ALASKA NATIVE ALONE" ~ "Native American",
      race == "HISPANIC OR LATINO" ~ "Hispanic",
      TRUE ~ "All Other"
    ),
    educ_new = case_when(
      educ == "Less than high school diploma" ~ "HS or Less",
      educ == "High school graduate (includes equivalency)" ~ "HS or Less",
      TRUE ~ educ
    )
  )
out$educ_race <- interaction(out$race_new, out$educ_new)

post_strat <- out %>%
  group_by(GEOID, gender, educ = educ_new, educ_race, race = race_new) %>%
  summarise(n = sum(estimate)) %>%
  group_by(GEOID) %>%
  mutate(cd_total = sum(n), cd_per = n / cd_total) %>%
  drop_na()

# create a vector of state abbreviations and FIPS codes
state_info <- data.frame(
  state_abbrev = c(
    'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA',
    'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
    'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
    'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC',
    'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'DC'
  ),
  state_fips = c(
    '01', '02', '04', '05', '06', '08', '09', '10', '12', '13',
    '15', '16', '17', '18', '19', '20', '21', '22', '23', '24',
    '25', '26', '27', '28', '29', '30', '31', '32', '33', '34',
    '35', '36', '37', '38', '39', '40', '41', '42', '44', '45',
    '46', '47', '48', '49', '50', '51', '53', '54', '55', '56', '11'
  )
)
# create a function to convert a vector of geoids to state abbreviations and district numbers
convert_geoids <- function(geoids) {
  # create an empty vector to store the converted geocodes
  converted <- vector(mode = "character", length = length(geoids))

  for (i in seq_along(geoids)) {
    # extract the state FIPS code and congressional district number from the geoid
    state_fips <- substr(geoids[i], 1, 2)
    district <- substr(geoids[i], 3, 4)

    # convert the state FIPS code to a state abbreviation
    if (state_fips == "11") {
      # Washington DC
      state_abbrev <- "DC"
    } else {
      # find the state abbreviation corresponding to the state FIPS code
      state_index <- which(state_info$state_fips == state_fips)
      state_abbrev <- state_info$state_abbrev[state_index]
    }

    # convert district "00" to "01"
    if (district == "00") {
      district <- "01"
    }
    # convert district "98" to "01"
    if (district == "98") {
      district <- "01"
    }

    # combine the state abbreviation and district number into a single string
    converted[i] <- paste(state_abbrev, district, sep = "-")
  }

  return(converted)
}

# create a function to extract state FIPS codes from converted geocodes
extract_state_fips <- function(converted_geocodes) {
  # create an empty vector to store the state FIPS codes
  state_fips <- vector(mode = "character", length = length(converted_geocodes))

  for (i in seq_along(converted_geocodes)) {
    # split the converted geocode into state abbreviation and district number
    parts <- strsplit(converted_geocodes[i], "-")[[1]]
    state_abbrev <- parts[1]

    if (state_abbrev == "DC") {
      # Washington DC
      state_fips[i] <- "11"
    } else {
      # find the state FIPS code corresponding to the state abbreviation
      state_index <- which(state_info$state_abbrev == state_abbrev)

      if (length(state_index) == 0) {
        # no matching state abbreviation found
        state_fips[i] <- NA
      } else {
        state_fips[i] <- state_info$state_fips[state_index]
      }
    }
  }

  return(state_fips)
}

post_strat$cd <- convert_geoids(post_strat$GEOID)
post_strat$STATEFIP <- extract_state_fips(post_strat$cd)
post_strat$GEOID[is.na(post_strat$STATEFIP)]

post_strat$pct_trump <- cd_info_2020$pct_trump[match(post_strat$cd,cd_info_2020$cd)]
post_strat$trumpvote5050 <- abs(post_strat$pct_trump-.50)
save(post_strat,file = "poststrat_cd.RData")
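
The GEOID conversion above could also be written without loops. The following is only a sketch of that alternative, using a small subset of the FIPS table and a hypothetical helper name, `geoid_to_cd`. One deliberate difference: a state FIPS code that is not in the lookup (for example Puerto Rico, 72) gives NA directly instead of a string with an empty state part.

```r
# Small illustrative subset of the FIPS lookup; extend to all states as needed.
fips_lookup <- c("01" = "AL", "02" = "AK", "11" = "DC", "56" = "WY")

# Hypothetical vectorized helper: 4-character GEOID -> "ST-DD".
geoid_to_cd <- function(geoids, lookup = fips_lookup) {
  state <- unname(lookup[substr(geoids, 1, 2)])  # NA when the FIPS code is not in the lookup
  dist  <- substr(geoids, 3, 4)
  dist[dist %in% c("00", "98")] <- "01"          # at-large and delegate districts become "01"
  ifelse(is.na(state), NA_character_, paste(state, dist, sep = "-"))
}

geoid_to_cd(c("0101", "0200", "1198", "7298"))
```

Indexing a named vector does the FIPS-to-abbreviation match for the whole column at once, so no special case for DC is needed once it is in the lookup.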

kuriwaki commented 1 year ago

I started some edits in https://github.com/kuriwaki/ccesMRPprep/tree/fix_iss12 which I will update and merge.

@ylelkes thanks for the suggested code. In this case, education uses the three-way coding in which BA and post-grad form one level, which means the CES survey data will need to be recoded to match. This bears on what I am contemplating: whether to distinguish between 4-way education and educ_3 (this three-way coding) in the output of get_acs_cces, or to coalesce them into one variable (so the user will need to know how to recode education on the CES side).

The edits in the branch currently do the latter.
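
For anyone recoding on the CES side by hand, collapsing the 4-way variable to the 3-way one might look like the base R sketch below. The helper name `collapse_educ3` and the toy vector are hypothetical, and the level labels are assumed to match the ones that appear in this thread's output.

```r
# Toy vector standing in for a 4-way CES education variable.
educ4_levels <- c("HS or Less", "Some College", "4-Year", "Post-Grad")
educ <- factor(
  c("4-Year", "HS or Less", "Post-Grad", "Some College"),
  levels = educ4_levels
)

# Hypothetical helper: merge "4-Year" and "Post-Grad" into a single top level.
collapse_educ3 <- function(x) {
  out <- as.character(x)
  out[out %in% c("4-Year", "Post-Grad")] <- "4-Year or Post-Grad"
  factor(out, levels = c("HS or Less", "Some College", "4-Year or Post-Grad"))
}

# Cross-tabulate to confirm that only the top two levels were merged.
table(educ, educ_3 = collapse_educ3(educ))
```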

kuriwaki commented 1 year ago

I changed plans and will do the former -- formally distinguish between educ (4-way) and educ_3 (3-way). This way there is no ambiguity between the two types when obtaining data. I added educ_3 as a standard demographic in ccc_std_demographics().

library(ccesMRPprep)
library(dplyr)
packageVersion("ccesMRPprep")
#> [1] '0.1.11.9999'

acs_tab <- get_acs_cces(
  varlab_df = acscodes_df,
  varlist = acscodes_sex_educ_race,
  year = 2021,
  dataset = "acs5"
)
#> Getting data from the 2017-2021 5-year ACS
# no educ, only educ_3
acs_tab
#> # A tibble: 28,096 × 9
#>    acscode      year cd    gender female educ_3            race  count count_moe
#>    <chr>       <dbl> <chr> <fct>   <int> <fct>             <fct> <dbl>     <dbl>
#>  1 C15002B_003  2021 AL-01 Male        0 HS or Less        Black  9804       713
#>  2 C15002B_004  2021 AL-01 Male        0 HS or Less        Black 25034      1256
#>  3 C15002B_005  2021 AL-01 Male        0 Some College      Black 14180       995
#>  4 C15002B_006  2021 AL-01 Male        0 4-Year or Post-G… Black  6767       804
#>  5 C15002B_008  2021 AL-01 Female      1 HS or Less        Black  8304       734
#>  6 C15002B_009  2021 AL-01 Female      1 HS or Less        Black 23756      1237
#>  7 C15002B_010  2021 AL-01 Female      1 Some College      Black 23514       945
#>  8 C15002B_011  2021 AL-01 Female      1 4-Year or Post-G… Black 13255       840
#>  9 C15002C_003  2021 AL-01 Male        0 HS or Less        Nati…   520       200
#> 10 C15002C_004  2021 AL-01 Male        0 HS or Less        Nati…   624       168
#> # ℹ 28,086 more rows
count(acs_tab, educ_3)
#> # A tibble: 3 × 2
#>   educ_3                  n
#>   <fct>               <int>
#> 1 HS or Less          14048
#> 2 Some College         7024
#> 3 4-Year or Post-Grad  7024

# This one gets you four-way education
get_acs_cces(
  varlab_df = acscodes_df,
  varlist = acscodes_age_sex_educ,
  year = 2021,
  dataset = "acs5"
)
#> Getting data from the 2017-2021 5-year ACS
#> # A tibble: 30,730 × 9
#>    acscode     year cd    gender female educ         age         count count_moe
#>    <chr>      <dbl> <chr> <fct>   <int> <fct>        <fct>       <dbl>     <dbl>
#>  1 B15001_004  2021 AL-01 Male        0 HS or Less   18 to 24 y…   329       146
#>  2 B15001_005  2021 AL-01 Male        0 HS or Less   18 to 24 y…  4670       591
#>  3 B15001_006  2021 AL-01 Male        0 HS or Less   18 to 24 y… 12135       801
#>  4 B15001_007  2021 AL-01 Male        0 Some College 18 to 24 y…  9791       841
#>  5 B15001_008  2021 AL-01 Male        0 Some College 18 to 24 y…  1507       361
#>  6 B15001_009  2021 AL-01 Male        0 4-Year       18 to 24 y…  1379       324
#>  7 B15001_010  2021 AL-01 Male        0 Post-Grad    18 to 24 y…    65        44
#>  8 B15001_012  2021 AL-01 Male        0 HS or Less   25 to 34 y…  1009       281
#>  9 B15001_013  2021 AL-01 Male        0 HS or Less   25 to 34 y…  4920       742
#> 10 B15001_014  2021 AL-01 Male        0 HS or Less   25 to 34 y… 15597      1116
#> # ℹ 30,720 more rows

# ccc_std_demographics provides BOTH educ and educ_3
ccc_std_demographics(ccc_samp) |> 
  count(educ, educ_3)
#> age variable modified to bins. Original age variable is now in age_orig.
#> # A tibble: 4 × 3
#>   educ             educ_3                      n
#>   <dbl+lbl>        <dbl+lbl>               <int>
#> 1 1 [HS or Less]   1 [HS or Less]            293
#> 2 2 [Some College] 2 [Some College]          355
#> 3 3 [4-Year]       3 [4-Year or Post-Grad]   244
#> 4 4 [Post-Grad]    3 [4-Year or Post-Grad]   108

Created on 2023-06-19 with reprex v2.0.2
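
One thing to watch in the educ_3 output: "HS or Less" has twice as many rows as the other levels (14048 vs. 7024) because two ACS codes map to it within each cell. If one count per cell is needed for post-stratification, the rows can be summed. This is a minimal sketch using the first four rows printed above, typed in by hand; whether the package does this step downstream is not something this thread states.

```r
# First four rows of the acs_tab output above (AL-01, Black men), typed in by hand.
acs_rows <- data.frame(
  acscode = c("C15002B_003", "C15002B_004", "C15002B_005", "C15002B_006"),
  cd      = "AL-01",
  gender  = "Male",
  educ_3  = c("HS or Less", "HS or Less", "Some College", "4-Year or Post-Grad"),
  race    = "Black",
  count   = c(9804, 25034, 14180, 6767)
)

# Sum counts within each cd x gender x educ_3 x race cell.
cells <- aggregate(count ~ cd + gender + educ_3 + race, data = acs_rows, FUN = sum)
cells
```

Here the two "HS or Less" rows (9804 and 25034) collapse into a single cell of 34838.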