Organizing Data - GitHub Issues

jordanijames commented 1 month ago

@AaronGullickson Hello! I have organized the data: I selected and renamed the variables for both data sets. The new subsets I made, "public_2019" and "private_2019", have a placeholder in the table that isn't actually data, and I don't know how to get rid of it. If you organize the subsets by county_name, the 1st column will be empty, and I want to get rid of that column.

AaronGullickson commented 1 month ago

Good work!

It looks like we are still getting one summary line at the bottom of the file:

tail(private_schools)
# A tibble: 6 × 19
  `Private School Name`              State Name [Private …¹ American Indian/Alas…² Asian or Asian/Pacif…³
  <chr>                              <chr>                                   <dbl>                  <dbl>
1 "ZION TEMPLE CHRISTIAN ACADEMY"    "OHIO"                                      0                      0
2 "ZION'S HILL BAPTIST SCHOOL"       "INDIANA"                                   0                      0
3 "ZION-ST JOHN LUTHERAN SCHOOL"     "IOWA"                                      0                      3
4 "ZUNI CHRISTIAN MISSION SCHOOL"    "NEW MEXICO"                               93                      0
5 "ZVI DOV ROTH ACADEMY OF YESHIVA … "NEW YORK"                                  0                      0
6 "Data Source: U.S. Department of … ""                                         NA                     NA

Reducing n_max by 1 should fix that problem.
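As a rough sketch of what lowering n_max does, the example below reads a tiny made-up CSV whose last line is a footer rather than data. The text, column names, and counts are all hypothetical stand-ins for the real file; only read_csv and its n_max argument come from readr.

```r
library(readr)

# Hypothetical stand-in for the raw export: 3 school rows plus a footer line
csv_text <- paste(
  "school,state,enrollment",
  "School A,OHIO,120",
  "School B,NEW MEXICO,93",
  "School C,NEW YORK,200",
  "Data Source: footer text that is not a school,,",
  sep = "\n"
)

# n_max counts data rows (the header is not included), so 3 stops before the footer
schools <- read_csv(I(csv_text), n_max = 3, show_col_types = FALSE)

tail(schools)
```

Checking tail() again after the change is a quick way to confirm that the last row is a real school.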

I would not group_by county name because county names are not unique across states. That's what the FIPS number is for. So, use county_code in the group_by command.
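A small illustration of why the name is a risky grouping key, using toy rows (the counts and codes below are only for illustration, not taken from the real data):

```r
library(dplyr)

# Toy rows: the same county name appears under two different county codes
schools <- tibble(
  county_name = c("Washington County", "Washington County", "Washington County"),
  county_code = c(1129, 1129, 5143),
  n_students  = c(100, 50, 200)
)

# Grouping by name merges the two counties into one row
by_name <- schools |>
  group_by(county_name) |>
  summarize(n_schools = n())

# Grouping by code keeps them separate
by_code <- schools |>
  group_by(county_code) |>
  summarize(n_schools = n())
```

Here by_name has one row while by_code has two, which is the behavior you want for county-level measures.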

Also, the county_code variable has leading zeroes, which is causing it to be treated as a character variable. You can fix that easily by doing:

public_2019 <- public_2019 |>
    mutate(county_code = as.numeric(county_code))

private_2019 <- private_2019 |>
    mutate(county_code = as.numeric(county_code))

You could also fix this when you read the data in by specifying col_types, but given the ugly variable names it would be a pain.
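For reference, here is what the as.numeric conversion does to codes with leading zeroes, shown on a toy character vector (the vector name codes is just for illustration):

```r
# Toy FIPS-style codes stored as text, as they would be after reading the file
codes <- c("01001", "01003", "56045")

# Conversion drops the zero padding and yields plain numbers
codes_num <- as.numeric(codes)
codes_num
```

The padding is lost, so "01001" becomes 1001. That is fine for grouping, but worth remembering if the codes are later matched against another source that keeps them as five-character strings.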

Lastly, the variable name number_of_private_schools is a bit verbose. We want to strike a balance between variable names that have meaning and ones that are so long that they make our code look terrible. I think something like n_private would be sufficient.

jordanijames commented 1 month ago

I made all the changes! I pushed what I have so far. Now I think I need to do the dissimilarity index part, but I'm kind of confused about how and when I should merge the public_2019 and private_county subsets. Also, how do I combine all the non-white race variables? Would I just make a new variable to add to the table? Do I make a public_county subset and group that a certain way? I know how to calculate the dissimilarity index; I'm just not sure how to get there.

AaronGullickson commented 1 month ago

You can create a new nonwhite variable by just adding up the other ones. This code will trim down the dataset and help you see how this will all work (replace temp with something better):

temp <- public_2019 |>
  mutate(NonWhite = AIAN+Asian+Hispanic+Black+Hawaiian+Multiracial) |>
  select(county_code, White, NonWhite) |>
  drop_na() |>
  arrange(county_code)

That is all you really need to calculate your dissimilarity measures. It's very similar to what we did last term for tracts, but now instead of tracts you have schools.

jordanijames commented 1 month ago

Thank you so much! Another question: when I arrange(county_code) in the table, I get the same county code over and over in the columns (county_code 1001, 1001, 1001, 1001). Is that supposed to happen? Because in the private_county one, when I group by county_code, it doesn't do that.

AaronGullickson commented 1 month ago

You haven't grouped yet. Each observation is a school and there are many schools per county so you see it many times. The same was true of the private school data before you grouped it. When you calculate the segregation index you will group the public school data as well.
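One possible way to go from school-level rows to one row per county is group_by followed by summarize. The sketch below uses made-up counts and assumes columns named county_code, White, and NonWhite, as in the trimmed dataset above; the output names n_schools and dissimilarity are placeholders.

```r
library(dplyr)

# Toy school-level rows (made-up counts): two schools in each of two counties
temp <- tibble(
  county_code = c(1001, 1001, 1003, 1003),
  White       = c(100, 0, 50, 50),
  NonWhite    = c(0, 100, 50, 50)
)

# Collapse to one row per county; sum() here runs within each county
county_d <- temp |>
  group_by(county_code) |>
  summarize(
    n_schools     = n(),
    dissimilarity = 50 * sum(abs(White / sum(White) - NonWhite / sum(NonWhite)))
  )
```

In this toy data the first county is completely segregated (index 100) and the second is perfectly even (index 0). The repeated county codes disappear because summarize returns a single row per group.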

jordanijames commented 1 month ago

Good morning Aaron, I'm sorry but I don't know what I'm doing wrong or missing. I'm using the group_by function and it still doesn't group the county_code variable. I tried doing what I did for private_county, but it's not working. I don't think I need the summarize command. I did change them to numeric values instead of characters, so I don't know what I'm doing wrong.

AaronGullickson commented 1 month ago

I am not seeing what code you are referring to. I see the creation of the private_county and public_county objects. That code works. The public_county object is not a county-level dataset though; it's a school-level dataset.

AaronGullickson commented 1 month ago

Regarding this code:

calc_dissimilarity <- function(public_county) {
  a <- public_county$White/sum(public_county$White)
  b <- public_county$NonWhite/sum(public_county$NonWhite)
  return(50 * sum(abs(a-b)))
}

It will work, but I think calling the argument public_county is confusing, since you gave your object that name and it's not the same thing. You want to just put in all of the schools for a given county, so you might want to call this argument county or something similar.

Your commented code below will work if you change tracts to public_county:

public_county |>
  filter(county_code == "1001") |>
  calc_dissimilarity()
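A minimal sketch of the renaming suggested above, plus one way the function might be applied to every county at once with base R's split and sapply. The data frame and its counts are toy values, and the names schools, by_county, and d_by_county are placeholders.

```r
# Same calculation as in the thread, with the argument renamed to `county`
calc_dissimilarity <- function(county) {
  a <- county$White / sum(county$White)
  b <- county$NonWhite / sum(county$NonWhite)
  50 * sum(abs(a - b))
}

# Toy school-level rows (made-up counts)
schools <- data.frame(
  county_code = c(1001, 1001, 1003, 1003),
  White       = c(30, 10, 20, 20),
  NonWhite    = c(10, 30, 5, 15)
)

# split() makes one data frame per county; sapply() runs the function on each
by_county   <- split(schools, schools$county_code)
d_by_county <- sapply(by_county, calc_dissimilarity)
d_by_county
```

The result is a named numeric vector with one dissimilarity value per county code, here 50 for the first county and 25 for the second.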

jordanijames / School-Segregation

Organizing Data #3