jordanijames / School-Segregation

MIT License

Adding ACS data #4

Open jordanijames opened 4 weeks ago

jordanijames commented 4 weeks ago

Hi @AaronGullickson, I added and cleaned the ACS county data and the control variables I want. I'm not sure how to add this to the school_dissim data set.

AaronGullickson commented 3 weeks ago

You have already labeled the FIPS code as county_code so you should just be able to left join it:

school_dissim <- left_join(school_dissim, acs_county, by=c("county_code"))

However, when I do this I get a lot of missing data from the ACS which doesn't make any sense. Looking at it, you only have 840 counties in the ACS data which is way too few. I am not sure what happened there, but maybe something in how you created the extract?
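A minimal sketch of how the unmatched counties could be checked before or after the join. The two data frames below are toy stand-ins with made-up values, and `median_income` and `dissim` are hypothetical column names; only `county_code` comes from the thread.

```r
library(dplyr)

# Toy stand-ins for the real data; every value here is made up for illustration.
school_dissim <- tibble(
  county_code = c("01001", "01003", "01005"),
  dissim = c(0.42, 0.35, 0.51)            # hypothetical outcome column
)
acs_county <- tibble(
  county_code = c("01001", "01003"),
  median_income = c(58000, 61000)         # hypothetical control variable
)

# Counties in the school data that have no matching row in the ACS data
unmatched <- anti_join(school_dissim, acs_county, by = "county_code")
nrow(unmatched)

# Number of distinct counties available in the ACS extract
n_distinct(acs_county$county_code)
```

If `nrow(unmatched)` is large relative to the school data, the gap is in the ACS extract itself rather than in the join.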

AaronGullickson commented 3 weeks ago

Also, use read_csv not read.csv.

jordanijames commented 3 weeks ago

So should I even include this then if there is so much missing data? I'm not sure what I did wrong. In my analysis can I talk about future plans including control data instead?

jordanijames commented 3 weeks ago

I also could only find ACS data for 2019 and not 2020, so maybe that could be the issue.

AaronGullickson commented 3 weeks ago

It would not change that much.

jordanijames commented 3 weeks ago

The 840 counties are just what ACS gave me when I picked the variables I wanted. I'm also trying to create my graphs and models, but it's non-linear and I don't know how to make it linear like how we did in class. I'm also getting an error for my model. I tried just doing what we did for the non-linearity assignment.

AaronGullickson commented 3 weeks ago

Ahh, I see the problem. The 1-year ACS estimates only give you a limited set of counties, probably the most populous. You should use the 5-year estimates with the midpoint year being the one you want (so 2017-2021 gives you estimates centered on 2019). Those 5-year estimates will give you all counties (around 3200).

jordanijames commented 3 weeks ago

I added ACS 5-year estimates and created three models. I'm not sure what I'm doing wrong with my graphs; they're non-linear and I don't know how to fix them. Can I add control variables to my graphs? I'm not sure how to do that.

AaronGullickson commented 3 weeks ago

Ok, first things first. Your code for the analysis needs to be in analysis.qmd. You should save your analytical dataset at the bottom of organize_data.qmd with:

save(school_dissim, file=here("data", "data_constructed", "analytical_data.RData"))

Note you will probably need to create an empty "data_constructed" folder inside the data folder in your project. Then, you read this dataset into analysis.qmd with:

load(here("data", "data_constructed", "analytical_data.RData"))

It's important for the analysis to be separated from the data organization both logically and mechanically. You don't want to have to rerun your organize data script every time you want to test something out in your analysis.

Now beyond that, the problem you are looking at is heavy right-skewness on your private school count variable. The standard way to handle that would be to log that variable. You tried logging both variables, but I don't think there is any need to log the dependent variable, and doing so seems to lead to some problems of heteroskedasticity where you get a cone shape going from right to left.

Now the problem with logging the private school count variable is that log(0) is undefined so you lose all cases where there are no private schools in a county, which is quite a few. An alternative approach would be to do a square root transformation but the results from that are somewhat difficult to interpret directly.
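A small sketch of the difference between the two transformations, using made-up counts (the vector name `private_count` is hypothetical). In R, `log(0)` returns `-Inf`, so counties with zero private schools cannot be used as they are after logging, while the square root leaves them at zero.

```r
# Made-up counts of private schools per county, including some zeros
private_count <- c(0, 0, 1, 4, 25, 100)

# Logging: the zero counts become -Inf and are not usable in a model
logged <- log(private_count)
is.infinite(logged)

# Square root: the zero counts stay at 0, so no cases are lost
rooted <- sqrt(private_count)
rooted

# Share of counties that would be lost by logging
mean(private_count == 0)
```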

However, I think the more fundamental problem is that your private school count variable is not a great variable. More heavily populated counties will have more schools in general, including private schools. So your variable is mixing together the thing you care about (private schools as an option) and overall population size, and we can't tell which of these is driving the effect. I have suggested two alternate approaches a couple of times:

  1. Calculate the percentage of all schools that are private.
  2. Calculate the percentage of all students who go to private schools.

I think if you did either one of those you could separate out the effect you care about and you probably would not see the heavy skewness you are seeing now.
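Either percentage could be computed along these lines. This is only a sketch: the data frame `county_schools` and all of its count columns are hypothetical names with made-up values, so they would need to be swapped for whatever the real data set calls them.

```r
library(dplyr)

# Hypothetical county-level counts; column names and values are assumptions
county_schools <- tibble(
  county_code = c("01001", "01003", "01005"),
  n_private_schools = c(2, 10, 0),
  n_total_schools = c(20, 50, 8),
  n_private_students = c(300, 4000, 0),
  n_total_students = c(6000, 40000, 1500)
)

county_schools <- mutate(
  county_schools,
  # Option 1: percentage of all schools that are private
  pct_private_schools = 100 * n_private_schools / n_total_schools,
  # Option 2: percentage of all students who attend private schools
  pct_private_students = 100 * n_private_students / n_total_students
)

county_schools
```

A county with no private schools gets a value of 0 on both measures, so it stays in the data.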

In terms of adding controls, you can't really add them to the figure. Just use the models.
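As a rough illustration of adding a control in the model rather than the figure, here is a sketch on simulated data. The variable names (`dissim`, `pct_private`, `total_pop`) and the simulated relationship are assumptions for the example, not the project's actual variables or results.

```r
set.seed(1)
n <- 200

# Simulated data; names and values are hypothetical
sim <- data.frame(
  pct_private = runif(n, 0, 30),
  total_pop = exp(rnorm(n, 10, 1))
)
sim$dissim <- 0.3 + 0.004 * sim$pct_private +
  0.02 * log(sim$total_pop) + rnorm(n, 0, 0.05)

# Bivariate model, then the same model with a control added
model_bivariate <- lm(dissim ~ pct_private, data = sim)
model_controls <- lm(dissim ~ pct_private + log(total_pop), data = sim)

# Compare the pct_private coefficient with and without the control
coef(model_bivariate)
coef(model_controls)
```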

AaronGullickson commented 3 weeks ago

I see some of your later models are holding total population constant as a control, which is another way to handle the issue, although I think it is not as good as the suggestions above.

jordanijames commented 3 weeks ago

Okay, I'm trying to write up my paper, and I'm talking about the issues with the data and variables there and what I can do with Christine to make it better, BUT my main.manuscript quarto document won't render. It keeps failing.

jordanijames commented 3 weeks ago

Actually, never mind, I fixed it.