ozanj / rclass

read in ccd and pss data with variable labels and value labels #13

Open ozanj opened 5 years ago

ozanj commented 5 years ago

@cyouh95

Hi Crystal,

I have a data request for you for my R class. sorry to add to your already very full plate!

I'm beginning to create a problem set for my R class about merging data sources.

The problem set will use events_data.csv from C:\Users\ozanj\Dropbox\recruiting-m1\analysis\data but will only keep variables about the events.

Then I want students to merge in public school data from CCD and private school data from PSS. But I want them to merge in R-versions of the datasets found on the NCES website:

for public schools: https://nces.ed.gov/ccd/pubschuniv.asp

for private schools: https://nces.ed.gov/surveys/pss/pssdata.asp

The data can be downloaded in csv format, SAS datasets, or SPSS datasets. I used the haven package to read in the SAS version of one of the 2015-16 CCD datasets. that worked fine, but value labels were not retained so students would have no idea what the different values on variables mean. There is a separate SAS file (and SPSS file) that creates value labels. I'm not sure how to run the SAS/SPSS code from R, but I think it is possible. basically, I want variables with value labels to have class=labelled and to have both variable labels and value labels.

Can you create .Rdata files that meet these specifications?

I want this for all 2015-16 school-level CCD files and for 2015-16 PSS file.

I may ask you to do this for some small snippet of ACS data too. students really don't like working with variables that don't have variable labels and most of our data isn't labelled. but don't do anything with ACS data yet.

sorry for the extra request, but need help w/ this problem set so that I can get to writing the recruiting paper.

best wishes. ozan.

cyouh95 commented 5 years ago

@ozanj Here is the data. I wasn't too sure if it's possible to run the SAS code from R, so I ended up parsing the variable labels file (as suggested here). Hope this looks okay!

ozanj commented 5 years ago

@cyouh95

This is wonderful! thank you so much Crystal!!!!

I pulled the changes to my rclass github repository.

I could open the .Rdata you created. but I could not run your save_ccd_pss_data.R script.

Here is the error I got when I tried to run the first substantive line of code [line 7]:

characteristics <- read_sas('data/ccd/2015-16/ccd_1516_characteristics.sas7bdat')
Error: 'data/ccd/2015-16/ccd_1516_characteristics.sas7bdat' does not exist in current working directory ('C:/Users/ozanj/Documents/rclass').

when I went to that directory I only saw zip files for the data, but not the unzipped files. is this the problem? I want patricia to be able to run this script without errors. seems like best approach would be to leave data as is and insert an unzip function in the script? what do you think?
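A minimal sketch of the unzip-in-script idea, assuming base R's `unzip()` is enough here. The helper name `unzip_if_missing()` and the zip file name in the commented usage are hypothetical, not taken from the repository:

```r
# Hypothetical helper: extract the archive only when the target file is absent,
# so rerunning the script does not unzip the data again.
unzip_if_missing <- function(zip_path, target_path) {
  if (!file.exists(target_path)) {
    utils::unzip(zip_path, exdir = dirname(target_path))
  }
  invisible(target_path)
}

# Usage (paths are assumptions; the actual zip file name may differ):
# sas_path <- "data/ccd/2015-16/ccd_1516_characteristics.sas7bdat"
# unzip_if_missing("data/ccd/2015-16/ccd_1516_characteristics.zip", sas_path)
# characteristics <- haven::read_sas(sas_path)
```

With something like this, the zipped files could stay in the repository as they are and the script would still run from a fresh clone.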

ozanj commented 5 years ago

@cyouh95

second question. I was looking at the data frames ccd and pss.

most of the categorical variables in pss have the "labelled" class. by contrast, most of the categorical variables in the data frame ccd have character type/class. what is the reason for this?

would like the two data sources to be broadly consistent on this front. what is your recommendation for how/what to change?

cyouh95 commented 5 years ago

@ozanj Ah yup, the data files were originally unzipped, but when I tried to push it to Github, I got an error bc some of the files were too large, so I had to zip them up. But if it's possible to have the script unzip the file (or maybe somehow read in the zipped file?), I think that would be a good idea! I can try to look into this. (in the meantime, hopefully the R script works if you unzip the files locally!)

The only variables converted to labelled class were the ones in the SAS files (13 vars for ccd; 194 vars for pss) - would there be any other vars that need to be labelled besides those? (ie. what would the attached label for the other categorical vars be?)

ozanj commented 5 years ago

Thank you Crystal!

first, would it be possible to modify the script to unzip the files as part of the script?

second, in either the haven or labelled package, I believe there is a function to convert factor class variables to labelled class variables, called to_labelled(): https://cran.r-project.org/web/packages/labelled/labelled.pdf

would you be willing to convert all factor variables to labelled for both pss and ccd?

third, can you convert ccd variable names to lowercase prior to saving to .Rdata? I can do this easily on my own, but don't want students to have to do this within the problem set.

fourth, is it possible to create a version of the zip-code level ACS data with characteristics on zip code that has variable labels and value labels? I believe this is the file: C:\Users\ozanj\Dropbox\recruiting-m1\analysis\data\zip_to_state

I realize that our "production database" doesn't have variable labels and value labels, so let me know how much work this would take before you undertake the work. I know you have a lot on your plate already.

sorry to dump so much on you Crystal.

Best wishes. ozan.

cyouh95 commented 5 years ago

@ozanj No problem! Just added items 1 & 3 in commit above.
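For reference, item 3 (lowercase variable names) can be sketched in base R as below. This is an illustration on a toy data frame with made-up column names and values, not necessarily what the commit does:

```r
# Toy stand-in for the ccd data frame; columns and values are illustrative only
ccd <- data.frame(
  NCESSCH = c("000000000001", "000000000002"),
  STABR = c("AR", "CA")
)

# Convert every column name to lowercase before saving to .RData
names(ccd) <- tolower(names(ccd))

names(ccd)
```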

Not quite sure about 2... I don't think any of the variables are read in as factors (either character or numeric), so would we need to manually pick out which categorical variables we want to turn into factor and label those?

Data currently looks something like this: 1

stabr is already labelled bc it was in the SAS format file (so AR is labelled w/ Arkansas, CA is labelled w/ California, etc.) Do we then also want to make statename (which is currently character) into factor, then convert to labelled too?

If I tried doing that using to_labelled() function here, the data now looks like: 2

where the underlying var is now 5, 7, etc, but attached labels would say ARKANSAS, CALIFORNIA. Would this be what we want?
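A small sketch of that behaviour, assuming the `to_labelled()` function from the labelled package and a made-up three-row vector (so the codes here will not match the 5, 7 seen in the full data):

```r
library(labelled)

# Made-up example values; the real statename column has many more states
statename <- c("ARKANSAS", "CALIFORNIA", "ARKANSAS")

# character -> factor -> labelled: the stored values become the factor codes
statename_lab <- to_labelled(factor(statename))

# Underlying values are now numeric codes, not the original strings
as.numeric(unclass(statename_lab))

# The original strings survive only as value labels attached to those codes
val_labels(statename_lab)
```

The point is that the original character values are replaced by arbitrary codes, which depend on the alphabetical order of the levels.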

For item 4, just took a look at the zip_to_states.csv - for value labels, there doesn't seem to be many categorical variables to label I think. Maybe we could do state_code, state_fips, etc - but the other variables can't really be labelled, could they? (ie. total pop, median income, etc)

Sorry for all the questions!

ozanj commented 5 years ago

Thank you Crystal!

your questions are good ones. and after looking through the data more deeply I see that my advice was not good!

I think the ccd data are weird. for example, why is this variable a string variable rather than numeric?

str(ccd$titlei_status)
'labelled' chr [1:102401] "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" ...

$format.sas
[1] "$CHAR"

$labels (value, then label)
"1" Title I targeted assistance eligible school-No program
"2" Title I targeted assistance school
"3" Title I schoolwide eligible-Title I targeted assistance program
"4" Title I schoolwide eligible school-No program
"5" Title I schoolwide school
"6" Not a Title I school
"-9" Suppressed
"M" Missing

with that said, I see that one of the values is "M" so that is why it is a character variable. I agree with your decision that it is not a good idea to use the to_labelled() function to change vars to labelled class. so if you have done that, then please undo it. sorry and thanks!
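To illustrate why this works as a character variable, here is a minimal sketch using `haven::labelled()` on a few made-up values. It assumes a current version of haven, where the class is `haven_labelled` rather than the older `labelled` shown in the output above:

```r
library(haven)

# A few illustrative values, including the non-numeric code "M"
titlei_status <- labelled(
  c("-9", "5", "M", "6"),
  labels = c(
    "Title I schoolwide school" = "5",
    "Not a Title I school" = "6",
    "Suppressed" = "-9",
    "Missing" = "M"
  )
)

# The underlying storage stays character, so "M" and "-9" can coexist
typeof(titlei_status)

# The value labels are attached as an attribute, keyed by the character codes
attr(titlei_status, "labels")
```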

As for item #4 (ACS), I agree with you that only a couple of variables deserve to be class == labelled or class == factor. I don't think it is worthwhile to do so for state_code and state_fips. I just looked through zip_to_states.csv and it looks like no variables are worth making one of these two classes. that said, is there metadata for variable labels that would be worth adding to this dataset? if it would take you more than 15 minutes to do this, then don't do it.

final thing, in this folder C:\Users\ozanj\Dropbox\recruiting-m1\analysis\data, I just noticed this file hs_data_forRclass.csv. did you create this file?

sorry for all the questions! thank you! ozan.

ozanj commented 5 years ago

@cyouh95

oh, I forgot to @ you in the last post. please see post above!

cyouh95 commented 5 years ago

@ozanj I didn't change the original .RData yet (for item 3) - so won't be a problem to just remove that part from the R script!

For 4, I think adding variable labels shouldn't take too long. I could probably just copy the official names from here?

And nope, I didn't add that file in Dropbox... perhaps Karina did?

cyouh95 commented 5 years ago

@ozanj Here are the variable labels for zip_to_states.csv data! [x] Haven't saved it anywhere yet, not sure if there's anything else to modify/add first?

ozanj commented 5 years ago

@cyouh95

Thanks Crystal! just save it to the recruiting-m1 dropbox folder. maybe change the name of the one currently there to add a "_" at the end.

appreciate all your help!

cyouh95 commented 5 years ago

@ozanj No problem! To double check, this would need to be saved as .RData (not flat CSV file) - because of the labels?

ozanj commented 5 years ago

@cyouh95 on second thought, if you haven't done anything to ACS yet, then don't bother. I'll just add some variable labels within the pipe that reads the data in the problem set script.
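A rough sketch of what adding variable labels inside the pipe could look like, assuming `set_variable_labels()` from the labelled package and the native pipe (R 4.1 or later). The column names and label text are hypothetical, and a toy data frame stands in for the CSV that the problem set would read in:

```r
library(labelled)

# Toy stand-in for the zip-code level data; column names are assumptions
zip_data <- data.frame(
  state_code = c("CA", "AR"),
  pop_total = c(1000, 2000),
  median_income = c(50000, 40000)
)

# Attach a variable label to each column as part of the pipeline
zip_data <- zip_data |>
  set_variable_labels(
    state_code = "State abbreviation",
    pop_total = "Total population",
    median_income = "Median household income"
  )

# Look up the label attached to one column
var_label(zip_data$pop_total)
```

Because the labels are attached at read time, the data can stay as a flat CSV and nothing needs to be saved as .RData.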

thanks for all your help w/ this!

cyouh95 commented 5 years ago

@ozanj Np! If it's useful, labelled variable names data is also here.

ozanj commented 5 years ago

@cyouh95 sorry that you took the time to do this. but having the labels is great. I just ended up putting all that code in my problem set. thank you Crystal!