Open ozanj opened 5 years ago
@cyouh95
This is wonderful! thank you so much Crystal!!!!
I pulled the changes to my rclass github repository.
I could open the .Rdata you created. but I could not run your save_ccd_pss_data.R script.
Here is the error I got when I tried to run the first substantive line of code [line 7]:
characteristics <- read_sas('data/ccd/2015-16/ccd_1516_characteristics.sas7bdat')
Error: 'data/ccd/2015-16/ccd_1516_characteristics.sas7bdat' does not exist in current working directory ('C:/Users/ozanj/Documents/rclass').
when I went to that directory I only saw zip files for the data, but not the unzipped files. is this the problem? I want patricia to be able to run this script without errors. seems like best approach would be to leave data as is and insert an unzip function in the script? what do you think?
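A minimal sketch of the unzip-in-script idea, using only base R. The helper name `unzip_if_missing` and the zip file name in the example call are assumptions for illustration; only the `.sas7bdat` path comes from the error message above. It also assumes the archive holds the data file at its top level.

```r
# Hypothetical helper (not part of save_ccd_pss_data.R): unzip an archive
# only when the file we want to read is not already on disk.
unzip_if_missing <- function(zip_path, target_file, exdir = dirname(target_file)) {
  if (!file.exists(target_file)) {
    # utils::unzip() ships with base R; exdir is where the contents are extracted
    utils::unzip(zip_path, exdir = exdir)
  }
  file.exists(target_file)  # TRUE if the target is available afterwards
}

# Example call. The zip file name is an assumption; adjust to the real archive:
# unzip_if_missing('data/ccd/2015-16/ccd_1516_characteristics.zip',
#                  'data/ccd/2015-16/ccd_1516_characteristics.sas7bdat')
# characteristics <- read_sas('data/ccd/2015-16/ccd_1516_characteristics.sas7bdat')
```

Checking for the unzipped file first means the script can be re-run without extracting the archive every time.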
@cyouh95
second question. I was looking at the data frames ccd and pss.
most of the categorical variables in pss have the "labelled" class. by contrast, most of the categorical variables in the data frame ccd have character type/class. what is the reason for this?
would like the two data sources to be broadly consistent on this front. what is your recommendation for how/what to change?
@ozanj Ah yup, the data files were originally unzipped, but when I tried to push it to Github, I got an error bc some of the files were too large, so I had to zip them up. But if it's possible to have the script unzip the file (or maybe somehow read in the zipped file?), I think that would be a good idea! I can try to look into this. (in the meantime, hopefully the R script works if you unzip the files locally!)
The only variables converted to labelled class were the ones in the SAS files (13 vars for ccd; 194 vars for pss) - would there be any other vars that need to be labelled besides those? (ie. what would the attached label for the other categorical vars be?)
Thank you Crystal!
first, would it be possible to modify the script to unzip the files as part of the script?
second, in either the haven or the labelled package, I believe there is a function to convert factor class variables to labelled class variables. the function is called to_labelled() https://cran.r-project.org/web/packages/labelled/labelled.pdf
would you be willing to convert all factor variables to labelled for both pss and ccd?
third, can you convert ccd variable names to lowercase prior to saving to .Rdata? I can do this easily on my own, but don't want students to have to do this within the problem set.
fourth, is it possible to create a version of the zip-code level ACS data with characteristics on zip code that has variable labels and value labels? I believe this is the file: C:\Users\ozanj\Dropbox\recruiting-m1\analysis\data\zip_to_state
I realize that our "production database" doesn't have variable labels and value labels, so let me know how much work this would take before you undertake the work. I know you have a lot on your plate already.
sorry to dump so much on you Crystal.
Best wishes. ozan.
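For the third item (lowercase variable names before saving), a base R sketch could look like the following. The toy data frame and its column names are placeholders, not the real CCD columns, and the output path is a temporary file rather than the project's actual .Rdata path.

```r
# Toy stand-in for the ccd data frame; column names are placeholders
ccd <- data.frame(
  SCHOOL_ID = c("A1", "B2"),
  SCH_NAME  = c("Example School 1", "Example School 2"),
  stringsAsFactors = FALSE
)

# Convert all variable names to lowercase; column contents and any
# attributes attached to the columns are left unchanged
names(ccd) <- tolower(names(ccd))

# Save after renaming so students load the lowercase version directly
rdata_path <- tempfile(fileext = ".RData")
save(ccd, file = rdata_path)
```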
@ozanj No problem! Just added items 1 & 3 in commit above.
Not quite sure about 2... I don't think any of the variables are read in as factors (either character or numeric), so would we need to manually pick out which categorical variables we want to turn into factor and label those?
Data currently looks something like this:
stabr is already labelled bc it was in the SAS format file (so AR is labelled w/ Arkansas, CA is labelled w/ California, etc.) Do we then also want to make statename (which is currently character) into factor, then convert to labelled too?
If I tried doing that using to_labelled() function here, the data now looks like:
where the underlying var is now 5, 7, etc, but attached labels would say ARKANSAS, CALIFORNIA. Would this be what we want?
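To see why the underlying values turn into integers, here is a small base R sketch of what happens when a character variable goes through a factor. It does not call to_labelled() itself; it only shows the factor codes that such a conversion would presumably carry over, which is an assumption consistent with the 5, 7 values described above. The state names are a toy subset.

```r
# Toy character vector standing in for ccd$statename
statename <- c("ARKANSAS", "CALIFORNIA", "ARKANSAS", "ALABAMA")

# Converting to factor sorts the unique values into levels
f <- factor(statename)

# The stored values are integer positions within the sorted levels,
# not the original strings
codes <- as.integer(f)

# Value labels in the "label = code" layout that labelled variables use
value_labels <- setNames(seq_along(levels(f)), levels(f))
```

The original strings survive only as labels, so any code that filters on `statename == "ARKANSAS"` would need to change after the conversion.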
For item 4, just took a look at the zip_to_states.csv - for value labels, there don't seem to be many categorical variables to label I think. Maybe we could do state_code, state_fips, etc - but the other variables can't really be labelled, could they? (ie. total pop, median income, etc)
Sorry for all the questions!
Thank you Crystal!
your questions are good ones. and after looking through the data more deeply I see that my advice was not good!
I think the ccd data are weird. for example, why is this variable a string variable rather than numeric?
str(ccd$titlei_status)
 'labelled' chr [1:102401] "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" "-9" ...
attributes(ccd$titlei_status)
$label
[1] "TITLE I status (code)"
$format.sas
[1] "$CHAR"
$labels
  Title I targeted assistance eligible school-No program = "1"
  Title I targeted assistance school = "2"
  Title I schoolwide eligible-Title I targeted assistance program = "3"
  Title I schoolwide eligible school-No program = "4"
  Title I schoolwide school = "5"
  Not a Title I school = "6"
  Suppressed = "-9"
  Missing = "M"
with that said, I see that one of the values is "M" so that is why it is a character variable. I agree with your decision that it is not a good idea to use the to_labelled() function to change vars to labelled class. so if you have done that, then please undo it. sorry and thanks!
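The effect of the "M" code can be checked with a few lines of base R. The vector below is a toy sample of the codes listed in the labels above, not the real column.

```r
# Toy sample of titlei_status codes taken from the value labels above
titlei_status <- c("1", "5", "-9", "M")

# Coercing to numeric turns any non-numeric code into NA (with a warning,
# suppressed here), so "M" could not be told apart from a true NA
as_num <- suppressWarnings(as.numeric(titlei_status))

# Which codes were lost in the conversion?
lost <- titlei_status[is.na(as_num)]
```

Keeping the variable as character is what lets "M" remain a distinct, labelled value.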
As for item #4 (ACS), I agree with you that only a couple of variables deserve to be class == labelled or class == factor. I don't think it is worthwhile to do so for state_code and state_fips. I just looked through zip_to_states.csv and it looks like no variables are worth making one of these two classes. that said, is there meta-data for variable labels that it would be worth adding to this dataset? if it would take you more than 15 minutes to do this, then don't do it.
final thing, in this folder C:\Users\ozanj\Dropbox\recruiting-m1\analysis\data, I just noticed this file hs_data_forRclass.csv. did you create this file?
sorry for all the questions! thank you! ozan.
@cyouh95
oh, I forgot to @ you in the last post. please see post above!
@ozanj I didn't change the original .RData yet (for item 3) - so won't be a problem to just remove that part from the R script!
For 4, I think adding variable labels shouldn't take too long. I could probably just copy the official names from here?
And nope, I didn't add that file in Dropbox... perhaps Karina did?
@cyouh95
Thanks Crystal! just save it to the recruiting-m1 dropbox folder. maybe change the name of the one currently there to add a "_" at the end.
appreciate all your help!
@ozanj No problem! To double check, this would need to be saved as .RData (not flat CSV file) - because of the labels?
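A quick base R check of the .RData versus CSV question. The data frame, its column names, and its values are made up for illustration; the point is only what happens to a variable label stored as an attribute.

```r
# Made-up stand-in for the zip-code level data (values are placeholders)
acs <- data.frame(zip_code = c("00001", "00002"), median_income = c(1, 2))

# Attach a variable label as an attribute on the column
attr(acs$median_income, "label") <- "Median household income"

csv_path   <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".RData")
write.csv(acs, csv_path, row.names = FALSE)
save(acs, file = rdata_path)

# Read both versions back in
from_csv <- read.csv(csv_path)
env <- new.env()
load(rdata_path, envir = env)

csv_label   <- attr(from_csv$median_income, "label")  # label is gone
rdata_label <- attr(env$acs$median_income, "label")   # label is kept
```

A CSV stores only the cell values, so attributes such as labels have to be re-attached after reading, while .RData keeps the R object as it was.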
@cyouh95 on second thought, if you haven't done anything to ACS yet, then don't bother. I'll just add some variable labels within the pipe that reads the data in the problem set script.
thanks for all your help w/ this!
@cyouh95 sorry that you took the time to do this. but having the labels is great. I just ended up putting all that code in my problem set. thank you Crystal!
@cyouh95
Hi Crystal,
I have a data request for you for my R class. sorry to add to your already very full plate!
I'm beginning to create a problem set for my R class about merging data sources.
The problem set will use events_data.csv from C:\Users\ozanj\Dropbox\recruiting-m1\analysis\data but will only keep variables about the events.
Then I want students to merge in public school data from CCD and private school data from PSS. But I want them to merge in R-versions of the datasets found on the NCES website:
for public schools: https://nces.ed.gov/ccd/pubschuniv.asp
for private schools: https://nces.ed.gov/surveys/pss/pssdata.asp
The data can be downloaded in csv format, SAS datasets, or SPSS datasets. I used the haven package to read in the SAS version of one of the 2015-16 CCD datasets. that worked fine, but value labels were not retained so students would have no idea what the different values on variables mean. There is a separate SAS file (and SPSS file) that creates value labels. I'm not sure how to run the SAS/SPSS code from R, but I think it is possible. basically, I want variables with value labels to have class=labelled and to have both variable labels and value labels.
Can you create .Rdata files that meet these specifications?
I want this for all 2015-16 school-level CCD files and for 2015-16 PSS file.
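As a rough sketch of the target described above (a variable that carries both a variable label and value labels), the snippet below attaches labels to a toy character vector. It uses `haven::labelled()` when the haven package is installed and otherwise falls back to plain attributes that only mimic the same layout. The codes and labels are a subset of those reported for titlei_status elsewhere in this thread; applying this to the real NCES files, and how the labels would be pulled from the separate SAS/SPSS label code, is not shown and would need to be worked out.

```r
# Toy values; the real column comes from the CCD file
x <- c("1", "6", "-9", "M")

# Value labels in "label = code" form (subset of the titlei_status labels)
value_labels <- c(
  "Title I targeted assistance eligible school-No program" = "1",
  "Not a Title I school" = "6",
  "Suppressed" = "-9",
  "Missing" = "M"
)

if (requireNamespace("haven", quietly = TRUE)) {
  # haven::labelled() requires labels of the same type as x (character here)
  titlei <- haven::labelled(x, labels = value_labels,
                            label = "TITLE I status (code)")
} else {
  # Fallback: mimics the attribute layout only, without the haven class
  titlei <- structure(x, labels = value_labels,
                      label = "TITLE I status (code)")
}

attr(titlei, "label")   # variable label
attr(titlei, "labels")  # value labels
```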
I may ask you to do this for some small snippet of ACS data too. students really don't like working with variables that don't have variable labels and most of our data isn't labelled. but don't do anything with ACS data yet.
sorry for the extra request, but need help w/ this problem set so that I can get to writing the recruiting paper.
best wishes. ozan.