Open tamscenterdata opened 5 years ago
Thanks for sharing Melinda. Cleaning-up text is always a fun adventure. :smile_cat:
When column names get real scary I sometimes choose to set them manually. My favorite function for that is rename()
. It works a lot like mutate, you put the new column name on the left and the original column name that you're replacing on the right. Like this: rename(new_name = old_name)
library(readxl) # package to read an Excel file
library(dplyr)
# Load the data into a dataframe called air_data
air_data <- read_excel("La_jolla_pm25_wind_data.xls")
# Check the column names
names(air_data)
[1] "Date"
[2] "Time"
[3] "PM2.5 Conc (ug/m3)"
[4] "Wind Speed (mph)"
[5] "Wind Direction (Degrees)"
rename()
# Set column names with rename() from dplyr
air_data <- rename(air_data, pm25_ugm3 = "PM2.5 Conc (ug/m3)",
wind_spd_mph = "Wind Speed (mph)",
wind_dir_deg = "Wind Direction (Degrees)")
# Check new column names
names(air_data)
[1] "Date"
[2] "Time"
[3] "pm25_ugm3"
[4] "wind_spd_mph"
[5] "wind_dir_deg"
this script fixes all the problems with headers in the attached file and writes a new csv file to the working directory with the new headers. There are a ton of explanatory comments, because i needed to remember what each line does, and thought it might be helpful to you as well. next, cleaning the data! send comments!
la_jolla_pm25_wind_data.xlsx
copy and use any of the following to clean messy headers (names)
depending on which ones your data needs cleaned up
if you edit some rows out, just don't end on a pipe: %>%
remember that when installing a package it is in quotes
but to load into library of this project use no quotes
run each command at a time and record in the console below, by
entering cntrol-enter anywhere in each line, or group of selected lines
if you see warnings in red in the console that is often totally ok
when you see a right-pointing blue caret in the console, then
R is done and ready for more
1st load the base r library
install.packages("readr") # if you have already installed R will restart
which is fine
library(readr) # this is all you need if you already
did installation to your computer
figure out where the working directory is
getwd()
the working directory comes back as my desktop, so i want to change that
i copied the directory path from internet explorer
but it comes back w backslashes that i need to change
i used the Edit menu above to Replace and Find with
just this text selected to become
C:/Users/melinda/Dropbox (NAU)/MelindasFoldersDoNotTouch/MelindasFolders/R training with Helen/WindRose")
now i can set the working directory by
setwd("C:/Users/melinda/Dropbox (NAU)/MelindasFoldersDoNotTouch/MelindasFolders/R training with Helen/WindRose")
and i go up a few lines and run get() to verify it is in the folder i am working from
next load the packages we need to read an excel file and change headers
library(readxl)
load the data into a dataframe called air_data
i get a red error msg but then the blue caret so we can ignore it
air_data <- read_excel("La_jolla_pm25_wind_data.xls")
check the names as below
names(air_data)
i see a lot to fix in my headers (names), as follows
old_names <- names(air_data) # make object of the old names old_names # R automatically appended a _1 to the
repeated Date name (column header)
this new object shows up in the environment panel in upper right
note that it added backslashes \ to before the quotes
to signify that the dbl quote is part of the string and not an R operator
next we need to load dplyr to use a pipe, which is like a function, and
can be thought of as: (x %>% f) means the same as
f is a function of x, like input-x-goes-to-output which is a
temporary object in this environment that R can use
install.packages("dplyr") library(dplyr)
clean the headers (names) so they do not contain
old_names <- names(air_data) # make object of the old names new_names <- old_names %>% # make object of the new names to clean up gsub("'", "", .) %>% # remove single quotation marks gsub("\"", "", .) %>% # remove double quotation marks gsub("%", "percent", .) %>% # change the symbol % to the word percent gsub("^[ ]+", "", .) %>% # this finds any plus signs and removes them
the ^ means at the beginning of the string
make.names(.) %>% gsub("[.]+", "", .) %>% # convert 1 or more periods to a single gsub("[]+", "", .) %>% # fix multiple consecutive underscores tolower(.) # make all letter lower case
see what that did anytime by looking at new_names
new_names
next, check for column names that are not unique
The next 8 rows adds the column number to any columns with the same name
not a big deal unless data has many dozens of columns
R actually did this already in the names(air_data) in row 40
but in case that did not happen, rows 79-86 will add _i to column name
dupe_count <- vapply(seq_along(new_names), function(i) { sum(new_names[i] == new_names[1:i]) }, integer(1)) new_names[dupe_count > 1] <- paste( new_names[dupe_count > 1], dupe_count[dupecount > 1], sep = "" )
by clicking new_names in the environment pane
i see the last space turned into an underscore so remove that
ideally just add the next two lines above before the ) but can do it step-wise
with another object now we call new_names_1
new_names_1 <- newnames %>%
gsub("$", "", .) # remove string-final underscores
now see what that did
new_names_1 air_data_1 <- setNames(air_data, new_names_1) #insert the new names into air_data
and create a new air_data_1 file
head(air_data_1) # to see the first 6 rows
and to see the whole new file in a new tab in this editor pane
click on air_data_1 in the upper right environment pane
Save data as a CSV file to your working directory
write_csv(air_data_1, "air_data_1.csv")