Open jessicasmiller opened 7 years ago
If you just do summary(), R will tell you (among other things) the min and max values (as well as the number of NA values, if any). (Here I'm using summary() just for the mpg column in the built-in mtcars data set; summary(mtcars) will give you the summaries for every column.)
summary(mtcars$mpg)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 10.40 15.42 19.20 20.09 22.80 33.90
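mtcars has no missing values, so the NA count doesn't show up in the output above. As a small sketch, using a made-up vector x (not part of the original example) with one NA tacked on, you can see where summary() reports it; range() is a quick alternative if all you want is the min and max:

```r
# x is a hypothetical vector: the mpg column plus one missing value
x <- c(mtcars$mpg, NA)

# summary() now includes an "NA's" entry counting the missing values
summary(x)

# range() returns just the min and max; na.rm = TRUE skips missing values
range(x, na.rm = TRUE)
```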
If you have a range of values in mind, you can use filter() to select just the rows that are outside this range: in this case I'm going to look for values outside the range (12,32).
library(dplyr)
# keep only the rows whose mpg falls outside the range (12,32)
badrows <- (mtcars %>%
    filter(mpg < 12 | mpg > 32)
)
(I'm putting parentheses around the whole expression here so Jonathan doesn't yell at me)
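If you'd rather flag the suspicious rows than pull them out into a separate data set, one possible approach (a sketch, not the only way to do it) is to add a logical column with mutate(). The names lower, upper, flagged and out_of_range are made up for this illustration, and the bounds are the same (12,32) range as above:

```r
library(dplyr)

# Hypothetical bounds; pick these from what you know about your own data
lower <- 12
upper <- 32

# Keep every row, but add a TRUE/FALSE column marking out-of-range values
flagged <- mtcars %>%
    mutate(out_of_range = mpg < lower | mpg > upper)

# How many rows were flagged?
table(flagged$out_of_range)
```

All of the original rows are still there, so you can inspect the flagged ones before deciding what (if anything) to do with them.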
I can look at this filtered data set by clicking on the little spreadsheety-looking icon in the Data window in RStudio.
If I'm working in the console and want to look only at a few columns, I could quickly select() a few:
(badrows %>%
select(mpg,cyl,disp))
# mpg cyl disp
# 1 10.4 8 472.0
# 2 10.4 8 460.0
# 3 32.4 4 78.7
# 4 33.9 4 71.1
In fact, the view of the data that I get in RStudio actually has a Filter button that I can use to do this interactively ...
If I want to get rid of rows, I can use filter() in the opposite sense:
goodrows <- (mtcars %>% filter(mpg >= 12 & mpg <= 32))
However, I/we do want to caution you very strongly that you always need a good reason to exclude data: you should never exclude data automatically; you need to use human judgement to establish that data points should be excluded.
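If you decide that masking is more appropriate than deleting rows, a rough sketch is to copy the column and set the out-of-range values to NA, leaving the original values in place so the decision can be revisited later. The names masked and mpg_masked are hypothetical, and the (12,32) range is the same one used above:

```r
library(dplyr)

# Copy mpg into a new column, replacing out-of-range values with NA
masked <- mtcars %>%
    mutate(mpg_masked = ifelse(mpg >= 12 & mpg <= 32, mpg, NA))

# summary() reports how many values were masked (the NA's entry)
summary(masked$mpg_masked)

# Most summary functions need na.rm = TRUE to skip the masked values
mean(masked$mpg_masked, na.rm = TRUE)
```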
Hi Ben,
Thank you, this is a very thorough and helpful answer! I’ll let you know in class if I run into any trouble with your instructions.
Thanks again! Jess
Jessica Miller MSc candidate
Aquatic Behavioural Ecology Lab (ABEL) Department of Psychology, Neuroscience & Behaviour McMaster University Hamilton, ON L8S 4K1
Phone: 905 525-9140 ext 26037 Fax: 905 529-6225
On Jan 16, 2017, at 4:50 PM, Ben Bolker notifications@github.com wrote:
Ben mentioned in class today that when examining your data, there are ways to get R to flag any data values that it identifies as outliers or that you designate as outside an expected value range. Are there any resources that elaborate on that? I think it would be really helpful to be able to flag and then remove or mask values that you’ve identified as outliers.