mac-theobio / QMEE

Back end for McMaster Bio 708 (Quantitative Methods in Ecology and Evolution)
https://mac-theobio.github.io/QMEE/index.html

Flagging outliers in R #2

Open jessicasmiller opened 7 years ago

jessicasmiller commented 7 years ago

Ben mentioned in class today that when examining your data, there are ways to get R to flag any data values that it identifies as outliers or that you designate as outside an expected value range. Are there any resources that elaborate on that? I think it would be really helpful to be able to flag and then remove or mask values that you’ve identified as outliers.

bbolker commented 7 years ago

If you just do summary(), R will tell you (among other things) the min and max values (as well as the number of NA values, if any). (Here I'm using summary just for the mpg column in the built-in mtcars data set; summary(mtcars) will give you the summaries for every column)

summary(mtcars$mpg)
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 10.40   15.42   19.20   20.09   22.80   33.90 
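To see what the NA count mentioned above looks like, here is a minimal sketch. The vector `mpg_with_na` is hypothetical (the real `mpg` column has no missing values); it just appends one `NA` so that `summary()` has something to report.

```r
# Hypothetical example: append one missing value to the mpg column
mpg_with_na <- c(mtcars$mpg, NA)

# summary() adds an "NA's" entry when missing values are present
s <- summary(mpg_with_na)
print(s)

# range() gives just the min and max; it needs na.rm = TRUE to skip NAs
mpg_range <- range(mpg_with_na, na.rm = TRUE)
print(mpg_range)
```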

If you have a range of values in mind, you can use filter() to select just the rows that fall outside it; in this case I'm going to look for mpg values outside the range (12,32).

library(dplyr)
badrows <- (mtcars %>%
  filter(mpg<12 | mpg>32)
)

(I'm putting parentheses around the whole expression here so Jonathan doesn't yell at me)
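If you would rather flag rows than pull them out, one possible approach (a sketch, not something from class) is to add a logical column with mutate(). The column name `out_of_range` is made up for illustration, and the limits are the same illustrative (12,32) range.

```r
library(dplyr)

# Add a TRUE/FALSE flag column instead of dropping rows;
# "out_of_range" is a hypothetical column name
flagged <- (mtcars %>%
  mutate(out_of_range = mpg < 12 | mpg > 32)
)

# Count how many rows were flagged
n_flagged <- sum(flagged$out_of_range)
print(n_flagged)
```

All rows stay in `flagged`, so you can inspect the flagged ones later and decide what to do with them.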

In fact, the view of the data that I get in RStudio actually has a Filter button that I can use to do this interactively ...

[Screenshot: RStudio data viewer, 2017-01-16]

If I want to get rid of rows, I can use filter() in the opposite sense:

goodrows <- (mtcars %>% filter(mpg>=12 & mpg<=32))
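The original question also asked about masking values rather than removing rows. A minimal base-R sketch, assuming the same illustrative (12,32) limits, is to set the out-of-range values to NA in a copy of the data (the name `masked` is hypothetical):

```r
# Work on a copy so the original data stay untouched
masked <- mtcars

# Replace out-of-range mpg values with NA; the rows themselves are kept
masked$mpg[masked$mpg < 12 | masked$mpg > 32] <- NA

# Many functions then need na.rm = TRUE to skip the masked values
masked_mean <- mean(masked$mpg, na.rm = TRUE)
print(masked_mean)
```

This keeps the other columns of those rows available for analyses that do not involve mpg.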

However, I/we want to caution you very strongly that you always need a good reason to exclude data. Never exclude data automatically; use human judgement to establish that particular data points should be excluded.
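If you don't have a range in mind, R can still point out candidates for you to look at. The sketch below uses the 1.5 × IQR convention familiar from boxplots; that choice is an assumption on my part (it is only one of several conventions), and the result is a list of rows to inspect, not a list of rows to delete.

```r
mpg <- mtcars$mpg

# Fences at 1.5 * IQR beyond the quartiles (a common convention, assumed here)
q <- unname(quantile(mpg, c(0.25, 0.75)))
fence <- 1.5 * IQR(mpg)
candidate <- mpg < q[1] - fence | mpg > q[2] + fence

# Look at the flagged rows before deciding anything
print(mtcars[candidate, c("mpg", "cyl", "disp")])
```

Note that boxplot() computes its quartiles slightly differently (it uses hinges), so it will not necessarily mark the same points.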

jessicasmiller commented 7 years ago

Hi Ben,

Thank you, this is a very thorough and helpful answer! I’ll let you know in class if I run into any trouble with your instructions.

Thanks again! Jess


Jessica Miller MSc candidate

Aquatic Behavioural Ecology Lab (ABEL) Department of Psychology, Neuroscience & Behaviour McMaster University Hamilton, ON L8S 4K1

On Jan 16, 2017, at 4:50 PM, Ben Bolker notifications@github.com wrote:

I can look at this filtered data set by clicking on the little spreadsheety-looking icon in the Data window in RStudio.

If I'm working in the console and want to look only at a few columns, I could quickly select() a few:

(badrows %>% select(mpg,cyl,disp))
#    mpg cyl  disp
# 1 10.4   8 472.0
# 2 10.4   8 460.0
# 3 32.4   4  78.7
# 4 33.9   4  71.1
