jrnold / r4ds-exercise-solutions

Exercise solutions to "R for Data Science"
https://jrnold.github.io/r4ds-exercise-solutions
Creative Commons Attribution 4.0 International

Exercise 5.7.1.6 typo/bad sentence #127

Closed JamesCuster closed 5 years ago

JamesCuster commented 5 years ago

Hey, when I was looking at your fix for the previous issue on this same question, I noticed that the sentence underneath the second code block is also incomplete. Sorry I didn't catch it the first time. The sentence is: "Now that we’ve identified potentially bad observations, we would to distinguish between the real problems and"

jrnold commented 5 years ago

Thanks! There are probably a few of those half-completed sentences, since I switched between writing the code and trying to explain it. The answer to that question was particularly difficult since it is fairly open-ended, and there are many ways to approach it. Does the current solution make sense? Do you have any suggestions on how to improve it?

JamesCuster commented 5 years ago

Your answer makes sense to me, but I am from a stats background. If I were working on this on my own, I might have used the 1.5*IQR rule. I can see two things being problematic with this answer. First, this approach might be over some people's heads: this is an introductory book, so readers might not be familiar with what you are doing. Also, when considering this approach, what would be your cutoff for a z-score that you would consider too large or too small? ±1.96?

Second, the second part of the question says to compare the length of flights with the shortest flight to that destination, and you didn't really do that. However, I think this is a terrible question. In the first part you are saying, "Hey, these short flights might be errors," but then you turn around and use them to determine which flights were extra long?

Haha, so basically that is my long-winded way of saying that I like what you have done, or the IQR approach.
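
For readers unfamiliar with the 1.5*IQR rule mentioned above, here is a minimal sketch of it. It is written in Python with the standard library rather than the book's R, the air times are made up for illustration, and `iqr_fences` is a hypothetical helper name. It is not the code from the solution.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # "inclusive" treats the data as the whole population of interest;
    # other quantile definitions give slightly different quartiles.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up air times in minutes for one hypothetical route (not real data).
air_times = [118, 120, 121, 122, 123, 125, 126, 128, 60, 190]

lower, upper = iqr_fences(air_times)

# Flag anything outside the fences as a candidate for a closer look.
flagged = [t for t in air_times if t < lower or t > upper]
```

A flagged value is only a candidate for inspection; whether it is an error or a real but unusual flight still has to be checked against the data.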

jrnold commented 5 years ago

In this case, I don't think it's worth using any particular cutoff. I am just using the z-scores as a way to standardize variables. Doing something similar with median and IQR would also work, which is why I mentioned it later. It is probably a preferable method. But, since this chapter is about data transformation, I wanted to keep it simple and avoid any explicit statistical models or discussing anomaly/novelty/outlier detection. I avoided any sort of cutoffs or tests, because I'd rather emphasize the importance of domain knowledge. What is unusual is defined by an understanding of the usual system. It's pretty cool that we have ways to leverage the distribution of the data itself to help identify unusual points, but it's better to use some domain knowledge and/or check the data in these cases.
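
As a rough illustration of the two standardizations discussed above, the following sketch computes both the mean/standard-deviation version and the median/IQR version. It uses Python's standard library rather than the book's R, the air times are invented, and `z_scores` and `robust_scores` are hypothetical names, so treat it as a sketch under those assumptions and not as the solution's code.

```python
import statistics

def z_scores(values):
    """Standardize with the mean and the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def robust_scores(values):
    """Standardize with the median and the interquartile range."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Note: this divides by zero if the IQR is 0 (e.g. many identical values).
    return [(v - median) / (q3 - q1) for v in values]

# Made-up air times in minutes for one hypothetical route (not real data).
air_times = [118, 120, 121, 122, 123, 125, 126, 128, 60, 190]

z = z_scores(air_times)
r = robust_scores(air_times)

# No cutoff is applied: the scores are only used to rank observations,
# so the most unusual ones can be inspected by hand.
most_unusual_first = sorted(zip(air_times, z, r), key=lambda row: abs(row[2]), reverse=True)
```

Because the median and IQR are less affected by the extreme values themselves, the second version tends to separate unusual points from the bulk of the data more clearly.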

jrnold commented 5 years ago

I do need to add that comparison to the shortest flight back.
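
One way that comparison could look is sketched below, in Python rather than the book's R. The `(dest, air_time)` records are invented and `relative_to_shortest` is a hypothetical helper; this is not the code that was added to the solution.

```python
def relative_to_shortest(records):
    """For each (dest, air_time) record, compare air_time with the
    shortest air_time observed for the same destination."""
    shortest = {}
    for dest, air_time in records:
        shortest[dest] = min(air_time, shortest.get(dest, air_time))
    return [
        # (destination, air time, minutes over shortest, ratio to shortest)
        (dest, air_time, air_time - shortest[dest], air_time / shortest[dest])
        for dest, air_time in records
    ]

# Made-up records: destination code and air time in minutes (not real data).
flights = [
    ("ATL", 95), ("ATL", 110), ("ATL", 176),
    ("BOS", 35), ("BOS", 40), ("BOS", 21),
]

compared = relative_to_shortest(flights)

# Sort so the flights furthest above their destination's minimum come first.
longest_relative = sorted(compared, key=lambda row: row[3], reverse=True)
```

As noted earlier in the thread, this comparison inherits any problems with the minimum: if the shortest flight to a destination is itself a data error, every other flight to that destination looks longer than it should.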