swcarpentry / DEPRECATED-bc

DEPRECATED: This repository is now frozen - please see individual lesson repositories.

R motivational slides #614

Closed sritchie73 closed 9 years ago

sritchie73 commented 10 years ago

Created some motivational slides for why you should learn and use R. Issues that need solving:

rgaiacs commented 10 years ago

What about mentioning that R can be mixed with LaTeX or Markdown?

jdblischak commented 10 years ago

+1 to @r-gaia-cs 's suggestion to highlight the ability to create nice, well-documented reports from your analyses.

These slides do a good job on the long-term motivations for using R: no matter how complicated a problem you encounter in the future, you will likely not have to start from scratch, and you can also extend R yourself relatively easily. But what about the short term for those completely new to programming? I would argue that if you have a tabular data set, R is the quickest option to start learning about your data as you learn the language. Consider this small example:

# header = TRUE so the column names used below come from the file's first line
my_dat <- read.table("data.txt", header = TRUE)
summary(my_dat)
boxplot(continuous_var ~ categorical_var, data = my_dat)

This is so empowering because you can get to this level after only a little bit of learning R. Most other languages have a much steeper learning curve. And even in Python, you would first have to import pandas and matplotlib and be familiar with calling methods with dot notation before accomplishing something similar.
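For anyone who wants to run the example above without a data.txt on hand, here is a minimal self-contained sketch. It assumes the built-in ToothGrowth dataset as a stand-in: the temporary file and the column names len and supp are illustrative choices, not part of the original example.

```r
# Stand-in for data.txt: write a built-in dataset to a temporary file.
# ToothGrowth and the temp file are assumptions made for illustration.
tmp <- tempfile(fileext = ".txt")
write.table(ToothGrowth, tmp, row.names = FALSE)

# header = TRUE takes the column names from the first line of the file
my_dat <- read.table(tmp, header = TRUE)
summary(my_dat)

# One box of the continuous variable (len) per level of the
# categorical variable (supp)
boxplot(len ~ supp, data = my_dat)
```

The three calls that matter are still read.table(), summary() and boxplot(); the first two lines only exist to make the sketch runnable anywhere.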

benmarwick commented 10 years ago

This looks wonderful, concise and covers all the main selling points. I have two suggestions for additions, perhaps just two bullet points somewhere on the slides:

First, I'd also add somewhere that R has a very large community of users who are generally very helpful, and mention some of the bigger and better sources of free online information like http://stackoverflow.com/questions/tagged/r, http://www.statmethods.net/, http://www.twotorials.com/ and so on. And package authors are generally willing to help users with their packages. This is a very important detail because people who have grown up with a commercial stats package will be anxious about switching to R and not having a help line that they are entitled to call because of their licensing fees. They will want to know that help is available for R, but it's not in the form that they might be used to!

Second, I'd make a brief mention of how R improves the reproducibility and transparency of research. People using point-and-click stats packages are typically not very aware of this issue (because of how hard it is to do reproducible research with a point-and-click interface), and as a script-driven environment R gives it to them for free. Using R, a researcher can script an analysis that they can run over and over with different data and in different projects, and give it to someone else (e.g. students) to use and verify. They can also publish their code online for others to inspect and validate their analyses, and so on. Since the target audience here is mostly non-programmers, this benefit of openness that we get from a stats package based on scripting is likely to make quite an impression. This might be a good topic to connect to LaTeX and Markdown as @jdblischak and @r-gaia-cs suggest.

sritchie73 commented 10 years ago

All worth mentioning, I didn't think of adding those points: I was trying to identify things that R has that Python doesn't.

sritchie73 commented 10 years ago

A couple of counter arguments:

But I agree with you, definitely the easiest and fastest language for plotting otherwise!

gvwilson commented 10 years ago

It's fine to have arguments pro and con in motivational slides - if you're the first person to point out something's limits, it makes the rest of what you say more believable.

cboettig commented 10 years ago

Looks great. Listing some cons is wise. I'd just say the syntax is more challenging/frustrating than most (so that users don't get too discouraged when they struggle with the use of ~ or y <- x[["a"]]$b.v[1]).
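As a rough illustration of why an expression like the one above trips people up, here is a sketch that unpacks it one operator at a time. The list x and its contents are made up for this example; only the access pattern comes from the comment.

```r
# Hypothetical nested structure: a list whose element "a" is a
# data frame with a column named b.v
x <- list(a = data.frame(b.v = c(10, 20, 30)))

# The one-liner from the comment above
y <- x[["a"]]$b.v[1]

# The same access, one step at a time
a_df  <- x[["a"]]   # [[ ]] extracts the list element named "a" (a data frame)
b_col <- a_df$b.v   # $ extracts the column named b.v (a numeric vector)
first <- b_col[1]   # [ ] keeps the first element of that vector

identical(y, first)
```

Three different extraction operators appear in one short expression, which is a plausible reason novices find it hard to read.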

@sritchie73 I'm surprised that you find data cleaning harder in R, I would have listed that as one of its greatest strengths! Have you had a read through http://vita.had.co.nz/papers/tidy-data.html or the more recent http://blog.rstudio.org/2014/07/22/introducing-tidyr/ ?

gavinsimpson commented 10 years ago

@sritchie73 @gvwilson It is important to be neutral in providing pros and cons and some of those cons are very personal. stringsAsFactors = TRUE is, as far as I am concerned, just great; it stops people doing silly things with categorical data when fitting statistical models and let's not forget that is why R exists in the first place. It's also easy to work around once you are aware of the issue.
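A minimal sketch of the behaviour and two ways around it, assuming a tiny made-up CSV read from a string rather than a file. The argument is set explicitly in each call so the result does not depend on the default of the R version in use.

```r
# Made-up two-column data, supplied as text instead of a file
txt <- "id,group\ns1,control\ns2,treated\ns3,control"

# With stringsAsFactors = TRUE, every text column becomes a factor
as_factor <- read.csv(text = txt, stringsAsFactors = TRUE)
class(as_factor$group)
levels(as_factor$group)

# Workaround 1: switch the conversion off for the whole file
as_char <- read.csv(text = txt, stringsAsFactors = FALSE)
class(as_char$id)

# Workaround 2: keep the factors, but turn the id column back into text
as_factor$id <- as.character(as_factor$id)
class(as_factor$id)
```

The second workaround keeps group as a factor for modelling while leaving id as plain text.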

boxplot() doesn't require use of a formula; the formula is a more user-friendly way to plot multiple boxes, but you can also call the function with a matrix/data frame. Also, if one introduces plot(y ~ x, data = foo) rather than plot(foo$x, foo$y) early on, users soon learn the utility of formulas in R, how to write them, and what they mean.
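Both calling styles can be sketched with built-in datasets; iris and faithful are used here only as convenient stand-ins for foo. The results of boxplot() are assigned because the function invisibly returns the statistics it draws.

```r
# Without a formula: pass a data frame and get one box per column
by_column <- boxplot(iris[, 1:4])

# With a formula: one box of Sepal.Length for each level of Species
by_species <- boxplot(Sepal.Length ~ Species, data = iris)

# Two equivalent scatterplots of the faithful data
plot(faithful$eruptions, faithful$waiting)   # positional: x first, then y
plot(waiting ~ eruptions, data = faithful)   # formula: y ~ x
```

Note that the formula puts the response on the left, so the argument order is the reverse of the positional call.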

For cleaning data, I rarely use anything but R. That's probably because I know a lot more R than bash or Python or some other such language.

gavinsimpson commented 10 years ago

Rather than just tell people why R is so great, why not show them an excellent example?

One that is used quite often for a more statistically-minded group is creating a bootstrap confidence interval on the kernel density estimate of the Old Faithful Waiting Time data (faithful$waiting), using replicate() to do the bootstrapping. A handful of lines of code gets you the KDE, a bootstrap confidence interval and a plot with very little effort. Doing this in another language would require a lot more coding effort. The point though is not to poke fun at the other languages but to highlight that R as a language is designed for rapid, interactive data analysis.

Code

kde <- with(faithful, density(waiting))
from <- min(kde$x)
to <- max(kde$x)
boots <- with(faithful, replicate(10000, {
    samp <- sample(waiting, replace = TRUE)
    density(samp, from = from, to = to)$y
}))
ci <- apply(boots, 1, quantile, probs = c(0.025, 0.975))
plot(kde, ylim = range(ci))
polygon(c(kde$x, rev(kde$x)),
        c(ci[1, ], rev(ci[2, ])), col = "grey", border = FALSE)
lines(kde, lwd = 2)

[Figure: faithful-boot]

naupaka commented 10 years ago

Another +1 to adding a mention of knitr/rmarkdown to create nice reports. People are always impressed when I show them how easy it is to make a nice deliverable (HTML or PDF) to share with their collaborators and/or PI. I would also mention what a great resource RStudio is for coding in R, particularly for novices: built-in help, objects browser, tab completion(!!), not to mention more advanced things like git integration, etc. This makes R a lot more familiar for people coming from something like MATLAB.

I think it may also be worth mentioning briefly that different disciplines tend to have different 'default/go-to' languages. In ecology, for example, R is certainly the 'go-to', which means a lot of code from manuscripts, or for new analysis methods, is in R. I think other disciplines have other defaults, e.g. Python or MATLAB.

dhaine commented 10 years ago

Regarding data cleaning, an R novice might be a novice in other languages too, so I don't think it will help them to refer to other languages (depending on the audience). Also, as a novice, you might prefer to do as much as you can in a single environment (i.e. all in R).

benmarwick commented 10 years ago

@dhaine agree completely with both of those points. Seems a more consistent approach with the novice student as someone coming to the command line for the first time.

sritchie73 commented 10 years ago

Great to see a lot of discussion in the second day of the bootcamp!

@cboettig agreed on the syntax, but better to not demotivate novices by saying it's hard straight away. Personally I think the major reason things are so difficult is that most courses don't teach the basic data structures and how to access them, instead focussing solely on statistics, so we've all had to struggle through it. I'm a big advocate of Hadley Wickham's Advanced R in that regard.

I haven't seen tidyr, I will have to check it out!

@dhaine I completely agree with you. It's not helpful for novices to compare to other languages (unless they come from a Comp Sci background), but we also shouldn't be advocating any one language as the be-all and end-all solution.

@gavinsimpson stringsAsFactors is a double-edged sword: it's good for making sure categorical data is handled correctly, but I've been bitten a number of times in the past where it's clobbered my id column (where row.names=1 hasn't automatically worked), and it has caused problems downstream when merging data frames, or when drawing conclusions about which variables are significant.

On giving motivating examples, I believe @gvwilson 's intention was for the pitches to be quite short, ~3 mins each. See pitch.html for the swcarpentry pitch; it's quite short, so I'm not sure how much room there is for a motivating example.

gavinsimpson commented 10 years ago

@sritchie73 as a pitch, you aren't going to need to explain what each line of code does, you just need to explain the general steps (KDE in line 1, bootstrap on lines 4-7, CI on line 8, rest plotting in the above example), point to the efficiency of the small amount of code needed to do this and point to the result. We don't even need to have just one example but perhaps a few to choose from or insert your own favorite.

The pitch needs to be more than "trust me on these things, language x is great because it will save you time / allow you to do x, y, & z once you've invested a bit of time learning". At least that's been my experience.

My point re stringsAsFactors is that your personal experience with this shouldn't colour the presentation. If you mention it at all you need to indicate the utility of the default and point out the negative: that it might result in data being stored as factor rather than character. To be honest, if you don't have time to include a great motivating example, why are we even discussing a minor gotcha that is quick to work around in the pitch? There are far more important negative sides to R, like relatively slow speed, inconsistent function and argument naming in the base language and functionality, etc.

Re your reply to @dhaine, I agree that knowing about a range of languages is helpful, right tool for the job and all, but that doesn't mean R isn't an easy or useful language for data manipulation/processing. This isn't a negative against R; it isn't bad at data processing.

ramnathv commented 10 years ago

My 2 cents on the motivating example idea. I strongly concur with @gavinsimpson that to gain people's trust, it is best to walk the talk and show them how a few lines of code could do something for them that would typically take many lines of code in other languages. Having a carefully curated list of examples will allow instructors to pick the one most relevant to their audience.

gvwilson commented 9 years ago

@jdblischak @dhaine Please merge if you think this is close enough.

jdblischak commented 9 years ago

My understanding is that this content is being merged into #628.