swcarpentry / DEPRECATED-website

DEPRECATED: see https://github.com/swcarpentry/website for the current website.

Create introductory material on R #12

Closed gvwilson closed 9 years ago

gvwilson commented 11 years ago

Create material for a half-day or full-day introduction to programming using R to parallel existing material using Python.

karthik commented 11 years ago

Here's a list of topic suggestions that would be really useful for R newbies to learn in a 1/2 day workshop.

1. R basics: Not a super basic intro,* but an explanation of useful concepts that beginners deal with but that aren't covered in any detail in intro R books, e.g. the difference between S3 and S4 objects: not the technical details, but the practical issues scientists have to deal with, like accessing and manipulating what's inside them. Also the .Rprofile, issues like strings as factors, and saving and reloading data (which approach to use depending on the context, such as storing intermediate data versus saving data for the long term).

2. R Objects: Learn all about the different types of R objects (what they are, how to figure out the class of any object, creating/ manipulating/ converting objects).

3. Manipulating data: Teach people how to use the apply family (both base and plyr) with LOTS of examples. Work with one small dataset and illustrate how each of these could be used. Depending on the dynamics of the group at this stage, one could stop here and go over concepts already covered or if things are moving well, get into reshaping data (wide / long).

If someone can leave with this set of skills (with well-documented example code/data to take home) after the workshop, it would be a huge success. This would allow people to get past most of the frustrating issues that early R users typically experience.

* This assumes that nobody in the audience is a complete R newbie. Folks who have never opened R but vaguely know what it is will not benefit from this level of detail.
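A minimal sketch of the strings-as-factors and saving/reloading points from item 1, with the `class()` check from item 2. The data frame and its column names are made up for illustration, and `stringsAsFactors` is set explicitly because its default differs between R versions.

```r
# A tiny data frame; the column names are made up for illustration.
raw <- data.frame(site = c("a", "b", "a"),
                  count = c("10", "12", "7"),
                  stringsAsFactors = TRUE)

class(raw$count)   # "factor", not "numeric"

# Pitfall: as.numeric() on a factor returns the internal level codes.
wrong <- as.numeric(raw$count)
# Convert via character to recover the values that were actually typed.
right <- as.numeric(as.character(raw$count))

# Intermediate results: saveRDS() keeps classes and factor levels intact.
tmp_rds <- tempfile(fileext = ".rds")
saveRDS(raw, tmp_rds)
restored <- readRDS(tmp_rds)

# Long-term storage: plain text such as CSV is readable outside R, but
# factor levels are not stored, so classes are inferred again on reading.
tmp_csv <- tempfile(fileext = ".csv")
write.csv(raw, tmp_csv, row.names = FALSE)
reread <- read.csv(tmp_csv, stringsAsFactors = FALSE)
```

Comparing `wrong` with `right` at the console is usually enough to make the pitfall stick.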

dmcglinn commented 11 years ago

I'm really excited Soft Carp is adding R into its curriculum! I like the list Karthik started, and I just wanted to add a few comments and additions to his list.

1. R basics: here are some more basic possibly obvious topics that should be covered:

Developing proper R style could be a difficult nut to crack because established style guidelines for R are still new and relatively incomplete (to the best of my knowledge); however, in my opinion this is possibly one of the most important topics and should really be drilled in: good style --> readable code --> reproducible science.

I agree with Karthik that discussing S3 vs S4 objects with R newbies is a bad idea. I have coded in R for years now and I still don't have a good grasp on this. S3 vs S4 could be a good topic for more advanced folks interested in developing their own packages.

Karthik, if you have a chance, can you elaborate on the .Rprofile issues you mentioned? I'm curious what you are referring to. How R treats strings as factors is a pitfall that I see coming up again and again for newbies and even seasoned R users; however, in my opinion, covering R pitfalls should be kept until day 2 of a 2-day workshop if folks are R newbies.

2. R Objects:

3. Manipulating data:

Before any apply functions are even mentioned the basic for loop must be covered. The first apply function should be sapply because this is conceptually the closest to a for loop and therefore easiest to grasp, then apply, mapply, tapply, aggregate, if there is time.
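A sketch of that progression, assuming a small made-up list of counts per plot: the explicit for loop first, then `sapply()` doing the same job, then `tapply()` for grouped data.

```r
# Hypothetical measurements: three plots with a few counts each.
counts <- list(plot1 = c(3, 5, 4), plot2 = c(10, 8), plot3 = c(1, 2, 3, 2))

# 1. The basic for loop: preallocate, then fill one element at a time.
means_loop <- numeric(length(counts))
names(means_loop) <- names(counts)
for (i in seq_along(counts)) {
  means_loop[i] <- mean(counts[[i]])
}

# 2. sapply() expresses the same loop in one call and keeps the names.
means_sapply <- sapply(counts, mean)

# 3. tapply() applies a function to groups defined by a second vector.
values <- unlist(counts, use.names = FALSE)
groups <- rep(names(counts), times = lengths(counts))
means_tapply <- tapply(values, groups, mean)
```

All three objects hold the same per-plot means, which makes it easy to show that the apply functions are loops in disguise.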

4. Simple linear modeling as a bring-it-all-together example: I think working with simple linear models is a good topic for developing a new user's excitement about and understanding of R, because very simple, interpretable results can be generated with a few lines of code, yet many additional layers of complexity exist. The instructor can choose to develop the student's grasp of the model object and its attributes, how to harness the power of a data.frame object when building the model, how to work with functions, and how to plot results, all within a simple, well-defined, common problem that emphasizes R's strengths.

karthik commented 11 years ago

Great comments, Dan! The biggest challenge with a half day workshop is that it is half a day. So a workshop crammed with too many topics in such a short time may not work at all.

Although I am a big proponent of well-documented and properly styled code, I wouldn't cover that here. It would be great to do on day 3. Same with linear modeling. Scientists learn statistics concepts elsewhere, so confounding the two in this short window wouldn't work all that well (and I speak from experience). It would be great if there were more time, but it's not a priority as far as getting started with R. Linear modeling is really well covered in most R books, so a beginner with a good grasp of the nuts and bolts of R could easily tackle this on their own.

As for the .Rprofile, the idea there was to cover various options that one could set up for R, including several diagnostic functions to aid with data analysis. Although any functions and options stored in a user's .Rprofile aren't reproducible (since they are not typically shared), diagnostic functions are almost never included in final code. Learning how to set these options can be really useful.

PS: Good call on packages. To add to that, including some additional instructions on where to look for packages to accomplish a particular task and how to set repositories (which ties in with the .Rprofile) would be great.
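For concreteness, a hypothetical .Rprofile along those lines might look like the sketch below. The helper name `peek` is made up, and the CRAN mirror URL is just one common choice. Note that an option such as `stringsAsFactors` changes results, so it is exactly the kind of setting that makes a script behave differently on someone else's machine.

```r
# ~/.Rprofile (hypothetical contents; R sources this file at startup)

# Pick a CRAN mirror once so install.packages() does not ask each time.
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Avoid the strings-as-factors surprise in data.frame() and read.csv().
options(stringsAsFactors = FALSE)

# A small diagnostic helper (name is made up): size, column classes, first rows.
peek <- function(df, n = 3) {
  cat(nrow(df), "rows x", ncol(df), "columns\n")
  print(sapply(df, class))
  print(head(df, n))
  invisible(df)
}
```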

dmcglinn commented 11 years ago

Hey Karthik! Thanks for pointing out my blunder; I thought they were looking for ideas for a 2-day workshop, not a half-day one. Still, even with a little more time, your point about the dangers of mixing stats and programming is well taken, and that mix should probably be avoided given your past experiences.

With respect to proper style: there may not be enough time to cover style exhaustively, but if, for example, one is going to teach the concept of a for loop, then that seems like a great opportunity to insert good R stylistic points on the fly. Also, I personally would like to see R object naming conventions improved (e.g., avoiding dots in object names), and this is not going to happen unless we teach beginners good naming rules. Lastly, the general thrust of the style rules developed for R could be in many ways similar to that of the style rules developed for Python programming. So if this R module were part of a longer soft carp course, many of the style issues could be handled earlier in the week in a more general best-programming-practices kind of way.

barryrowlingson commented 11 years ago

I'm in on this - it was something I wanted to do as part of my Software Sustainability Fellowship proposal.

One important thing is to make sure students understand that R is not the interface. By which I mean that R runs on the command line, or inside RStudio, or in Emacs with ESS. I would recommend RStudio for beginners, just so all the windows are in one place and they have easy editor integration, but they have to be aware that what they are looking at is not R.

I think the linear modelling suggestion is a good one. I wouldn't teach the theory at all, but show the equations of least-squares fitting and implement them in an R function. It's only a few lines in its simplest form, and I would hope all the students can grasp the concept of a straight-line least-squares fit. Then you can go on to write functions for the residuals, or do plots with straight lines added, and so on.

Then you tell them about the lm function and all its goodies, if they need it.
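A minimal sketch of that sequence, assuming the usual closed-form equations for a straight-line fit (slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x)). The function name `fit_line` and the data are made up for illustration.

```r
# Closed-form least-squares fit of y = intercept + slope * x.
fit_line <- function(x, y) {
  x_bar <- mean(x)
  y_bar <- mean(y)
  slope <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
  intercept <- y_bar - slope * x_bar
  c(intercept = intercept, slope = slope)
}

# Made-up data lying roughly on y = 2x + 1.
x <- c(1, 2, 3, 4, 5)
y <- c(3.1, 4.9, 7.2, 9.0, 10.8)

by_hand <- fit_line(x, y)

# Residuals follow directly from the fitted coefficients.
res <- y - (by_hand["intercept"] + by_hand["slope"] * x)

# The built-in lm() should agree with the hand-rolled version.
builtin <- coef(lm(y ~ x))
all.equal(unname(by_hand), unname(builtin))
```

Seeing the two sets of coefficients agree gives students a reason to trust `lm()` before exploring its extras.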

gvwilson commented 11 years ago

Thank you all for the comments, but please also keep in mind that we're not really here to teach them R (any more than we're here to teach them Python). We're here to teach them programming: most particularly, when and how to break things into functions that will fit in their heads and can be re-used, and how to build little tools of their own that play nicely with others (e.g., conform to the Unix stdin-stdout-stdargs conventions). While we're waiting to cut over to Git, have a look at http://svn.software-carpentry.org/swc/5.0/website/book/python.html, .../funclib.html, and .../quality.html. How would you teach the concepts in the keypoints and understanding sections of those chapters using R?
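As one possible answer for the "little tools that play nicely with others" part, here is a sketch of a filter that reads from standard input and writes to standard output. The function name `write_mean` and the file names in the comments are hypothetical.

```r
# Read numbers (one per line) from `input`, write their mean to `output`.
# Taking connections as arguments keeps the function reusable and testable.
write_mean <- function(input, output = stdout()) {
  values <- suppressWarnings(as.numeric(readLines(input)))
  values <- values[!is.na(values)]        # skip blank or non-numeric lines
  writeLines(format(mean(values)), con = output)
}

# In a script run as `cut -f2 data.tsv | Rscript colmean.R` (file names are
# hypothetical) the entry point would be the single line:
#   write_mean(file("stdin"))

# Here a text connection stands in for standard input.
demo_input <- textConnection(c("1.5", "2.5", "", "n/a", "5"))
write_mean(demo_input)
close(demo_input)
```

Keeping the logic in a function that accepts any connection means the same code serves the pipeline and the test.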

barryrowlingson commented 11 years ago

Nothing but 404's on that link for me.

+1 to all those sentiments, but I don't believe you can learn programming in the abstract without a concrete application. There's only so far you can go with foo, bar, and baz before you have to introduce sufficient context for the student's mind to map what's being taught to their specific application. The students are quite likely to be statisticians or general data mungers and a simple statistical example would probably be appropriate.

Within that framework, you teach compartmentalising things, having sensible defaults, writing tests etc. So you wouldn't write one function that did the fit, computed the residuals, and the r-squared, and p-values and so on and returned the whole lot.
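A sketch of what that compartmentalising could look like: one small function per job, a default argument, and a few tests on data with a known answer. All function names here are made up for illustration.

```r
# Each function does one job; names and the data are hypothetical.
fit_line <- function(x, y) {
  slope <- cov(x, y) / var(x)
  c(intercept = mean(y) - slope * mean(x), slope = slope)
}

predict_line <- function(coefs, x) {
  unname(coefs["intercept"] + coefs["slope"] * x)
}

line_residuals <- function(coefs, x, y) {
  y - predict_line(coefs, x)
}

# Sensible default: report R-squared rounded to 3 digits unless told otherwise.
r_squared <- function(coefs, x, y, digits = 3) {
  ss_res <- sum(line_residuals(coefs, x, y)^2)
  ss_tot <- sum((y - mean(y))^2)
  round(1 - ss_res / ss_tot, digits)
}

# Small tests on data with a known answer: points exactly on y = 2x + 1.
x <- 1:5
y <- 2 * x + 1
coefs <- fit_line(x, y)
stopifnot(
  isTRUE(all.equal(unname(coefs), c(1, 2))),
  isTRUE(all.equal(line_residuals(coefs, x, y), rep(0, 5))),
  r_squared(coefs, x, y) == 1
)
```

Because each piece is separate, a student can test the residuals without caring how the fit was computed.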

Most of the stuff you want to cover is probably in Patrick Burns' The R Inferno, ideal bedtime reading...

gvwilson commented 11 years ago

1. Link fixed (sorry about that).
2. I agree about 'foo' and 'bar' --- the evidence says every example should be real and concrete.
3. "The R Inferno" is a good guide for people who already understand programming, but my feeling (untested in the classroom) is that it would be telling novices how to solve problems they don't realize they have.

We'll be discussing all of this in the next round of the online study group (http://teaching.software-carpentry.org), which will kick off in January. If you (singular or plural) want to take part, and build some R material as your practical exercises, that would be very cool.

karthik commented 11 years ago

Also agree that foo and bar aren't the way to go. One thing to remember is that when someone takes the initiative to attend an R workshop, more often than not they already have data to work with and problems to solve. So using concrete examples (the one solid dataset I suggested above, appropriate for the audience) would work well.

R Inferno is too much for anyone attending an intro workshop.

I don't fully know the SWC philosophy, but when I've run similar R workshops in the past, my focus has always been to help people become self-sufficient, learn how to solve problems, and avoid some common pitfalls that beginners face (or that folks like me face, who learned things the wrong way and then had to unlearn them at some point).

Here is a 2 hour workshop/talk I gave to a bunch of Integrative Biology grad students.

ethanwhite commented 11 years ago

Great to see all of the excitement from R folks and apologies if I caused any confusion about the overall goal. Let me provide a little more context for folks new to Software Carpentry. We are typically running 2 day workshops targeted at folks who already have some basic programming skills (expressions, types, variables, loops, conditionals, and functions) in any language. These workshops cover at least:

  1. The shell
  2. Version control
  3. Good programming (breaking things into functions, stdin, stdout, etc., as Greg mentioned above)
  4. Testing

The first two aren't tied to a programming language, so we don't need to choose one. The last two both require that we teach them in something. For the last decade or so this has been Python. Since we can't count on everyone in the classroom knowing Python (and we don't want to, because what we are trying to teach are language-agnostic skills), we have to give enough of an intro to the language to get those who haven't used it up to speed so that they can learn the other material. Since in some areas of science R is becoming the lingua franca, we would like to be able to teach 3 & 4 in R for some workshops to avoid the extra cognitive load of trying to teach these skills in an unfamiliar language. Typically between 0.5 and 1 days would be spent on the programming language, whichever it is, depending on how many other topics (e.g., regular expressions, SQL) are covered.

Given that we can't assume everyone in the room has an R background, we'll still need to spin folks up on the basics, including at least one of the list-like objects. apply() is probably going too far (we don't teach map() in Python either). We do typically cover reading and writing data. Then the key thing is to gradually build an example from the ground up, talking about how to structure the code in readable and maintainable ways, and how to test it. The example around which all of this is built could be based on a simple science problem with real data and a simple linear model, especially since statistics is the gateway into programming for a number of scientists using R.

We do sometimes teach some NumPy, which would be equivalent to doing more advanced data manipulation and/or linear algebra, so in that context some of the more R-specific approaches to working with data could be appropriate. Typically this is taught in place of SQL or another optional topic and is for a relatively short period of time (1-2 hours). This is a good place to note that I'm making some generalizations here from my experience that Greg may need to correct.

Since our community of developers and instructors is Python-centric, we definitely need help to get this off the ground, both in terms of developing the basic material and delivering it at workshops. Thanks again for all of the interest. Let's keep the conversation going about how to best teach these core computational skills in R.

cboettig commented 11 years ago

It's great to see this discussion. Though primarily an R user myself, I have loosely followed Software Carpentry materials online for several years, even administering one of its surveys on best practices to the other 100 or so US students in my computational science graduate fellowship program (e.g. example results).

Thanks Ethan for the outline above (of course 1 & 2 are tied to language, just not tied to Python. Is SWC teaching git now in place of svn?). For teaching good development practices in R (literate programming utilities, style guidelines, writing test cases, etc.) there are few better places to start than Hadley's devtools. The documentation on the wiki there is very readable and, I believe, very consistent with SWC philosophy. The devtools software package provides a one-stop shop for the few utilities that are missing from base R (e.g. a Doxygen-style literate programming system in place of the clumsy native documentation). There are more advanced topics listed as well, but those are clearly marked and separated. Hadley has a ton of experience teaching SWC concepts (um, industry-standard programming practices?) to people without previous training or who don't see themselves as programmers.
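To give a flavour of that workflow, here is a sketch assuming the Doxygen-style system referred to is roxygen2 and the test cases are written with testthat (both are companions to devtools). The function `std_error` is a made-up example, and the test block only runs if testthat is installed.

```r
#' Standard error of the mean
#'
#' @param x A numeric vector.
#' @param na.rm Should missing values be dropped before computing?
#' @return A single numeric value.
#' @examples
#' std_error(c(2, 4, 6))
#' @export
std_error <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# In a package this would live in tests/testthat/ and run via devtools::test().
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("std_error matches the textbook formula", {
    testthat::expect_equal(std_error(c(2, 4, 6)), 2 / sqrt(3))
    testthat::expect_equal(std_error(c(2, 4, 6, NA), na.rm = TRUE), 2 / sqrt(3))
  })
}
```

Inside a package, `devtools::document()` would turn the `#'` comments into help pages, so the documentation stays next to the code it describes.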

ctb commented 11 years ago

Note that Mark Blaxter in Edinburgh has been running R workshops for biologists for a while now. Here are his materials, battle tested etc.

https://www.wiki.ed.ac.uk/display/AshworthBioinformatics/Ashworth+R+Resources


ethanwhite commented 11 years ago

of course 1 & 2 are tied to language, just not tied to python. Is SWC teaching git now in place of svn?

Yes, of course, and yes, we're in the process of generally moving over to git instead of svn (see #1 and these recent blog posts). I believe that some workshops may also be teaching Mercurial in cases where that is most appropriate for the audience. Again, the goal is to teach version control in general, not a particular system, but this trades off against trying to support too many different versions of each tool.

LJWilliams commented 11 years ago

If we decide to do this, I am willing to help. As much as I think R is great for scientific data analysis, it is probably more useful to focus on a more general programming language (besides, most of the R packages I use can be called from Python, which makes learning R somewhat redundant except in special cases). It would be helpful to have some units on R in the list of tutorials, though (or at least links to some good R material on the web).