Closed gvwilson closed 10 years ago
For scientists whose understanding of data analysis consists entirely of Excel and/or a GUI stats program (e.g. SPSS, Minitab, etc.), I see the main goals of a SWC R bootcamp as the following:
And then time permitting, other R topics could be covered:
I also think the only other topic covered should be the shell. I have taught version control to many beginners, and I have not found it overly fruitful since we are providing them a solution to a problem they do not yet have.
What do we currently have?
The most complete set of novice R materials we currently have are those in misc-r, which were developed by @jennybc. I think the best material in this set includes the introduction to R working environment and RStudio and the explanation of data types.
However, it still needs lots of attention before it is useful for R bootcamps. Its current limitations include the following:
Going forward
I think we should use @jennybc's introductory lesson as the first lesson for the novice bootcamp. I also think we should adapt her lecture notes on R data types (currently in PDF format) into a flat text format. However, from there I suggest we start from scratch and develop R materials with more exposition on core SWC concepts. We need to decide on what that core content will be and also the format that should be used (please debate the format at #92).
Minor comment: don't stress loops (formal ones, `for()`, `while()`, etc.) too much, as to do them properly you need to get into memory allocation to avoid the "loops are slow" rubbish. I suspect that at this level functions like the `apply()` family and `aggregate()` are probably better to focus on, and only mention the formal loops.
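As a rough illustration of the point about loops, the sketch below computes the same per-group means three ways: a formal loop that allocates its result up front, `tapply()` from the apply family, and `aggregate()`. The data frame `df` and its columns are made up for the example.

```r
# Hypothetical example data: 10 measurements in each of 3 groups.
set.seed(1)
df <- data.frame(
  group = rep(c("a", "b", "c"), each = 10),
  value = rnorm(30),
  stringsAsFactors = FALSE
)

# A formal loop done with care: allocate the result once, then fill it in,
# rather than growing a vector on every iteration.
groups <- unique(df$group)
loop_means <- numeric(length(groups))
names(loop_means) <- groups
for (g in groups) {
  loop_means[g] <- mean(df$value[df$group == g])
}

# The same summary with a member of the apply family ...
tapply_means <- tapply(df$value, df$group, mean)

# ... and with aggregate(), which returns a data frame.
agg_means <- aggregate(value ~ group, data = df, FUN = mean)
```

All three give the same numbers; the last two simply hide the bookkeeping that the loop makes explicit.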
I'd urge you not to focus the environment on RStudio; it's great and users should be encouraged to use it if they want, but I'd prefer to teach command-line usage (in the Windows or MacOS GUIs or the Linux shell) and supplement that with some RStudio-specific materials. You'd want students happy and familiar at the command line and then they can use any R interface, including RStudio.
+1 for not really getting into version control for beginners (will elaborate later; basically gentle intro is possible using github only mostly as a consumer)
+1 for using RStudio (-1 for not using it?....)
+1 for less loops and more proper data aggregation; I vote for plyr
+1 for overview of topics in vs out by @jdblischak
My material is in better form, in particular R Markdown vs PDF, in my STAT 545A course materials.
Just to clarify my RStudio point: the majority of the materials should be R IDE-agnostic; a student ought to be able to use whatever IDE they want and still make progress. E.g. teach package management via the R functions, not the GUI in RStudio. Etc. Fine to have a set of slides on working with RStudio but keep the other materials agnostic.
I'm +1 for requiring students at bootcamps to use RStudio, as that gives a common interface for the bootcamp, which if nothing else helps instructors.
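A minimal sketch of what IDE-agnostic package management could look like, using only functions that ship with R. The helper name `ensure_installed` is hypothetical, not something from the bootcamp materials.

```r
# Hypothetical helper: install only the packages that are not yet available.
ensure_installed <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing <- pkgs[!available]
  if (length(missing) > 0) {
    install.packages(missing)  # uses whatever CRAN mirror is configured
  }
  invisible(missing)
}

# "stats" and "utils" ship with R, so nothing is downloaded here.
not_found <- ensure_installed(c("stats", "utils"))

# Other calls that work the same in any R interface:
# library(plyr)                   # attach an installed package
# update.packages()               # update installed packages
# rownames(installed.packages())  # list what is installed
```

Because these are ordinary function calls, they can live in a script and behave identically at the command line and in RStudio.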
-1 for plyr at least as the only data aggregation interface. plyr is a useful tool, but it makes you think about the aggregation issue in a very particular way - i.e. Hadley's way. There's also a huge pile of material out there that doesn't use Hadley's ecosystem. So I'd vote for teaching some of the basic R tools for this (`split()`, `apply()` family, then the combine step `do.call(r/cbind, list)` or simply `unlist()`) and then introduce plyr.
One reason for not focussing solely on plyr is that it is awfully slow; so slow you'd never want to use it in production functions. But if you never learn about the tools in base R, which are much faster but more quirky, then plyr is the only tool you reach for. This may change in the future of course with dplyr but that isn't going to be on CRAN for a while.
We have to use RStudio with beginners. The only material that will be RStudio specific is the intro lesson orienting the students to the IDE, searching for help, and using knitr.
I am also +1 on plyr. The main bottleneck for me as a scientist is the initial exploration of the data, the formalization of my analysis into a reusable script, and understanding the code when I return days/weeks/months later when I need to change something. The runtime of the code is extremely negligible in comparison. plyr provides a useful framework that I can keep in working memory while I code. I foresaw this flame war about teaching Hadley vs. base coming. I cast my vote for Hadley.
Thanks for the inflammatory response John - since when did voicing an opinion count as flaming?
I have no problem at all with Hadley's ecosystem of packages - I use many of them daily myself as part of my research and teach them regularly. However, I'm also not keen on equipping useRs with only an opinionated approach to working with and using R (Hadley's word "opinionated", not mine).
Look, if you are producing novice material and intermediate R materials you are clearly expecting that at least some useRs will progress from one to the other. If at the intermediate level and beyond you are introducing package development etc. you don't want to be relying on plyr or some other packages that are more user-oriented. So then you start introducing some of the other tools in base R instead of focussing on package development etc. All I was suggesting is that you include both; explain how to do it with the raw tools R gives you (understanding split-apply-combine is quite informative if you get students to actually `split()` a data frame by a factor vector, look at the resulting list, `lapply()` a function over it, and recombine the result, by hand), then mention some of the idiosyncrasies (which plyr was created to solve) and then introduce plyr as a user-level tool to simplify data aggregation and analysis steps. Also, working with lists and getting novice users comfortable with subsetting via `[` and `[[` are useful lessons, which working with the raw tools reinforces.
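The by-hand exercise described above might look roughly like this. The data frame and its column names are invented for illustration.

```r
# Hypothetical data: heights (cm) for two groups.
df <- data.frame(
  sex    = factor(c("f", "m", "f", "m", "f", "m")),
  height = c(160, 175, 165, 180, 170, 185)
)

# Split: one data frame per level of the factor, held in a list.
pieces <- split(df, df$sex)
names(pieces)

# `[` keeps the list wrapper, `[[` extracts the element itself.
class(pieces["f"])    # "list"
class(pieces[["f"]])  # "data.frame"

# Apply: summarise each piece.
summaries <- lapply(pieces, function(piece) {
  data.frame(sex = piece$sex[1], mean_height = mean(piece$height))
})

# Combine: bind the per-group results back into one data frame.
result <- do.call(rbind, summaries)

# When each piece reduces to a single number, unlist() is enough.
simple <- unlist(lapply(pieces, function(piece) mean(piece$height)))
```

Inspecting `pieces` and `summaries` at each step is the point of the exercise; the final `do.call()` is the quirky part that higher-level tools hide.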
I see the same issues with plotting; Would you suggest we only use/teach ggplot2 (or lattice) because they make data vis so much easier? At the expense of Base graphics?
@gavinsimpson - Welcome to the Software Carpentry repository! I don't have any direct contributions in this thread, but I wanted to quickly jump in before things get derailed.
I'm pretty sure that @jdblischak did not mean to imply that you were flaming, just that we were approaching a topic where instructors may be sharply divided. Let's try to assume best faith on here, and if you're feeling personally attacked, please email Greg or me immediately so we can sort things out.
Thanks!
I apologize, @gavinsimpson. I used the term "flame wars" as a succinct way to describe the decision we were trying to make, but I completely see why you understood it to be inflammatory and regret having used it. And thanks to @ahmadia for helping mediate the situation.
As for the solution, it looks like the best course forward from here is to create some beginning materials using base R aggregation functions as @gavinsimpson suggested. This lesson could go before @jennybc's lesson using plyr. Since we seem to agree that the specifics of version control can be left out of a beginner boot camp, that should leave us more time for R lessons like this.
Also, thanks to @jennybc for the links. These are much easier to follow than what is currently in the `bc` repo. Any plans on trying to migrate them in the near future? We could use your lessons at the beginning to familiarize the students with R. Then the second part of the bootcamp could be focused on organizing everything into more formal scripts.
> Since we seem to agree that the specifics of version control can be left out of a beginner boot camp.
Apologies for jumping into the middle of the conversation. I haven't been following this issue since I was interpreting it as being exclusively about the R portion of bootcamps. It seems to me that a discussion related to removing version control from Intro bootcamps that teach R (and therefore presumably also from Intro bootcamps more generally) is a larger conversation that would require its own issue for discussion, especially since version control material for Intro bootcamps is specifically being developed in #146.
Sorry to disagree again John, but a good argument could be made to keep version control, especially for R scripts, as it allows the tracking of changes to the codes underlying a paper or thesis etc. Using VC speaks to reproducibility of the scientific process and probably should be held up as something important for bootcamp attendees to buy into.
Adding to @ethanwhite's comment, version control is an absolutely core part of Software Carpentry's curriculum. I don't personally think that removing it from any boot camp is even up for discussion, but if someone wants to bring that up it definitely deserves its own thread since it's not related to R at all.
Hi everyone, +1 to Matt's comment: version control is part of our core (along with modular programming, task automation, and testing). How much and how far depends on the audience --- I've taught Git without ever introducing remotes, for example --- but as per http://software-carpentry.org/faq.html#trademark, we require people to hit these if they want to use our name and logo. Thanks, Greg
A general point I also made on #129 that may get at why we're struggling to design novice vs intermediate R bootcamps following a more Python-y set of SWC principles:
People use Python because they need a general purpose scripting language and/or they need to analyze data.
People learn R because they need to analyze and visualize data, especially if they need statistical tools of at least moderate sophistication. Once they're in deep enough, they might also use it for general programming/scripting stuff, because it's convenient to keep dependencies and interfaces to a minimum.
So the motivation and priorities for the typical boot camper (and instructor, for that matter!) are rather different.
I break R usage into three modes
I think a beginner R SWC bootcamp should emphasize topics in 1. and touch on some in 2. The intermediate can revisit topics in 1. but push really hard on topics in 2. and 3.
Reproducibility is a theme that cuts across both boot camp levels. For beginners, the goal is to get them saving scripts, not using the mouse for anything mission critical, and an intro to dynamic report generation. Maybe writing a pseudo-Makefile = a master R script to run the other? I believe unit testing in R belongs in an intermediate bootcamp, probably in the context of package development.
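One way to picture the "master R script" idea is sketched below. The script names and their contents are hypothetical, and the two step scripts are written to a temporary directory only so that the example is self-contained; in a real project they would already exist and the master script would contain just the last few lines.

```r
# Set up a hypothetical project with two small step scripts.
project_dir <- tempfile("project")
dir.create(project_dir)
writeLines(
  "raw <- data.frame(x = 1:10, y = (1:10)^2)",
  file.path(project_dir, "01_load.R")
)
writeLines(
  "fit <- lm(y ~ x, data = raw)",
  file.path(project_dir, "02_analyze.R")
)

# The master script: run every step, in order, in one shared environment.
run_env <- new.env()
steps <- c("01_load.R", "02_analyze.R")
for (step in steps) {
  message("Running ", step)
  source(file.path(project_dir, step), local = run_env)
}
```

Unlike a real Makefile, this reruns every step each time; it trades dependency tracking for something a beginner can read top to bottom.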
I have found it awkward to fit version control into R beginners boot camp. Learners are usually desperate to learn more about visualization or `knitr`, for example, right when we switch over to version control. The software setup and cognitive load feels huge compared to the payoff at midday of bootcamp day 2. They've barely got anything to put under version control, so the organic interest and engagement is not necessarily there.
@jennybc - Thanks for the comparison, that's really insightful.
Proposed policy re: RStudio: we use it in bootcamps because learners will probably want to use it later and it eliminates great heaps of OS-specific issues with editors, etc. We do NOT rely on RStudio buttons, etc. to accomplish anything mission critical. Exception: we might use RStudio buttons to get a certain product quickly (e.g. Knit HTML or Compile Notebook). With the desired output in hand, learners are then receptive when we show how to emulate at the command line (e.g. `knit()` or `stitch()`).
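For reference, emulating the Knit button from the console could look something like the sketch below. It assumes the `knitr` package is installed (the call is skipped otherwise), and the tiny report it writes to a temporary file is invented for the example.

```r
# Build a tiny, hypothetical R Markdown file in a temporary location.
fence <- strrep("`", 3)  # three backticks, built here to keep the listing tidy
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "# A tiny report",
  "",
  paste0(fence, "{r}"),
  "summary(cars$speed)",
  fence
), rmd_file)

# Knit from the console instead of clicking a button.
knitr_available <- requireNamespace("knitr", quietly = TRUE)
if (knitr_available) {
  md_file <- knitr::knit(rmd_file,
                         output = tempfile(fileext = ".md"),
                         quiet = TRUE)
}

# knitr::stitch("analysis.R") would build a report straight from a plain
# R script; "analysis.R" is a placeholder name.
```

`knit()` returns the path of the file it wrote, so the result can be opened or passed on to another step of a scripted workflow.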
+1 on the proposed policy re RStudio, Jenny.
My view re: `plyr` and using the Hadley ecosystem more generally. I've wrestled with the same things that worry @gavinsimpson but have ultimately come down on the side of "yes use them, yes teach them". Even to the exclusion of base stuff.
I apply this to visualization. Yes, I practically ignore base graphics; for years, I emphasized `lattice` and am in the process of getting my coverage of `ggplot2` up to the same level. Not sure yet if I'll keep using both or will just switch over to `ggplot2`. For the beginner, if it's not easy to make a scatterplot with color encoding male vs female and a smooth regression line, you're showing them the wrong approach.
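The scatterplot described above, sketched in `ggplot2`. The `people` data frame is simulated for the example, and the plotting code is skipped if `ggplot2` is not installed.

```r
# Hypothetical data: simulated heights and weights for two groups.
set.seed(42)
people <- data.frame(
  sex    = rep(c("female", "male"), each = 25),
  height = c(rnorm(25, 165, 6), rnorm(25, 178, 7))
)
people$weight <- 0.9 * people$height - 90 + rnorm(50, 0, 5)

has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
if (has_ggplot2) {
  library(ggplot2)
  # Colour encodes sex; geom_smooth(method = "lm") adds a fitted line
  # for each colour group.
  p <- ggplot(people, aes(x = height, y = weight, colour = sex)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)
  # print(p) draws the plot.
}
```

Mapping `colour` in the top-level `aes()` is what makes both the points and the fitted lines split by group.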
I apply this to data aggregation, especially for beginners. I made the conversion from base R to `plyr` also recently and am very happy. I'm a big believer in using data.frames, using factors but using them well, etc. The sanity and predictability of what `plyr` functions return is incredibly valuable. I felt like I had clubbed a baby seal every time I inflicted `do.call()` on a beginner, which is almost unavoidable using base R functions for data aggregation.
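To show the contrast being drawn here, the sketch below computes one grouped summary with base R (ending in `do.call()`) and again with `plyr::ddply()`. The data are invented, and the `plyr` call is skipped if the package is not installed.

```r
# Hypothetical data.
heights <- data.frame(
  sex    = c("f", "m", "f", "m"),
  height = c(160, 175, 170, 185),
  stringsAsFactors = FALSE
)

# Base R: split, lapply, then do.call(rbind, ...) to get a data frame back.
base_result <- do.call(rbind, lapply(
  split(heights, heights$sex),
  function(piece) data.frame(sex = piece$sex[1],
                             mean_height = mean(piece$height))
))

# plyr: data frame in, data frame out, in a single call.
has_plyr <- requireNamespace("plyr", quietly = TRUE)
if (has_plyr) {
  plyr_result <- plyr::ddply(heights, "sex", plyr::summarise,
                             mean_height = mean(height))
}
```

Both produce one row per group with the same means; the difference is how much list handling the learner has to see.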
@jennybc - thanks also for the general points comparing the ideas behind python vs R bootcamps.
I disagree a bit about the emphasis on analysis in 1. For domain-specific courses analysis will be an important component, but for a SWC bootcamp? For that I feel that understanding the language is more important, with touching on programming (writing your own functions etc.) also being an important SWC bootcamp (BC) topic, but not necessarily an important domain-specific component at the novice level.
For a BC, I have to agree with other comments on keeping/including version control. Again, there'd be other more important topics if I were giving a domain-specific workshop or course, but for a BC and the things that SWC aims to convey? For me it is essential.
As for them not having anything to version control, what about the script(s) they are writing for examples as they go through the early material? If they write a new script for each Section/Part of the BC, all in a single folder, they'd have things to version control at that point and something to work with when you introduce version control topics.
I wonder if part of the issue here is not python vs R, but lessons vs bootcamps? Lessons that people might want to take and adapt for their own domain-specific classes probably don't need to include version control, but a lesson in a BC probably should.
@jennybc - regarding your plyr comment, and I don't want to prolong that particular discussion, but I do wonder if this also doesn't boil down to the conflict between domain-specific courses/workshops and SWC bootcamps (BC). In a domain-specific course I'd have no issue with adopting plyr or ggplot2 or whatever higher-level packages were needed. But my understanding of BCs is that they are a bit different to this, that regardless of what programming language is being taught, the aim is to convey some basic, critical principles of using your computer effectively. In that situation, I would expect more recourse to the core language and some common key skills (which I mention in an earlier comment).
I disagree about the plotting example; I think it is arguably harder to get a student to do what you propose and understand it in ggplot than in base graphics, which probably only requires 3 lines of code (probably comparable in length but of lower complexity than ggplot). You're also cutting the student off from the majority of plotting code examples available in books, on websites, and used in other packages. If all you want the students to do is make their own plots, then fine, but I suspect those students will probably end up working with code that uses base graphics and then wondering what to do.
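For comparison, a base graphics version of the plot under discussion might look like the sketch below: three plotting calls plus a model fit. The `people` data frame is simulated for the example, and `pdf(NULL)` is used only so that the code runs without opening a graphics window.

```r
# Hypothetical data: simulated heights and weights for two groups.
set.seed(42)
people <- data.frame(
  sex    = factor(rep(c("female", "male"), each = 25)),
  height = c(rnorm(25, 165, 6), rnorm(25, 178, 7))
)
people$weight <- 0.9 * people$height - 90 + rnorm(50, 0, 5)

pdf(NULL)  # null device; drop this line to draw on screen

# Points coloured by the integer codes of the factor.
plot(weight ~ height, data = people,
     col = as.integer(people$sex), pch = 19)
# One overall least-squares line.
fit <- lm(weight ~ height, data = people)
abline(fit)
# Legend matching the colour codes to the factor levels.
legend("topleft", legend = levels(people$sex),
       col = seq_along(levels(people$sex)), pch = 19)

dev.off()
```

This draws a single regression line for all points; a separate line per group would need one `lm()` and `abline()` per level.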
Your point about `do.call()` is well made :-)
Even in a typical Python boot camp we teach very little about doing data analysis with Python. That's not because the students are not going to do that, they absolutely are (at least we hope). It's because that's not the actual goal of Software Carpentry.
We teach scripting in boot camps because we want to move people toward one-button reproducible workflows, but that's only one aspect of what Software Carpentry is about. If they are going to write code, even just a little, in any language, the tools and methods of software engineering are going to help them and they will have the easiest time adopting those if we start them early.
Teaching R is great, but if that's all you do it isn't a Software Carpentry boot camp. It's an R boot camp.
@jiffyclub Good point. Helps me refine what I find awkward. Maybe it's not R vs Python. I think it's hard to teach these "meta" issues about good programming practices to folks who aren't yet able to do cool and useful things with whatever language we're talking about. Chicken. Egg.
If learners don't have that base competence, the few glimpses you give them of doing useful things -- great example: visualization with `ggplot` -- are very tantalizing. Which creates the tension between continuing to show them that stuff versus building their software engineering foundation. That foundation is also incredibly useful -- but it doesn't have the same immediate gratification.
I didn't mean for this to get so existential. :confused:
On Thu, Nov 14, 2013 at 05:04:09PM -0800, Jennifer (Jenny) Bryan wrote:
> @jiffyclub Good point. Helps me refine what I find awkward. Maybe it's not R vs Python. I think it's hard to teach these "meta" issues about good programming practices to folks who aren't yet able to do cool and useful things with whatever language we're talking about. Chicken. Egg.
> If learners don't have that base competence, the few glimpses you give them of doing useful things -- great example: visualization with `ggplot` -- are very tantalizing. Which creates the tension between continuing to show them that stuff versus building their software engineering foundation.
Going from “no basic competence” to “does useful things with sustainable development practices” is a big transition. Probably too big for two days ;). I think SWC is about finding a minimal useful thing (plot a regression?), and then focusing on the software development practices in the setting of this simple example. The example needs to be fancy enough to inspire your students, but not so fancy that you spend lots of time on the mechanics of that example (as distinct from the mechanics of generic development). That's why Greg picks examples like “convert among temperature units” in #132.
I feel the same way that @gvwilson, @jiffyclub, and @gavinsimpson feel about version control and the general identity of a SWC Bootcamp. For a little historical perspective I think it's also helpful to take a look at the summary of our discussion about adding R as a language being taught by SWC [1].
I mistakenly thought that the creation of separate novice and beginner bootcamps included the discussion not only of the depth to cover each topic but also which topics to be covered. I apologize for my clear misunderstanding of the goals of the recent reorganization, and I'll strive to be more diligent in my reading in the future. Please disregard my previous uninformed contributions to this thread.
Hi John, Please don't apologize --- any discussion of level necessarily includes some discussion of width, and I should have been much clearer about what was on the table and what wasn't. (I'm discovering that I'm better at talking about call stacks than I am at community management...) Thanks, Greg
@jdblischak - I'm in agreement with Greg. There's no need to apologize, there is always tension in determining the proper course for Software Carpentry materials, and I think you've been a great help these past few months.
Thanks @ethanwhite for that link. That is very helpful. I feel like we -- or at least I -- needed this discussion to sort of think through and rediscover a lot of those points for myself. I actually feel the skeleton of the May NC R bootcamps is fairly close in spirit to what a beginner R boot camp should be. I can't revisit this today but am happy to lay out a proposed outline (bullet points) for a beginner bootcamp. Maybe later today or over the weekend. It'll be an evolution of what @jdblischak put at the very beginning of this issue.
I better understand why so many think version control must be in and can get on board with that.
I do have serious misgivings about formal unit testing in a beginner R boot camp. I think that may truly be an issue where the different nature of R and R users suggests we need to implement SWC principles differently than Python. I think it's intermediate.
+1 for no need to apologize @jdblischak. The work you're doing is awesome. Keep it up.
@jennybc I'm glad it was helpful. I think having more discussion of the scope of a beginner camp is great and appreciate all of the conversation in this thread. @gvwilson's been providing some guidance through the material he's developing, but I think everyone's still wrapping their head around this. For example, I personally think that discussing whether or not to include formal unit testing for beginners would be valuable (in fact Greg and I were just chatting about this this morning). I'd recommend starting a new issue with an appropriately broad title so that anyone who is interested will notice it.
Revision of @jdblischak's initial proposal at beginning of this issue.
I'm trying to sum up this whole thread. Could we close this issue and reopen discussion on a new issue with this as starting point? And plan actual sessions and lesson there?
Important historical background:
Assume typical learner currently works mostly in Excel or with a GUI stats program.
Day 1 (implement core SWC principles in the context of R):
... `data.frame` when appropriate.
`data =` and `subset =` arguments, `with()` and `within()` and the dangers of `attach()`.
Day 2 (...SWC...):
... `Makefile.R` but possibly with a proper `Makefile`)
... `knitr`)
The deal with RStudio: we will use it in bootcamps because, long-term, it does support sustainable workflows. Importantly, it practically eliminates many OS- and editor-related headaches. However, we will discourage over-reliance on, e.g., RStudio buttons and menus for mission-critical pieces of a workflow. Illustration: we might use RStudio buttons to get a certain product quickly (e.g. Knit HTML or Compile Notebook). Then, with the desired output in hand, learners are receptive when we show how to emulate this at the command line (e.g. `knit()` or `stitch()`).
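The Day 1 item above on `data =` and `subset =` arguments, `with()`, `within()` and `attach()` could be demonstrated along these lines. The data frame and variable names are made up for the sketch.

```r
# Hypothetical data.
df <- data.frame(
  x    = 1:6,
  y    = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2),
  keep = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

# data = and subset = : variables are looked up inside the data frame.
fit <- lm(y ~ x, data = df, subset = keep)

# with(): evaluate one expression using the columns as variables.
mean_ratio <- with(df, mean(y / x))

# within(): like with(), but returns a modified copy of the data frame.
df2 <- within(df, ratio <- y / x)

# A danger of attach(): an existing variable of the same name silently
# wins over the attached column.
x <- 100
attach(df)
masked <- x  # this is the earlier x (100), not df$x
detach(df)
```

The `attach()` example prints a masking message, which is easy for a beginner to miss and is a large part of why the other three approaches are preferred.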
Open loops:
... `plyr`? This could become a separate issue. Or instructors can create a lesson doing it their preferred way and use that. (@jennybc has teaching materials using both and, frankly, that head-to-head comparison triggered the big shift towards `plyr`!) Also see #91.
Settled questions (?):
... `lattice` vs. `ggplot2`.
What are we starting with?
Course materials from @jennybc cover many of these topics. A version got pulled into a SWC repo back in May/June, prematurely. Look here for material that is improved and easier to navigate:
@karthik has submitted a large pull request based on his recent Australian bootcamps.
See #91 for a standalone `plyr` lesson.
wow, @jennybc, blown away by your fantastic summary. :+1: that we close the issue and split everything up.
Small closing comment: `ggplot2` is extremely popular and more widely used than lattice. Many folks can't leave R for python because of ggplot2. Seems like a good tool to teach. I have a full 2-3 hour section in my notes.
@karthik Agree re: `ggplot2`. If we did tackle visualization in SWC, I would vote `ggplot2`. Why I even mention `lattice`: I left base graphics before `ggplot2` was an option, so I got really good at `lattice`. I'm working through the 5 stages of grief as I try to get to a similar level of competence with `ggplot2`.
I don't think we can trim anything out of Day 1. Therefore, I believe addressing visualization would mean cutting something from Day 2. Which brings us back to the question of emphasis on programming vs data analysis? I think I can make peace with either vision, but we have to pick one.
I'd be happy to see an outline that works `ggplot2` in ... what to cut, though?
Great summary and revision @jennybc!
Although unit testing is a core topic, its role in beginner bootcamps and/or beginner R bootcamps is unclear.
I'm personally comfortable with teaching informal testing instead of unit testing in novice bootcamps (e.g., create a simple test dataset and run the code on it to make sure that it gives the right answers), but this is something we should get @gvwilson's feedback on and perhaps have a more general (language agnostic) discussion about if necessary. To me the most important thing is that we should be making decisions about "beginner bootcamps", not "beginner R bootcamps". They are all SWC bootcamps. The programming component for some is taught in R, some in Python, just like in some cases we teach version control in Mercurial instead of in Git.
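A minimal sketch of the kind of informal testing described above: build a tiny dataset whose answer can be worked out by hand, run the analysis code on it, and stop if the answer is wrong. The function `group_means` and the dataset are hypothetical stand-ins for a learner's own code.

```r
# Hypothetical analysis function a learner might have written.
group_means <- function(data, value, group) {
  tapply(data[[value]], data[[group]], mean)
}

# A test dataset small enough to check by hand:
# site A has mean (1 + 3) / 2 = 2, site B has mean (10 + 20) / 2 = 15.
test_data <- data.frame(
  site = c("A", "A", "B", "B"),
  mass = c(1, 3, 10, 20)
)

result <- group_means(test_data, "mass", "site")

# Informal test: halt with an error if the known answers are not reproduced.
stopifnot(
  as.vector(result["A"]) == 2,
  as.vector(result["B"]) == 15
)
```

No testing framework is involved; the same checks could later be moved into formal unit tests at the intermediate level.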
Emphasis on analysis vs. programming. Some say programming is priority, following SWC core principles. Some (@jennybc, at least) feel beginners are more likely to do analysis first, then progress to programming, so perhaps the bootcamp should make some concessions to that.
I don't think focusing on analysis (using code) is at all in contrast to the core principles. Everything I typically teach is analysis based and @gvwilson's new novice Python lessons are largely along these lines. For example:
http://nbviewer.ipython.org/urls/raw.github.com/swcarpentry/bc/master/python/novice/01-numpy.ipynb http://nbviewer.ipython.org/urls/raw.github.com/swcarpentry/bc/master/python/novice/03-loop.ipynb
Importance of a storyline. @jennybc and @karthik, at least?, advocate for a unifying dataset and storyline for the bootcamp.
Couldn't agree more.
I'm +1 for closing this and moving to a new issue(s). If @jdblischak is OK with that I think that either he or @jennybc should go ahead and make the shift.
I am +1 on starting a new issue. @jennybc, you can start the new issue.
@jennybc @BernhardKonrad @jduckles I hope the R bootcamp at Miami went well. Do you think some of your materials could be used as the start for the `r/novice` lessons? Please let me know if you'd like assistance assembling a PR.
Should be closed soon by #396.
Construct a lesson on R for novices in `r/novice`.