swcarpentry / DEPRECATED-bc

DEPRECATED: This repository is now frozen - please see individual lesson repositories.

Edits to make novice R lesson 01 more similar to novice Python lesson 01. #639

Closed jdblischak closed 9 years ago

jdblischak commented 9 years ago

I have made edits so that the first R lesson more closely follows the Python lesson. It covers all the same topics save two:

Here's how you can quickly compare the two rendered files:

# clone bc repo
git clone git@github.com:swcarpentry/bc.git
# create and switch to new branch
git checkout -b review-pr
# add remote
git remote add jdb git@github.com:jdblischak/bc.git
# pull changes from my branch
git pull jdb 01-starting-with-data
# build site locally
make site

Then compare the rendered files _site/novice/r/01-starting-with-data.html and _site/novice/python/01-numpy.html.

cboettig commented 9 years ago

@jdblischak Mostly looks good to me, though as I'm not familiar with the intro python lessons you might want to ignore my comments...

Overall very nice tutorial though! Hope I didn't get too bogged down in the subjective and the minutiae

jdblischak commented 9 years ago

Thank you very much for the thorough review, @cboettig. All your suggestions are extremely reasonable. Most of the points you raise, e.g. matrix versus data frame, were choices I made to keep the lesson in line with the novice Python lesson. To give you some context, back in March all the SWC instructors interested in developing R lessons met to discuss our strategy. You can read the full summary, but the tl;dr version is that we chose to mirror the Python lessons instead of creating new R-specific lessons. This obviously has its pros and cons. The main pro is that it keeps the focus of the lessons on good programming practices in general, which can also be seen as its greatest con, since it largely ignores wonderful tools like those in the Hadleyverse.

So the goal of this set of lessons is not so much "Learn how to use R most effectively", it is "Teach novices how to program a computer to analyze data (using R)." The hope is that these best practices, e.g. modularity, don't repeat yourself, defensive programming, etc. will translate no matter what language they end up using.

cboettig commented 9 years ago

@jdblischak that makes perfect sense, thanks for the context. Of the options you list, I agree that translating lessons that you know work makes more sense than the alternatives, as does emphasizing such language-agnostic practices.

I think there is still an open question as to what is considered a translation though. A matrix is the 'base-class' of matlab, but I would argue that it is not the 'base class' of R, and that the translation would be more complete by introducing the language's workhorse data structure rather than the one that best corresponds to the data structure in python. (After all, it's because you are most concerned with language agnostic concepts that you shouldn't care if the translation is the most bitwise equivalent to the python lesson, but rather that it is the best expression of those bigger concepts in whatever language it is expressed. To go to the logical extreme, no one would suggest that calling python from R was the best 'translation' into R, though it is clearly the most consistent).

I'm not sure that I would suggest using the Hadleyverse tools either, but that is probably more a question of course content than of language. Teaching that data headings matter, that code should be self documenting rather than referring to numeric column indices, etc, may just not be part of the syllabus at this stage, so I see why you would ignore it.

Nevertheless, I don't think using either such practices or explicit tools from the Hadleyverse is inconsistent with the lesson goals. I'm not experienced enough to know which vocabulary (e.g. base R or Hadley's) is most effective at teaching students those bigger goals. Perhaps it makes no difference.

jdblischak commented 9 years ago

You mention classes data.frame and matrix without giving much of a sense of what these object classes are or how they differ (which is, I believe, a frequent source of confusion).

This is by design. We have found that listing all the data types and data structures does not work well with complete novices. They have no context for why these details are important. This is why these novice lessons were originally designed (in Python) to be data-driven. We want to demonstrate why writing code to automate your analysis is such a better way to do science than using spreadsheet software. Observing how to perform a data analysis will give them the context to better understand data types and structures when it comes time to learn more advanced material.

Personally I would not introduce the matrix object at this time, and teach people to always work with data.frames until there is a compelling reason not to...

Converting the data to a matrix allows us to perform max, min, mean, and sd on the entire dataset. Also it allows us to use the heatmap function.

I would likewise focus on subsetting data.frames. Perhaps the matrix class is a more natural analogue to the python lesson, and it certainly is to other common languages, but I don't think that's a compelling reason to focus on matrices over data.frames in R.

But the example used is numeric data. Thus I think it does make sense to use matrix. The only reason it was a data frame is because that is what is returned by read.table, which is a lot simpler than matrix(scan(.... I totally agree that the data frame is the most important data structure in R, but the data set that was designed for these novice lessons does not facilitate using data frames. For example, subsetting with integers is exactly the same for a matrix and a data frame, so this will transfer well when they do use a data frame.
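
To make that claim concrete, here is a minimal sketch (using a small toy stand-in rather than the lesson's actual inflammation file) showing that integer subsetting reads identically for both structures:

```r
# Toy stand-in for the inflammation data: 4 patients (rows) x 5 days (columns)
dat_df  <- data.frame(matrix(1:20, nrow = 4))
dat_mat <- as.matrix(dat_df)

dat_df[2, 3]       # one value from the data frame
dat_mat[2, 3]      # identical syntax on the matrix
dat_df[1:2, 1:3]   # first two patients, first three days
dat_mat[1:2, 1:3]  # same call on the matrix
```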

I'd spend a moment to explain what an R function is (not how to write one, but the fact that arguments are named, that you must give them in the order named or better specify them as name-value pairs).

Good point. I'll think about a way to discuss this more in depth after introducing read.table.
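
One way such a note could go (sketched here with seq() rather than read.table, so the snippet runs without the lesson's data file): R matches arguments by position or by name, and named arguments may appear in any order.

```r
# Positional matching: arguments are taken in the order the function
# defines them (for seq(): from, to, by)
a <- seq(1, 10, 2)

# Named matching: the same call, but the order no longer matters
b <- seq(by = 2, to = 10, from = 1)

identical(a, b)  # TRUE
```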

I would spend this time getting users to think about having well structured data instead, and then how to work with functions across that data in a meaningful way. The current example is very abstract...

The data is supposed to be 60 patients as the rows and 40 days as the columns. When discussing apply for example, we connect taking the mean along "dimension 1" with taking the mean inflammation value per patient. And similarly, "dimension 2" is days. This is explained at the beginning of the lesson. Perhaps we could reiterate what the data is later in the lesson when it is being read in?
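
A short sketch of that framing, on a toy matrix standing in for the real inflammation file:

```r
# Toy stand-in: 60 patients (rows) x 40 days (columns) of inflammation counts
set.seed(1)
dat <- matrix(rpois(60 * 40, lambda = 5), nrow = 60, ncol = 40)

avg_per_patient <- apply(dat, 1, mean)  # dimension 1 = rows = patients
avg_per_day     <- apply(dat, 2, mean)  # dimension 2 = columns = days

length(avg_per_patient)  # 60: one mean per patient
length(avg_per_day)      # 40: one mean per day
```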

A heatmap feels like an un-intuitively complicated function for a first plot, but what do I know.

Totally agree. This is for the sake of consistency. Visualizing data as a heatmap returns in a later lesson, so I couldn't remove this.

Maybe this is crazy, but I wonder if it would be worth teaching certain idioms from the hadleyverse from the beginning?

I agree that using tools from the Hadleyverse does enhance the R experience. For now our plan is to use mainly base R in the main set of lessons, and then have supplementary lessons on useful R libraries, e.g. see PR #621 on ggplot2.

Overall very nice tutorial though! Hope I didn't get too bogged down in the subjective and the minutiae

Thanks! This lesson has been the result of the cumulative effort of multiple dedicated SWC instructors. We appreciate the feedback.

cboettig commented 9 years ago

Thanks for the detailed replies.

You clearly have a well developed and well tested lesson here, and I'm glad to learn from your experiences.

Just for the sake of clarity, I do not think being numeric justifies the use of a matrix; there's nothing wrong with a data frame in which all columns are numeric, which is why read.table returns a data frame in the first place. (Besides, a date class may be more appropriate.) I agree that you don't want to get into all the object classes here, so it would be better to avoid the examples that explicitly refer to changing the class, right? (Or at least add a disclaimer that you will explain later why you've switched from data.frame to matrix and what the differences are.)

Right, I realize the data in the example doesn't have variables as columns, but treats column ids as one implicit variable and row ids as another. While it has no impact on learning programming skills per se, I believe students would be better served in the long run if they weren't taught bad habits in working with data (I realize this is probably an issue in the python lesson as well, so maybe it should be left for another time).


jdblischak commented 9 years ago

OK. I added a few more notes on how to call a function. There are more details on default parameters and when you can or cannot pass an object without naming it in the next lesson, so I didn't want to delve into too much detail.

I do not think being numeric justifies the use of a matrix; there's nothing wrong with a data frame in which all columns are numeric, which is why read.table returns a data frame in the first place.

I believe students would be better served in the long run if they weren't taught bad habits in working with data (I realize this is probably an issue in the python lesson as well, so maybe it should be left for another time).

I understand your concerns, but I don't see how I can appease them without creating a whole new lesson. Another way to represent the data is a data frame with the following columns:

day  patient  inflammation
  1        1            10
  1        2             5
  1        3             0
  1        4            16
...      ...           ...

But then the sections on slicing rows and columns, calculating summary statistics along the rows or columns, plotting a heatmap, etc. would all have to be completely different.

And if we leave the data in the same dimensions but as a data frame, then the calls to heatmap, mean, and sd would all have to be changed to heatmap(as.matrix(dat)), mean(as.matrix(dat)), and sd(as.matrix(dat)). I prefer my solution of converting it to a matrix instead of using nested function calls.
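
The two alternatives weighed here, sketched on a toy data frame standing in for the lesson's data:

```r
# Toy stand-in for the data frame that read.csv/read.table would return
dat <- data.frame(matrix(1:20, nrow = 4))

# Alternative 1: nested calls, converting at every use
m1 <- mean(as.matrix(dat))
s1 <- sd(as.matrix(dat))

# Alternative 2 (preferred above): convert once, then reuse the matrix
mat <- as.matrix(dat)
m2 <- mean(mat)
s2 <- sd(mat)
```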

cboettig commented 9 years ago

Okay, sounds good to me.



jdblischak commented 9 years ago

@sarahsupp, @dbarneche, @gavinsimpson, @naupaka, and @jainsley. I'd be interested in your feedback.

gavinsimpson commented 9 years ago

I'm going to push back on the use of a matrix here. Focussing on the matrix just because we can run min(), max(), etc. on it seems a weak argument to me. I echo @cboettig's comments here; users are going to encounter data frames much more frequently than matrices.

This is one of those tensions between Python and R and following the Python lessons too closely potentially at least makes for clumsy or non-R-like conventions being followed just for the sake of parity with the Python lessons.

Also, and I thought this was the case, you can use min() and max() on a data.frame, just not mean(), which seems bone-headedly stupid:

> df <- data.frame(A = 1:10, B = rnorm(10))
> min(df)
[1] -1.768958
> max(df)
[1] 10
> mean(df)
[1] NA
Warning message:
In mean.default(df) : argument is not numeric or logical: returning NA

I would suggest mean(as.matrix(df)) could be a useful compromise until such a time as R becomes less stoopid in this regard?

Better would be to do:

mat <- as.matrix(dat)
mean(mat)
sd(mat)

I'm not sure you can argue that doing heatmap(as.matrix(dat)) is inconvenient as you chose to add this plot type in your revisions ;-)

gavinsimpson commented 9 years ago

I'm going to have to pull the changes locally as the diff in github in my Chrome is losing spaces and possibly other characters. If some of the line-notes above are due to this, please accept my apologies and ignore them.

gavinsimpson commented 9 years ago

General Comments:

  1. There is inconsistency in the way functions are referred to; with or without parentheses: read.csv vs read.csv(). I'm pro read.csv() as it helps to reinforce a distinction between functions and objects (which we call variables here).
  2. I'm not sold on the heatmap idea. If you aren't used to these and the concept of hot colours, you need to have your brain dialed in to the fact that red is cold, yellow is hotter, and white is scorching. Would a matplot(dat) not be more generally applicable here (perhaps with type = "l" to get lines, one per patient)? I certainly don't see the need to have two heatmaps in here; if we're set on retaining them, perhaps keep just the second one?
  3. Forcing the conversion from a data frame to a matrix is something I'm not keen on, as mentioned above. In my opinion it is better to work around the infelicity in the case of mean() on a data frame than force working with a matrix.

Minor comments:

  1. I really don't like the terminology used in some places: "parameters" for "arguments" and "variables" for "objects". Which is more confusing for novice students; being exposed to a slightly odd term from programming parlance or learning something more familiar ("parameters") which they have to unlearn when they delve deeper and start making use of online materials from outside the SWC sphere? Likewise "variables", which is confusing with the same term in statistics, which is predominantly why people use R even if this is not the motivation behind the SWC lessons or bootcamps.
  2. Where you explain apply() and the MARGIN argument, it would be good to link the use of MARGIN = 1 in the text with the actual call. Hence use apply(dat, MARGIN = 1, mean) say. That makes it clear what the 1 or 2 are. Then say that commonly the name is dropped, hence apply(dat, 2, mean). Might help to reinforce what this is and why you called it MARGIN earlier.
  3. We need to be careful to not propagate the behaviour of having lengthy variable/object names. There is a compromise between being terse or verbose and clarity of code. avg_patient_inflammation for example exceeds acceptable limits for me - this shouldn't be an exercise in typing.

Overall though @jdblischak this is looking really nice - especially the additions of the images and the links to other SWC content for the terminology.

jdblischak commented 9 years ago

Thanks to everyone for the recent reviews. Since @gavinsimpson nicely summarized many of the main points we have been discussing, I'll respond to his comments below:

There is inconsistency in the way functions are referred to; with or without parentheses: read.csv vs read.csv(). I'm pro read.csv() as it helps to reinforce a distinction between functions and objects (which we call variables here).

I removed the parentheses from the inline references to functions because the Python lessons do not use them. I have no strong opinion on the matter, but for the sake of consistency I think such a change should be proposed more widely (and voted on at a lab meeting if necessary).

I'm not sold on the heatmap idea. If you aren't used to these and the concept of hot colours, you need to have your brain dialed in to the fact that red is cold, yellow hotter, white, scorching. Would a matplot(dat) not be more generally applicable here (perhaps with type = "l" to get lines, one per patient)? I certainly don't see the need to have two heatmaps in here; if we're set on retaining them, perhaps keep just the second one?

The heatmaps are there for foreshadowing. In lesson 04, rblocks are used to create a heatmap. Instead of running heatmap and suppressing the dendrograms, we could alternatively run image on the transposed (t) data.
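
A sketch of the two options being weighed (on a toy matrix; pdf(NULL) just keeps the plots off-screen so the snippet runs anywhere):

```r
# Toy stand-in for the inflammation matrix: 60 patients x 40 days
set.seed(1)
mat <- matrix(rpois(60 * 40, lambda = 5), nrow = 60)

pdf(NULL)  # draw to a null device; no window or file is needed

# Option 1: heatmap() with the row/column dendrograms suppressed
heatmap(mat, Rowv = NA, Colv = NA)

# Option 2: image() on the transposed data, which draws a plain colour grid
image(t(mat))

dev.off()
```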

Forcing the conversion from a data frame to a matrix is something I'm not keen on, as mentioned above. In my opinion it is better to work around the infelicity in the case of mean() on a data frame than force working with a matrix.

users are going to encounter data frames much more frequently than matrices.

I agree that they will be more likely to encounter data frames, but all the functions that are performed on the matrix in this lesson work equally well on a data frame. But I have now been out-voted, so I will switch back to a data frame. Which of these options would everyone prefer?

mean(as.matrix(dat))
# or
mat <- as.matrix(dat)
mean(mat)
# or something else

I really don't like the terminology used in some places. "parameters" for "arguments" and "variables' for "objects". Which is more confusing for novice students; being exposed to a slightly odd term from programming parlance or learning something more familiar ("parameters") which they have to unlearn when the delve deeper and start making use of online materials from outside the SWC sphere? Likewise "variables" which is confusing with the same term in statistics, which is predominantly why people use R even if this is not the motivation behind the SWC lessons or bootcamps.

Adding the "parameters" terminology was just another attempt at consistency. However, we do have the SWC blessing to use "arguments" instead of "parameters" (lab meeting summary and Issue #511).

I am not aware of any SWC discussion on "variable" vs "object", nor could I find any on bc or the discuss list. Anecdotally, when I first started learning programming I used the term variable because my mental model was that it was like a variable in algebra. However, I think the main SWC discuss list would be the place to have this conversation since it is not specific to R.

Where you explain apply() and the MARGIN argument, it would be good to link the use of MARGIN = 1 in the text with the actual call. Hence use apply(dat, MARGIN = 1, mean) say. That makes it clear what the 1 or 2 are. Then say that commonly the name is dropped, hence apply(dat, 2, mean). Might help to reinforce what this is and why you called it MARGIN earlier.

I struggled with this decision because the discussion of naming params/args is in the next lesson. If we name MARGIN, then I think we would need to name all of them, and apply(X = dat, MARGIN = 1, FUN = mean) just looks so bad. I'll do it this verbose way and then add a comment as you recommended.

We need to be careful to not propagate the behaviour of having lengthy variable/object names. There is a compromise between being terse or verbose and clarity of code. avg_patient_inflammation for example exceeds acceptable limits for me - this shouldn't be an exercise in typing.

I'm open to suggestions.

Overall though @jdblischak this is looking really nice - especially the additions of the images and the links to other SWC content for the terminology.

Thanks!

gavinsimpson commented 9 years ago

I removed the parentheses from the inline references to functions because the Python lessons do not use them. I have no strong opinion on the matter, but for the sake of consistency I think such a change should be proposed more widely (and voted on at a lab meeting if necessary).

Sounds good to me. I'm happy to go with the consensus as it is only a personal preference to differentiate objects from functions in the narrative.

The heatmaps are there for foreshadowing. In lesson 04, rblocks are used to create a heatmap. Instead of running heatmap and suppressing the dendrograms, we could alternatively run image on the transposed (t) data.

I hadn't realised that (I just had a look). Whilst I see your point about foreshadowing, it seems to me that introducing something complicated like a heatmap with dendrograms etc. in lesson one doesn't gain us much when you could just explain what a heatmap is in lesson 4. At this stage (lesson 1) we're really just looking at the data over time, and matplot(dat, type = "l") (possibly matplot(t(dat), type = "l")?) gives a reasonable overview of those data coded by patient. Wonder what other people think?
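
For reference, a hedged sketch of that suggestion (toy data again; the transpose makes each patient one line across the days, since matplot() draws one line per column):

```r
# Toy stand-in: 60 patients (rows) x 40 days (columns)
set.seed(1)
dat <- matrix(rpois(60 * 40, lambda = 5), nrow = 60)

pdf(NULL)  # null device so the snippet runs headless

# Transpose so each patient becomes a column, hence one line per patient
matplot(t(dat), type = "l", xlab = "Day", ylab = "Inflammation")

dev.off()
```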

But I have now been out-voted, so I will switch back to a data frame. Which of these options would everyone prefer?

As you need to do sd(dat) too, I'd prefer

mat <- as.matrix(dat)
mean(mat)
sd(mat)

Good news re parameters vs arguments. Do you want to raise the objects vs variables part on SWC discuss?

I'll do it this verbose way and then add a comment as you recommended

I don't think you have to go this far. Perhaps do it the other way round; use the simple versions without named arguments but add a note that this is as if you'd called apply(dat, MARGIN = 1, mean) just to reinforce where the 1 comes from? Would that be a more acceptable compromise?

jdblischak commented 9 years ago

I think I addressed most of the main concerns in this latest round of commits. Please let me know if there is anything else you would like me to address. Also, please vote +1 when you approve this PR for merging.

Gavin wrote:

Good news re parameters vs arguments. Do you want to raise the objects vs variables part on SWC discuss?

The variable v. object distinction does not bother me that much, so I'll let you make the case on the discuss list. Either way, it would be an easy enough change to make in a future PR if we decide to change the terminology, so I'd prefer if it did not inhibit this PR from being merged.

So one last change I would like to propose. I find the use of both a data frame and a matrix to be quite awkward, and it repeatedly comes up in future lessons. To resolve this in lesson 02 in PR #675, I simply changed all instances of taking the mean or standard deviation of an entire matrix to just a vector. Basically, I pass in one column (i.e. one day of inflammation data) instead of the entire matrix. I much prefer this approach. If others agree, I'll make similar changes to this lesson.

jdblischak commented 9 years ago

I had written:

So one last change I would like to propose. I find the use of both a data frame and a matrix to be quite awkward, and it repeatedly comes up in future lessons. To resolve this in lesson 02 in PR #675, I simply changed all instances of taking the mean or standard deviation of an entire matrix to just a vector. Basically, I pass in one column (i.e. one day of inflammation data) instead of the entire matrix. I much prefer this approach. If others agree, I'll make similar changes to this lesson.

I went ahead and implemented this. Removing the duplicate set of data stored as a matrix by instead performing mathematical operations on the columns of the data frame makes this lesson and the following ones go much smoother.
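
Sketched on a toy data frame, the implemented approach pulls a single column (one day's values) out as a plain numeric vector, so no matrix conversion is needed:

```r
# Toy stand-in for the inflammation data frame: 4 patients x 5 days
dat <- data.frame(matrix(1:20, nrow = 4))

day2 <- dat[, 2]  # one day's inflammation values, as a numeric vector
mean(day2)
sd(day2)
max(day2)
```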

We've collectively put a lot of work into this PR. Could I please get some +1's for merging or let me know if there are additional points you would like to see addressed? @cboettig @gavinsimpson @chendaniely @jainsley Thanks!

gvwilson commented 9 years ago

+1 from me - very pleased by the way you've pulled this together.

chendaniely commented 9 years ago

Removing the duplicate set of data stored as a matrix by instead performing mathematical operations on the columns of the data frame makes this lesson and the following ones go much smoother.

That sounds good. I haven't looked at lesson 2 yet, but do we eventually use the as.matrix or as.dataframe or whatever?

jdblischak commented 9 years ago

do we eventually use the as.matrix or as.dataframe or whatever?

I avoided it in lesson 02. I know for sure it will come up in lesson 04 to make a heatmap. I'm still working on lesson 03, so I don't know for that one yet.

chendaniely commented 9 years ago

:+1:

gavinsimpson commented 9 years ago

Hi @jdblischak,

I also favour ditching the rather odd mathematical operation over what are essentially panel data. The current data structure is not an ideal one for this anyway.

So +1 from me too.

jdblischak commented 9 years ago

OK, this appears well supported. @gvwilson, please merge this when you get the chance.

Thanks to everyone for the helpful reviews!