swcarpentry / DEPRECATED-bc

DEPRECATED: This repository is now frozen - please see individual lesson repositories.

R: variable naming and other conventions? #615

Closed liz-is closed 10 years ago

liz-is commented 10 years ago

Does Software Carpentry have any existing conventions for R code?

e.g. variable naming: Google's style guide suggests lowerCamelCase, as does Bioconductor's guide. Hadley Wickham's Advanced R suggests using underscores.

Should there be guidelines for all the R lessons to follow?

jdblischak commented 10 years ago

Thanks for starting this discussion, @liz-is. I prefer underscores and am also ok with lowerCamelCase. We just need to avoid having periods in the names since this can be confusing for others used to writing in other languages.

Pragmatically, the novice R lessons use underscores for variable names because they were translated from the novice Python lessons.

I'll direct this question to the r-discuss list to see what others think.

benmarwick commented 10 years ago

Just saw this on r-discuss; my vote is for underscores, following @hadley. Snake case is one popular name for this format. I find it more readable than camel case and simpler to type.

Definitely no periods, as this causes problems when R and JavaScript are used together, for example with rCharts.

This is an active research topic, with eye-tracking studies and so on. Here are a few interesting pieces:

CamelCase vs underscores: Revisited, 2013: comment on the 2010 and 2009 studies mentioned below

The State of Naming Conventions in R, 2012: finds lowerCamelCase and period.separated dominant in R packages, but "What is most important, however, is to keep a consistent naming convention style within your code base"

An Eye Tracking Study on camelCase and under_score Identifier Styles, 2010: "no difference in accuracy between the two styles, subjects recognize identifiers in the underscore style more quickly"

To CamelCase or Under_score, 2009: "camel casing leads to higher accuracy"

jhollist commented 10 years ago

+1 on no periods.

I have used lowerCamelCase for a long time, but also like snake case. My code is a mess right now because of my internal conflict.

If I were to vote, I think I would agree with Ben and vote for snake case.

stephenturner commented 10 years ago

I don't have a strong opinion either way, but +1 in favor of consistency and no periods. I typically use snake_case in my variable names.

cboettig commented 10 years ago

Yeah, this is a tricky issue because all conventions are in widespread use. I generally go with the @hadley style guide, but there's no escaping things like data.frame() and read.csv().

Periods also appear in more subversive ways: when R coerces character strings that contain spaces into names, it replaces the spaces with periods. Consider:

library("reshape2")
library("ggplot2")
# Cast carat against cut, counting the diamonds in each cell
acast(diamonds, carat ~ cut, fun.aggregate = length)

vs

# data.frame() renames the "Very Good" column to Very.Good
data.frame(acast(diamonds, carat ~ cut, fun.aggregate = length))

The coercion (which can take place automatically in functions) coerces the spaces to periods, not underscores. The main problem with periods in my mind is with R functions, because they are also used for the S3 class system. Consistency is nice, but I think we need to teach R users that multiple conventions are just something we have to live with.
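A minimal base-R sketch of both points, for readers without reshape2 installed. The `counts` matrix is a hypothetical stand-in for the acast() result, and the class name "thing" is made up for illustration:

```r
# Hypothetical stand-in for the acast() result: a matrix whose column names
# include one with a space, like the "Very Good" level of cut.
counts <- matrix(1:4, nrow = 2,
                 dimnames = list(NULL, c("Fair", "Very Good")))

# data.frame() passes names through make.names() by default
# (check.names = TRUE), which turns the space into a period.
counts_df <- data.frame(counts)
names(counts_df)
make.names("Very Good")

# Periods also drive S3 dispatch: a function named generic.class is treated
# as the method of that generic for that class.
summary.thing <- function(object, ...) "summary method for class 'thing'"
x <- structure(list(), class = "thing")
summary(x)
```

Because of the dispatch rule, a function name with periods can be mistaken for (or accidentally become) an S3 method, which is the ambiguity described above.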

Though camelCase vs snake_case seems to be the most popular debate regarding style, there are lots of other related issues that either conflict between style guides or are not mentioned by most style guides. A few others that folks might weigh in on:

And a few things that are just sloppy, e.g. library(ggplot2) vs library("ggplot2"): both work, but the former is careless towards object class.

gavinsimpson commented 10 years ago

I positively hate underscores, primarily because to get one I need to type two _ in my editor (Emacs+ESS) because <- is bound to _.

If one is mixing functions from the Hadleyverse with object names it might be helpful to differentiate the two things, with snake_case for the functions and lowerCamel for objects.

To be frank, I'd favour a focus on consistency within a lesson over prescribing a particular coding style. Coding style is very personal; good coding practice is not and we want to teach that and not be overly prescriptive on the coding style. Also, I'm more likely to contribute lessons if I do it in my style (but consistently so). If someone wants to edit that material later to get a prescribed SWC style then that'd be fine with me, but just don't expect me to do it (and by "me" I mean other people who might contribute but who use a different style to the one SWC R might adopt).

If I were to be blunt, we have better things to be doing with our time than worrying about whether we use lowerCamel or snake_case :-)

gavinsimpson commented 10 years ago

@cboettig You can stop the name checking in data.frame() via an argument IIRC - having it replace a space with a . is helpful so you don't need to quote variable names to access them.
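For reference, the argument in question is `check.names`. A short sketch of the trade-off, using a made-up two-column matrix rather than the diamonds data:

```r
# Hypothetical input with a space in one column name.
counts <- matrix(1:4, nrow = 2,
                 dimnames = list(NULL, c("Fair", "Very Good")))

checked   <- data.frame(counts)                       # default: check.names = TRUE
unchecked <- data.frame(counts, check.names = FALSE)  # names kept as given

names(checked)
names(unchecked)

# With the period, the column can be accessed without quoting.
checked$Very.Good

# With the original name, backticks or a character string are needed.
unchecked$`Very Good`
unchecked[["Very Good"]]
```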

I don't see a reason to go exclusively with $ over [[ or vice-versa. Both have their uses; $ perhaps more in interactive use and [[ more in programming within functions. Recall R has dual-use/purpose.

Subsetting by variable name, index, or logical are all valuable skills to learn and come in useful in a range of circumstances. I would not want to enforce the use of one approach but rather to use whatever is natural in a given circumstance.
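As a rough illustration of these points (the `samples` data frame and the `column_mean()` helper are invented for the example, not taken from any lesson):

```r
samples <- data.frame(id = 1:4, weight = c(2.5, 3.1, 4.8, 1.9))

# Interactive use: $ is quick to type when the column name is known.
weights_interactive <- samples$weight

# Programming use: [[ accepts a column name held in a variable,
# which $ does not.
column_mean <- function(data, column) mean(data[[column]])
mean_weight <- column_mean(samples, "weight")

# The same column selected by name and by index.
by_name  <- samples[, "weight"]
by_index <- samples[, 2]

# Rows selected with a logical vector.
heavy <- samples[samples$weight > 3, ]
```

Each form returns the same underlying data; which one reads most naturally depends on whether the name or position is fixed or computed.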

You should rarely need to call a function with the :: operator in general usage unless you have masking of functions on the search path. Let's try to restrict usage of :: to those situations only in any lesson material.
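A small sketch of the masking situation where :: earns its keep. Here the masking function is defined by hand purely for illustration; in practice it would usually come from an attached package:

```r
# A user-defined function that happens to share its name with stats::median().
median <- function(x) "masked by a user-defined function"

# The unqualified call finds the masking definition first.
masked_result <- median(c(1, 5, 9))

# The qualified call always reaches the version in the stats package.
explicit_result <- stats::median(c(1, 5, 9))

# Once the masking object is removed, the plain call works again.
rm(median)
restored_result <- median(c(1, 5, 9))
```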

It should always be library("pkgname") and not library(pkgname) in material that might be around for a long time. The quoted version will always be around but I know at least one R Core member who has suggested the possibility of the unquoted version going away at some future point in time if R Core were to play with the loading of packages code - i.e. to get rid of library() because it causes confusion over what is an R library vs what is an R package.
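One practical difference today, sketched with packages that ship with R: the unquoted form treats whatever is typed as the package name, so it cannot be used with a name stored in a variable. The variable name `pkg_to_load` is just an example:

```r
# Quoted form: the argument is an ordinary character string.
library("stats")

# A package name held in a variable needs character.only = TRUE.
pkg_to_load <- "utils"
library(pkg_to_load, character.only = TRUE)

# Without it, library() looks for a package literally called "pkg_to_load"
# and fails; the error message is captured here instead of stopping the script.
unquoted_attempt <- tryCatch(
  library(pkg_to_load),
  error = function(e) conditionMessage(e)
)
unquoted_attempt
```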

naupaka commented 10 years ago

Somewhere, someone should be sure to mention avoiding the use of attach(), in general starting a script with something like rm(list = ls(all.names = TRUE)) to clean out variables, and getting in the practice of not saving .RData between sessions unless absolutely necessary. Not exactly naming conventions, but conventions nonetheless.

hadley commented 10 years ago

I think using rm(list = ls(all.names = TRUE)) is a bad idea, because it only partially resets a session. You're better off teaching people to never save .RData, and to regularly restart R (which is particularly easy in RStudio).
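A sketch of what "partially" can mean in practice. To keep the example safe to run, it clears a separate environment (`workspace`, a stand-in for the global environment) rather than the real workspace; the same reasoning applies to rm(list = ls(all.names = TRUE)) at the top level:

```r
# Stand-in for the global environment, holding one object.
workspace <- new.env()
assign("x", 42, envir = workspace)

# Session state that lives outside the workspace.
old_options <- options(digits = 4)
library("tools")

# Remove every object, including hidden ones.
rm(list = ls(workspace, all.names = TRUE), envir = workspace)

objects_left   <- length(ls(workspace, all.names = TRUE))  # objects are gone
digits_after   <- getOption("digits")                      # option is still changed
tools_attached <- "package:tools" %in% search()            # package is still attached

# Put the option back so the example leaves no trace.
options(old_options)
```

Changed options and attached packages survive the rm() call, whereas a fresh R session starts without them.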

benmarwick commented 10 years ago

@naupaka I would never have heard of attach if it wasn't for all the warnings against using it, so I wonder if we just don't mention it at all it might never enter the user's vocabulary?

rm everything at the top of a script is good advice but only under limited specific circumstances. If we've got a sequence of scripts and each has that at the top obviously that's going to be problematic, as it would be in a package. As a general rule I'd vote against it, and vote instead for more general advice about being alert to the contents of the working environment.

Agree 100% with not saving the .RData except under special circumstances.

naupaka commented 10 years ago

@hadley and @benmarwick Those are both good points about rm(list). I guess I thought it was simpler than something like this in a script. But perhaps unnecessary to get into, esp. with beginners. And very easy to do in the GUI, as @hadley mentioned.

gavinsimpson commented 10 years ago

No, clearing the workspace at the head of a script is the work of the Devil and should be banished. Learning to start a clean session with a new R instance is good practice that we should promote.

Also, serializing objects to .rds files is the way to handle computationally demanding objects which can be loaded by a script, but the script should be written in such a way that it is easy to regenerate those objects so you can test that the entire script can be rerun.
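One possible shape for such a script, as a sketch only. The file name and the `slow_computation()` function are hypothetical placeholders for a genuinely expensive step:

```r
# Where the cached object lives; tempdir() is used so the example is harmless.
cache_file <- file.path(tempdir(), "slow_result.rds")

# Placeholder for a computationally demanding step.
slow_computation <- function() sum(sqrt(seq_len(1e5)))

if (file.exists(cache_file)) {
  # Reuse the serialized object from an earlier run.
  result <- readRDS(cache_file)
} else {
  # Regenerate from scratch and cache it for next time.
  result <- slow_computation()
  saveRDS(result, cache_file)
}

# To test that the whole script still reruns, delete the cache first:
# unlink(cache_file)
```

Keeping the regeneration code in the same script means deleting the .rds file is all it takes to check a full rerun.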

chendaniely commented 10 years ago

In RStudio, is this the same as clicking clear and checking off 'include hidden objects'?

gavinsimpson commented 10 years ago

@chendaniely I presume so, but we tend to avoid teaching using the RStudio GUI so students learn to use R in instances where you don't have RStudio running such as a compute cluster.

cboettig commented 10 years ago

@gavinsimpson That's a great point; besides it's always worth students knowing how to script the behaviour they want rather than click for it, since it provides a clear and reproducible record of what they've done.

That said, it has gotten remarkably easy to run RStudio-servers on cloud clusters, reducing the barriers to entry there. Admittedly that's not true for most university or HPC clusters, but that is probably becoming less and less important.

Carl Boettiger UC Santa Cruz http://carlboettiger.info/