rainersachs / HG_19plus-obsole

10/9/2019 See Edward's still useful style tips in issue#2
GNU General Public License v3.0

Style and general tips #3

Open eghuang opened 5 years ago

eghuang commented 5 years ago

30 May 2019: Copied from NASAmouseHG.

0. GitHub & RStudio setup and issues

Most common issues with GitHub and RStudio can be resolved with a quick search, but sometimes we are unaware that our current setup has a problem. We should at least be familiar with, and able to use, the following Git, GitHub, and RStudio functions.

0.1 Branches

Refer to this subsection of Wickham's R Packages for information on basic Git and branching.

1. Style guidelines

We will follow Hadley Wickham's style guide for our scripts. It accurately reflects the style conventions of the R community and addresses most of the style issues in our scripts. Wickham's guidelines are derived from the Google style guide, which is more detailed and should be consulted for topics that Wickham's guide does not cover.

1.1. Reminders

2. General programming tips

Wickham's Advanced R is strongly recommended as a general resource.

2.1. Environment Management

To clear the global environment (erase all variables, functions, data, etc.) use:

rm(list = ls())
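One caveat worth knowing: ls() skips objects whose names begin with a dot unless all.names = TRUE is passed, so the command above leaves those objects in place. The sketch below illustrates this in a throwaway environment (the names scratch, x, and .hidden are hypothetical) so that running it does not wipe your real workspace.

```r
# Use a scratch environment so the example does not touch the global environment.
scratch <- new.env()
assign("x", 1, envir = scratch)
assign(".hidden", 2, envir = scratch)

# Same idea as rm(list = ls()), but pointed at the scratch environment.
rm(list = ls(envir = scratch), envir = scratch)
remaining_default <- ls(envir = scratch, all.names = TRUE)  # ".hidden" is still there

# Passing all.names = TRUE also removes the dot-prefixed object.
rm(list = ls(envir = scratch, all.names = TRUE), envir = scratch)
remaining_all <- ls(envir = scratch, all.names = TRUE)      # nothing left
```

For the global environment, the equivalent call would be rm(list = ls(all.names = TRUE)). Attached packages are not affected either way.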

2.2. Debugging with breakpoints

2.2.1. browser() breakpoints

To better understand why your code raises an error, add browser() on a new line above the code you suspect is buggy. When you run the code, execution pauses at the line with browser() and the debugger drops you into the current environment (see footnote 1) of the script. Everything that has been created or changed by the code up to that line is available to view and call in the console. With browser(), you can easily check or test objects created in function environments. Hint: use str() to inspect the type and structure of an object.

For example, you can use browser() to check how the value of a variable changes in a loop. Consider the following function:

loop <- function(x) {
  for (i in seq(100)) {
    browser()
    x <- x + 1
  }
  return(x)
}

Suppose you want to examine the behavior of loop. If you run loop without the browser() call on its third line, you simply get the output of loop. With the browser() call, you can closely examine the environment of loop: each time browser() is evaluated, execution pauses and you are dropped into the debugger. On the first pause, checking the value of x shows that it is 0.

> loop(0)
Called from: loop(0)
Browse[1]> x
[1] 0

If we let the debugger continue to the next browser() call (press the continue button or run c) and check the value of x again, we see that x is 1 in the second iteration of the loop.

Browse[1]> c
Called from: loop(0)
Browse[1]> x
[1] 1

We can continue to run the debugger to see how x changes as loop runs.

Browse[1]> c
Called from: loop(0)
Browse[1]> x
[1] 2

This loop is deliberately simple, but with more complicated functions or loops, browser() can provide much more insight.
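Stopping on every iteration gets tedious in a long loop. One possible variation, sketched here, pauses only on a chosen iteration. The function name loop_conditional and the parameter stop_at are hypothetical, and the interactive() guard is an added assumption so that the function still runs unattended (for example under Rscript).

```r
# Variant of the loop example that pauses only on one chosen iteration.
loop_conditional <- function(x, stop_at = 50) {
  for (i in seq(100)) {
    # Enter the debugger only in an interactive session and only when i == stop_at.
    if (interactive() && i == stop_at) browser()
    x <- x + 1
  }
  return(x)
}

result <- loop_conditional(0)
```

In an interactive session, the debugger opens once, at the 50th iteration, where x and i can be inspected as before.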

2.2.2. Editor breakpoints

RStudio allows users to set breakpoints without changing the existing code by clicking directly to the left of a line of code. A red dot should appear. If a red circle appears instead, then the breakpoint is deferred. This can happen for a number of reasons, but saving the file or running source(), either in the console or from the editor, should change the circle to a dot. Editor breakpoints, unlike browser() breakpoints, can only be used with source(). They are generally less versatile than browser().

2.2.3. Debugger console

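Running or sourcing code with active breakpoints will halt execution at the first breakpoint encountered. At this point, the console accepts several debugger commands: n (run the next statement), s (step into a function call), f (finish the current loop or function), c (continue until the next breakpoint), and Q (quit the debugger).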

The console can also run most R code, which is useful for checking the values of variables or writing test functions within the debugger.

2.2.4. Additional resources

RStudio documentation for debugging resources can be found here.

2.3. Locating source code

Run getAnywhere(name), where name is the name of the function whose source you want to see. However, it is usually more useful to step into the source code with breakpoints and the debugger, especially for complicated functions.
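As a minimal sketch, the lookup might be used as follows. The choice of sd from the stats package is only an illustration.

```r
# Look up every definition of "sd" that R can find, including ones
# in namespaces that are not exported or not attached.
found <- utils::getAnywhere("sd")

found$where        # the packages and namespaces where "sd" was found
found$objs[[1]]    # the first matching object; printing it shows its source
```

For functions that are visible on the search path, typing the function name without parentheses (for example sd) prints the same source.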

2.4. Reducing runtime

Sometimes we would like our programs to run faster. Here are various methods to locate and rewrite slow code to be more efficient.

2.4.1. Finding slow code

Use proc.time() if you suspect that a certain part of your code is abnormally slow. Calling proc.time() before and after your code allows you to find the actual runtime as the difference between the proc.time() calls. As a simple example:

> startTime <- proc.time()
> n <- 0
> for (i in seq(100, .01)) n <- n + i
> endTime <- proc.time()
> endTime - startTime

   user  system elapsed 
  0.005   0.001   0.041 

Note that the results are given in seconds; elapsed is the wall-clock time.
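A slightly shorter alternative, offered only as a sketch, is base R's system.time(), which wraps the two proc.time() calls and returns the difference directly. The variable name timing is hypothetical.

```r
# system.time() evaluates the expression and reports how long it took.
timing <- system.time({
  n <- 0
  for (i in seq(100)) n <- n + i
})

timing              # user, system, and elapsed time in seconds
timing["elapsed"]   # wall-clock time only
```

The expression is evaluated in the calling environment, so n is still available after the call.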

2.4.2. Writing faster code

Most of our inefficient code results from bad design. Make sure that your higher-order functions and algorithms are not making unnecessary calls and that you thoroughly understand what your code is doing. Try to preallocate vectors rather than growing them inside a loop, and compute values that are reused once rather than repeatedly.

See this StackOverflow post for further reading.
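To make the preallocation advice concrete, here is a small sketch comparing three ways of computing the same vector of squares. The function names grow, prealloc, and vectorized are hypothetical, and actual timings will vary by machine.

```r
# Slow pattern: the result vector is copied and extended on every iteration.
grow <- function(n) {
  out <- c()
  for (i in seq_len(n)) out <- c(out, i^2)
  out
}

# Better: allocate the full vector once, then fill it in place.
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

# Usually best in R: let a vectorized operation do the loop internally.
vectorized <- function(n) seq_len(n)^2

# Compare the timings; all three return the same values.
system.time(grow(1e4))
system.time(prealloc(1e4))
system.time(vectorized(1e4))
```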

2.5. Dataframe subsetting

See this link for basic examples.
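For quick reference, a few common subsetting patterns are sketched below. The data frame df and its columns dose and ion are made up for illustration and are not taken from our scripts.

```r
# Hypothetical example data.
df <- data.frame(
  dose = c(0.1, 0.2, 0.4, 0.8),
  ion = c("Fe", "Si", "Fe", "O"),
  stringsAsFactors = FALSE
)

fe_rows <- df[df$ion == "Fe", ]                     # rows matching a condition, all columns
high_dose <- subset(df, dose > 0.3, select = dose)  # rows by condition, chosen columns
doses <- df[["dose"]]                               # a single column as a plain vector
first_two <- df[1:2, c("dose", "ion")]              # rows by position, columns by name
```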

2.6. tryCatch

Useful for error handling. See this link for a great primer.
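A minimal sketch of the pattern follows. The wrapper name safe_log is hypothetical, and returning NA on failure is just one possible policy.

```r
# Return NA instead of stopping (or warning) when log() cannot handle its input.
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) {
      # e.g. log(-1) signals a warning ("NaNs produced")
      NA_real_
    },
    error = function(e) {
      # e.g. log("a") signals an error (non-numeric argument)
      NA_real_
    }
  )
}

safe_log(1)
safe_log(-1)
safe_log("a")
```

The handler functions receive the condition object (w or e), so conditionMessage(e) can be used to log what went wrong before returning a fallback value.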

3. Footnotes

1 An environment is essentially a space in which objects such as variables and functions are defined. The global environment is the default environment and the outermost environment we work in. Anything defined or loaded outside of a function call exists in the global environment. Each time a function is called, a new environment called a "frame" is opened. Objects created inside a function call, including other functions, are defined in the new frame. The environment in which the function was defined is the new frame's "parent environment". Objects defined in the parent environment can be used in its child frames, but an ordinary assignment (<-) in a child frame cannot redefine variables in the parent environment. The code below demonstrates what happens when one attempts to redefine a variable in the parent environment.

> a <- 1  # Not in a function call; defined in the global environment.
> foo <- function() {  # Creates a frame F1 inside the global environment. F1 can use anything defined in the global environment.
+   a <- a + 1  # Defines the new variable a inside F1, not the parent environment.
+   return(a)
+ }

> foo() 
[1] 2

> a #  Note that a is not changed in the global environment.
[1] 1

If another function is defined and called inside the first function body, then a second frame is created such that the parent environment of the second frame is the first frame. This implies that the second function has access to any objects created in the first frame or the global environment. Any further nested functions behave similarly.

a <- 1 #  a is defined in global environment
foo <- function() { #  Creates a frame F1 inside of the global environment.
  b <- a + 1 #  b is defined in F1. foo can use variables defined in the global environment.
  foobar <- function() { #  Creates a frame F2 inside of F1. 
    c <- a + b #  c is defined in F2. foobar can use variables in both F1 and the global environment.
    return(c)
  }
  return(foobar())
}

When the function call terminates, the frame is closed and all the objects defined within it are discarded. Only the output of the function call (the value passed to return() in that frame) is passed from the child frame to the parent environment. In the example above, b and c are discarded after calling foo. However, c is the output of foobar(), so a call to foo() returns the value of c, which is 3.
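These claims can be checked directly. The sketch below repeats the nested example, confirms that b does not survive in the global environment, and uses a second, hypothetical function outer_fn to confirm that the parent environment of the inner frame is the outer frame.

```r
a <- 1
foo <- function() {
  b <- a + 1
  foobar <- function() {
    c <- a + b
    return(c)
  }
  return(foobar())
}

result <- foo()  # 3
# b was defined only in foo's frame, so it is not in the global environment.
b_in_global <- exists("b", envir = globalenv(), inherits = FALSE)

outer_fn <- function() {
  outer_frame <- environment()                       # the frame of this call
  inner_fn <- function() parent.env(environment())   # parent of the inner frame
  identical(inner_fn(), outer_frame)
}
same_parent <- outer_fn()
```

Here b_in_global is FALSE and same_parent is TRUE, provided nothing else has defined b in your workspace.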

rainersachs commented 5 years ago

I think the correct style guide is Advanced R by Hadley Wickham at http://adv-r.had.co.nz/Style.html. The Google style guide I found by googling uses different conventions.

eghuang commented 5 years ago

I agree that Wickham's conventions are more appropriate for our purposes and more closely resemble our current code. However, the Google style guide covers a few topics not found in Wickham's guide, so I suggest we defer to Google's guide for anything not already in Wickham's guide. I have edited my post to include a link to Wickham's guide and instructions for when to use which guide.

10 Jun 2019: Added links to resources for Git, branching, and general R.

21 Jul 2019: Minor rewording, added links to dataframe subsetting and tryCatch.