moj-analytical-services / rpackage_training

Making and developing R packages

Developing R packages

Pre-course requirements

This training is designed as an introduction to making and developing R packages, which are important to reproducible ways of working. You should first have completed the following training sessions (or reached an equivalent standard):

You must also have completed steps 1 to 4 and 6 of the MoJ Analytical Platform quickstart guide, making sure you can access RStudio from the control panel. If you have any issues, please post them in the appropriate Slack channel (either #ask-operations-engineering or #intro_r).

You will also require access to the S3 bucket alpha-r-training. You can post an access request to the #intro_r Slack channel.

Using two screens (e.g. your laptop plus a monitor) during the training session might be useful to enable you to watch the session on one and code on the other.

Recordings of these sessions can be viewed via links provided in the Analytical Platform and related tools training section on R training. If you have any access problems please contact aidan.mews@justice.gov.uk.

Section 1 - Introduction

This training is based on Hadley Wickham's book R Packages. Its goal is to teach you how to make and develop packages. R packages are not difficult to make and have several benefits:

These benefits together improve the reliability, reusability and shareability of code, and give you the confidence to update it without the fear of unknowingly breaking something.

This training is designed with exercises to enable you to develop a package. Your example package will include functions to fetch data from S3 and build a simple tabulation like those found in many publication tables, MI-packs, etc. A preview of the data we will be using is given below:

Rows: 6,000,000
Columns: 3
$ year  <int> 2004, 2005, 2004, 2002, 2002, 2000, 2002, 2000, 2005, 2001, 2003, 2003, 200…
$ month <chr> "December", "June", "September", "August", "April", "May", "April", "March"…
$ crime <chr> "Crime C", "Crime A", "Crime B", "Crime B", "Crime C", "Crime C", "Crime C"…

Section 2 - Package scope and naming

Before you start developing a package there are two questions to consider: "what will your package contain?" (the scope) and "what will you call it?" (the name).

The scope

You could put every function you ever write into one package, but this would quickly become difficult to maintain, especially if it resulted in a large number of dependencies. Instead, it is better to group your functions into thematically similar activities. For example, the {forcats} package contains functions for working with categorical data and factors, and the {stringr} package contains functions for working with strings and regular expressions.

Some packages may contain generalized functions (on a particular theme) that have a broad spectrum of applications e.g. {psutils}. Others may contain very specialized functions that are only used as part of one process e.g. {pssf}.

It is also worth considering whether your functions might fit within an existing package rather than starting a new one.

The name

Possibly the hardest part of creating a package is choosing a name for it. This should:

You can read more in the R Packages section Name your package.

Exercises

Section 3 - Package structure

R packages have a standard structure. The following components must be included (either because they are essential package components or because they are essential parts of the development and maintenance process).

Some packages may have other components; a few common ones that you may want to use are listed below:

Exercises

Section 4 - Create the package

Essential development practice for R packages

The default branch of an R package GitHub repo must be reserved for working releases of the package. Always make your changes on a different branch, then merge to the default branch for each release. You should also add protections to your main branch to shield it from accidental pushes. (We will skip this step in the training for speed but it is very important for production code.)

Exercises

Tools to help with package development

There are several R packages that contain tools to help ensure your package is set up in the correct format and aid development by automating common tasks. The two we will be using today are {devtools} and {usethis}.

Exercises

The following {usethis} function will structure your current working directory as an R package (you will need to overwrite what is already there when prompted):

usethis::create_package(getwd())

This will create several of the files and folders discussed at the start of the package structure section.

Exercises

Section 5 - Copyright and licensing

Licensing code is essential as it sets out how others can use it. You can read more about licensing here. The work-product of civil servants falls under Crown copyright and usually requires an Open Government Licence, but for open source software we have the option to use other open source licences. The MIT licence is the MoJ preferred choice and can be added to your package using:

usethis::use_mit_license("Crown Copyright (Ministry of Justice)")

This will add two text files to the top level of your project, LICENSE and LICENSE.md. It will also update the relevant section in the DESCRIPTION file and update the .Rbuildignore file.

Exercises

Section 6 - Package metadata

The DESCRIPTION file contains important metadata about the package; it is a text file that you can open and edit in RStudio. An example of an amended DESCRIPTION file is provided here. The formatting is important. Each line consists of a field name and a value, separated by a colon. Where values span multiple lines, they need to be indented. In particular:

Exercises

Authors

Package authors are supplied as a vector of persons i.e. c(person(...), person(...)). In addition to a given name, family name, and an email, each person should have a role specified. More information can be found by running ?person but the four most common roles are detailed below (multiple roles should be combined with c()):
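
As a rough illustration of the structure described above, the sketch below builds an authors vector with person() from the {utils} package that ships with R. The names, email addresses and choice of roles are placeholders invented for this example; in a real package the c(person(...), person(...)) expression goes in the Authors@R field of the DESCRIPTION file.

```r
# Hypothetical authors: names, emails and roles are placeholders
authors <- c(
  person(
    given = "Jane",
    family = "Doe",
    email = "jane.doe@example.com",
    role = c("aut", "cre")  # multiple roles are combined with c()
  ),
  person(
    given = "John",
    family = "Smith",
    email = "john.smith@example.com",
    role = "ctb"
  )
)

# Printing the vector is a quick way to check it is well formed
print(authors)
```

Running the expression in the console before pasting it into DESCRIPTION can catch a missing bracket or comma early.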

Exercises

Semantic Versioning

Semantic Versioning is a versioning scheme which uses a major.minor.patch system to communicate what type of changes occur between versions. A "major change" will increment the major number, a "minor change" will increment the minor number and a "patch change" will increment the patch number. The type of version change is linked to the type of code changes you make. The full Semantic Versioning specification is worth reading and learning (especially points 2-8) but a basic summary for now is:
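
To make the numbering rule concrete, here is a minimal sketch in base R. The helper bump_version() is hypothetical (it is not part of {usethis} or any other package) and assumes a plain three-part version string. It follows the Semantic Versioning rule that incrementing a number resets the numbers to its right to zero.

```r
# Hypothetical helper: increments one part of a "major.minor.patch" string
bump_version <- function(version, type = c("patch", "minor", "major")) {
  type <- match.arg(type)

  # Split "1.4.2" into its three integer parts
  parts <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
  stopifnot(length(parts) == 3, !anyNA(parts))
  names(parts) <- c("major", "minor", "patch")

  # Increment the requested part and reset everything to its right
  parts[type] <- parts[type] + 1L
  if (type == "major") parts[c("minor", "patch")] <- 0L
  if (type == "minor") parts["patch"] <- 0L

  paste(parts, collapse = ".")
}

bump_version("1.4.2", "patch")  # "1.4.3"
bump_version("1.4.2", "minor")  # "1.5.0"
bump_version("1.4.2", "major")  # "2.0.0"

# Versions compare numerically part by part, not as text
package_version("1.10.0") > package_version("1.9.3")  # TRUE
```

In practice you would edit the Version field of the DESCRIPTION file rather than call a helper like this; {usethis} provides use_version() for that purpose.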

Exercises

Dependency management

The Imports and Suggests fields are used for dependency management for your package and development processes. You want to be as permissive as possible when specifying minimum or maximum versions of packages listed in Imports and Suggests, to increase the compatibility of your package with others. If you know that your code relies on functionality added in a particular version of a package, you must specify that minimum version; otherwise, don't specify one.

Any package that your code relies upon for core functionality should be listed in the "Imports" section. The "Suggests" section is for packages that are used in the development process or give extra optional functionality.

There is a tool in {usethis} for adding packages to the DESCRIPTION file. It will check if the package is installed before adding it so is useful for catching spelling mistakes!

By default, packages are added as Imports e.g. to add {dplyr} as an import: usethis::use_package("dplyr"). You can use the type argument to add them to Suggests instead e.g. to add {devtools} as a suggested package: usethis::use_package("devtools", type = "Suggests").
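
The sketch below shows roughly what the resulting fields look like and how a minimum version is written. The package name examplepkg and the version numbers are assumptions made for illustration. In particular, the minimum version for {forcats} assumes that fct() first appeared in forcats 1.0.0, which you should confirm against that package's changelog. The fragment is written to a temporary file and parsed with base R's read.dcf(), so nothing in your real project is touched.

```r
# Illustrative DESCRIPTION fragment; the package name and version numbers
# are assumptions made for this example only
desc_lines <- c(
  "Package: examplepkg",
  "Version: 0.1.0",
  "Imports:",
  "    arrow,",
  "    dplyr,",
  "    forcats (>= 1.0.0)",
  "Suggests:",
  "    testthat (>= 3.0.0)"
)
desc_file <- tempfile("DESCRIPTION")
writeLines(desc_lines, desc_file)

# read.dcf() parses the "field: value" format used by DESCRIPTION files
desc <- read.dcf(desc_file, fields = c("Imports", "Suggests"))

# Split the comma separated Imports value into one entry per package
imports <- trimws(strsplit(desc[, "Imports"], ",")[[1]])
print(imports)
```

Only {forcats} carries a version constraint here, in line with the advice to specify a minimum version only when the code is known to need it. usethis::use_package() also accepts a min_version argument if you prefer not to edit the file by hand.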

Exercises

Section 7 - Checking your package

Packages require that the right files and the right information are in the right places. A small mistake can prevent the package from functioning as intended. Many package features can be checked using the function devtools::check(). It runs a series of checks that examine (among other things) package structure, metadata, code structure, and documentation. More information about the individual checks is available here. Any issues that are identified will be labeled as "errors", "warnings" or "notes". Errors and warnings must be fixed. Occasionally it is acceptable to leave a "note" but usually these should be fixed too.

Exercises

Section 8 - Adding functions

A training course on writing functions in R is available here but for speed in this course we will skip over function development.

We are going to include two functions in our example package, one that builds a tabulation of data and another that fetches some data from S3 before building the tabulation. The functions omit things like data validation and error handling that you should include in real production code.

In a package, functions must be saved in .R files in the R/ folder. You can have multiple functions in a single script (suggestions about how to organise your functions are available here) but we will use one function per file for this exercise.

wrangle data function

wrangle_data <- function(df, pub_year) {
  df |>
    dplyr::filter(.data$year == pub_year) |>
    dplyr::mutate(
      month_fct = forcats::fct(.data$month, month.name)
    ) |>
    dplyr::group_by(.data$crime, .data$month_fct, .drop = FALSE) |>
    dplyr::count() |>
    tidyr::pivot_wider(names_from = "month_fct", values_from = "n", values_fill = 0)
}

assemble crime data function

assemble_crime_data <- function(path, year) {
  path |> 
    arrow::read_parquet() |> 
    wrangle_data(pub_year = year)
}

Exercises

Section 9 - Making functions work in a package

While the format of code inside a package is very similar to "normal R code", it is vital to properly reference functions that you are using from other packages. You must never use library(), require() or source() calls inside a package; instead you should use package::function() syntax. More information on why this is the case is available here. In some instances it is better to import a function from the relevant namespace (more on this later).

Because packages like {dplyr} use "tidy evaluation" we need to make some changes to the code when including it within packages (more information here). In the wrangle data function we get around the use of unquoted column names by including the .data "pronoun". For example, outside of a package context iris |> dplyr::filter(Species == "setosa") is valid syntax and Species will be interpreted as the name of a column in the data frame iris via "tidy evaluation". In a package context, however, the package checks will interpret it as an object name (and probably the name of an object without a definition). This will cause the checks on the package to raise a "no visible binding for global variable" note.
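
A minimal sketch of the pattern, assuming {dplyr} is installed and R is version 4.1.0 or later (for the native pipe). The function keep_species() is hypothetical and only exists to show the .data pronoun and the package::function() syntax together.

```r
# Hypothetical package function: keeps the rows for a single species.
# .data$Species refers explicitly to the column, so the package checks do
# not mistake Species for an undefined object.
keep_species <- function(df, species) {
  df |>
    dplyr::filter(.data$Species == species)
}

setosa <- keep_species(iris, "setosa")
nrow(setosa)  # 50
```

Inside a real package the .data pronoun itself also needs to be imported from {rlang}, otherwise the checks will complain about .data instead.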

Exercises

Section 10 - Documenting functions

Documentation is really important so users know how to use the package, and package managers and developers can quickly get up to speed. It should therefore be embedded within the package in such a way that it is easily available to all users.

We can include "roxygen comments" with our functions to provide documentation that can be automatically knitted into help files. Roxygen comments are denoted by hash and a single quotation mark followed by a space #'. Comments can then be labeled with a tag which is a string starting with @ e.g. @title would be the tag for the help file's title.

A set of roxygen comments for the assemble crime data function is given below.

#' @title Assemble Crime Data
#' @description Fetch crime data from a specified path and tabulate ready for publication.
#' @param path A string. The path or S3 URI to the parquet file containing the data.
#' @param year The year of the publication.
#' @export
#' @examples
#' assemble_crime_data(
#'   "s3://alpha-r-training/r-package-training/synthetic-crime-data.parquet", 
#'   year = 2000
#' )

As a minimum, for each function exported for users of your package you should include:

There is a special tag @export which indicates that the function should be added to the NAMESPACE of your package. This means it will be accessible to users of your package and using the @export tag will also trigger the generation of a help file. Any functions that are for internal package use only should not be tagged with @export.

There is another special tag @importFrom that can be used to import functions and methods etc from the NAMESPACE of other packages. The use of this should be reserved for things like operators and functions that are always nested inside other functions (for example aes() from {ggplot2}) and pronouns where the use of :: syntax is either invalid or makes the code hard to read.

Once we have added our roxygen comments we can use devtools::document() to generate the help files. These will be saved in the man/ folder. You will also see that the function is now listed in the NAMESPACE file. (Note that devtools::document() is also run as part of devtools::check()).
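
The tags discussed in this section can be combined as in the sketch below. The function filter_year() is a hypothetical helper, not one of the two functions in the example package, and the @return tag is an addition not covered above. It assumes {dplyr} and {rlang} are both listed under Imports in DESCRIPTION (for example via usethis::use_package("rlang")).

```r
#' @title Filter to Publication Year
#' @description Keep only the rows of a data frame for one publication year.
#' @param df A data frame with a numeric column called year.
#' @param pub_year The year of the publication.
#' @return A data frame containing only the rows where year equals pub_year.
#' @importFrom rlang .data
#' @export
#' @examples
#' filter_year(data.frame(year = 2000:2001), pub_year = 2000)
filter_year <- function(df, pub_year) {
  # .data is imported above because pkg::.data syntax would be hard to read
  dplyr::filter(df, .data$year == pub_year)
}
```

After running devtools::document() you would expect to see an importFrom(rlang, .data) line and an export(filter_year) line in the NAMESPACE file.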

Exercises

Section 11 - Testing your code

You have written (in this case been given) some code but how do you know that it is actually doing what you intended? You might use devtools::load_all() to load your package and then try the functions to see if they give the expected output. This works but every time you need to test your functions (e.g. if any changes are made to your code base or if there are changes in your dependencies) you will need to re-create the inputs to the function and re-write the code. This quickly makes testing a very time-consuming process.

We can instead formalize this testing process (and automate the running of it) using the {testthat} R package. When we run the function usethis::use_testthat() it will:

Exercises

The structure of a test

The {testthat} tests contain two elements, the name of the test and one or more expectations. A test will fail if at least one expectation is not met or if there is an unexpected error.

You can have multiple tests for a single function so the name of the test is important for identifying which test failed (when it fails). The test name should therefore contain information about what you are testing i.e. the function name and what specific behavior you are testing. Each test should always have a unique name within a package to avoid wasting time debugging the wrong test!

Expectations are a series of functions that check for the presence or absence of specific values or properties in function outputs or their side effects.
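
As a small sketch of this structure, the tests below exercise two base R functions rather than anything from the example package, so they can be run in the console with only {testthat} installed. The testthat:: prefix is used here because the code is run outside a package; files in tests/testthat/ are run with {testthat} already loaded, so the prefix is normally omitted there.

```r
# A test has a descriptive name and one or more expectations
testthat::test_that("toupper() converts lower case letters to upper case", {
  result <- toupper(c("a", "b"))

  testthat::expect_type(result, "character")
  testthat::expect_length(result, 2)
  testthat::expect_equal(result, c("A", "B"))
})

# Expectations can also check side effects such as errors
testthat::test_that("log() fails with a character input", {
  testthat::expect_error(log("a"))
})
```

If any one of the three expectations in the first test is not met, the whole test is reported as a failure under the name given in the first argument.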

Exercises

Tests for the assemble crime data function

Some tests for the assemble crime data function are given below. We are checking that when a valid path (and year) are supplied we get a data frame and no warnings are generated. We are not worried about testing the content of the data frame here as that is controlled by the wrangle data function. We will cover that with the tests for that function.

Due to the absence of bespoke error handling/ input checking in the function, and time constraints when running the training, we are largely ignoring the year argument in the assemble crime data function. Furthermore, for "real" production code it would probably be safer/simpler to have separate functions for "getting a data frame into R" and "doing stuff to the data frame" rather than just relying on one that combines both elements. Structuring it like this for the training is useful for conveying particular points in the training.

Additionally, we are checking that when an invalid path is used we get an error.

test_that("assemble_crime_data works with valid path", {

  uri <- "s3://alpha-r-training/r-package-training/synthetic-crime-data.parquet"

  assemble_crime_data(uri, year = 2000) |> expect_s3_class("data.frame")
  assemble_crime_data(uri, year = 2001) |> expect_no_warning()

})

test_that("assemble_crime_data fails with invalid path", {

  assemble_crime_data("foo", year = 2001) |> expect_error()

})

Exercises

Test coverage

Test coverage is a metric that can be useful in assessing the adequacy of tests. The {covr} package can be used to examine test coverage. It builds the package and runs the tests in a modified environment counting how many times each line of package code is run by the tests. You should aim to have every line covered by tests but don't rely on coverage alone when assessing the adequacy of tests. When we run the test coverage of our package we will get 100% (the wrangle data function is called by the assemble crime data function) but we are not (yet) properly testing the intended behaviour of the wrangle data function.

Test coverage can be particularly useful where you have if() statements in your code to help you ensure that all the various conditions that can arise have been covered. For example, if the assemble crime data function did something special when the year was set to 2002 those lines would not be covered by our existing tests and this would be revealed by examining the test coverage.

if (year == 2002) {
  message("Happy 2002!")
}
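
The effect of an untested branch can be sketched on a single function with covr::function_coverage(). This is only an illustration: greet_year() is a hypothetical stand-in for a function with a special case, and the sketch assumes {covr} is installed and that source references are kept (hence the options() call). For a whole package you would normally use covr::package_coverage() instead.

```r
# Assumption: source references are needed for covr to trace lines
options(keep.source = TRUE)

# Hypothetical function with a branch that only runs for one year
greet_year <- function(year) {
  if (year == 2002) {
    message("Happy 2002!")
  }
  year
}

# Coverage when the function is only exercised with a year other than 2002
partial <- covr::function_coverage(greet_year, code = greet_year(2001))

# Coverage when both conditions are exercised
full <- covr::function_coverage(
  greet_year,
  code = {
    greet_year(2001)
    greet_year(2002)
  }
)

# Compare the percentage of lines run in each case
covr::percent_coverage(partial)
covr::percent_coverage(full)
```

The first result should be lower than the second because the message() line is never reached when the function is only called with 2001.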

Exercises

Tests for the wrangle data function

In order to properly test the wrangle data function we probably want to ensure that the following expectations are met in the output data frame:

We probably don't want to use "real" data when writing tests. By checking specific things like values, number of rows, number of columns etc in the outputs there is a risk of revealing unpublished information. Real data may also be subject to change (potentially causing tests to fail incorrectly). Additionally, real data is likely to be quite large (slowing down the testing process) and contain a lot of noise i.e. elements that are not relevant for testing a specific function.

We will use the following data frame to test the wrangle data function. It contains only the three columns used by the function and two rows. The values for crime are dummy values i.e. not the same as the values used in the "real" data but that difference is not important for testing whether the function works.

testing_df <- data.frame(
  crime = c("foo", "bar"),
  year = 2000:2001,
  month = "January"
)

Exercises

Section 12 - Add a README

The README acts as a "quick-start guide" for users of your package. It should include:

You can use a simple markdown README or dynamically generate one using R Markdown, which lets you embed code chunks and provides several other extensions useful for writing technical reports. The latter may be preferable if you want to demonstrate what some of your code does. You can add a README with either usethis::use_readme_md() or usethis::use_readme_rmd() depending on the type you want.

Exercises

Section 13 - Add a NEWS file

The NEWS markdown file functions as a change-log for your package. It must be updated every time you make changes to your package.

Exercises

Section 14 - Managing releases of your package

Congratulations, you have successfully produced a working package in R! Open a pull request and merge it to the main branch.

GitHub Releases are a great way to manage the versions of your package. Every time you release an updated version of your package, include a GitHub release. This way if you ever need an older version of your package it is very easy to install using the GitHub Release Tag.

Exercises

Section 15 - Installing and using your package

To install a package from a public GitHub repo using renv you just need the owner and the repo:

renv::install("moj-analytical-services/mojchart")

The easiest way to install a package from an internal or private GitHub repo is with the following (SSH URL) syntax:

renv::install("git@github.com:moj-analytical-services/mojchart.git")

Note: If your package has any Imports that are from internal or private repos you will need to also use this syntax in the Remotes field. For example the {psutils} package has {verify} as an import which is another internal package available from this SSH remote.

With renv >= 0.15.0 you can also include @ref on the end of the URL where the "ref" is a branch name, commit or GitHub tag e.g.

renv::install("git@github.com:moj-analytical-services/verify.git@v0.0.19")

Exercises

Section 16 - Maintenance cycle

You have released your package and have received some feedback from a user - "it would be better if the year was also included in the date column headings".

Exercises

Annex

A1 Continuous integration

Continuous integration is about automating software workflows. An automated workflow can be set up so that when you or someone else pushes changes to GitHub, tests are run to ascertain whether there are any problems. These checks should include the unit tests you've developed and also the R CMD check tests (over 50 individual checks for common problems) carried out when you run devtools::check().

Before setting up this automation, you should have fixed any problems identified by running R CMD check - see Section 7 - Checking your package.

To set up continuous integration using GitHub Actions:

usethis::use_github_action("check-standard")

This automatically puts a status badge in your README.

You can read further about automating checks in the Automated Checking chapter of R Packages.

A2 Solution to testing wrangle data function exercises

test_that("wrangle_data works", {

  testing_df <- data.frame(
    crime = c("foo", "bar"),
    year = 2000:2001,
    month = "January"
  )

  out_df_1 <- testing_df |> wrangle_data(pub_year = 2000)

  out_df_1 |> ncol() |> expect_equal(13)
  out_df_1 |> names() |> tail(12) |> expect_equal(month.name)
  out_df_1$crime |> expect_equal("foo")

  out_df_2 <- testing_df |> wrangle_data(pub_year = 2001)

  out_df_2$crime |> expect_equal("bar")

})

A3 Installing packages on the Analytical Platform prior to R 4.4.0

Most R packages you install come from CRAN (The Comprehensive R Archive Network) which stores them on a series of mirrored servers that act as package repositories. For R versions prior to 4.4.0, the Analytical Platform is set up to use a fixed R package repository by default. Depending on the version of R on the Analytical Platform you are using, this may be fairly old. Run options("repos") in the console and look at the date at the end of the URL to see which version you are using. To access the latest versions of packages you can use the following to update where you install from (this will reset when R is restarted).

options(repos = "https://packagemanager.rstudio.com/all/__linux__/focal/latest")