This training is designed as an introduction to making and developing R packages, which are important to reproducible ways of working. You should first have completed the following training sessions (or reached an equivalent standard):
You must also have completed steps 1 to 4 and 6 of the MoJ Analytical Platform quickstart guide, making sure you can access RStudio from the control panel. If you have any issues, please post them in the appropriate Slack channel (either #ask-operations-engineering or #intro_r).
You will also require access to the S3 bucket alpha-r-training. You can post an access request to the #intro_r Slack channel.
Using two screens (e.g. your laptop plus a monitor) during the training session might be useful to enable you to watch the session on one and code on the other.
Recordings of these sessions can be viewed via links provided in the Analytical Platform and related tools training section on R training. If you have any access problems please contact aidan.mews@justice.gov.uk.
This training is based on Hadley Wickham's book R Packages. The goal of it is to teach you how to make and develop packages. R packages are not difficult to make and have several benefits:
These benefits together improve the reliability, reusability and sharability of code, and give you the confidence to update it without the fear of unknowingly breaking something.
This training is designed with exercises to enable you to develop a package. Your example package will include functions to fetch data from S3 and build a simple tabulation like those found in many publication tables, MI-packs etc. A preview of the data we will be using is given below:
Rows: 6,000,000
Columns: 3
$ year <int> 2004, 2005, 2004, 2002, 2002, 2000, 2002, 2000, 2005, 2001, 2003, 2003, 200…
$ month <chr> "December", "June", "September", "August", "April", "May", "April", "March"…
$ crime <chr> "Crime C", "Crime A", "Crime B", "Crime B", "Crime C", "Crime C", "Crime C"…
Before you start developing a package there are two questions to consider: "what will your package contain?" (the scope) and "what will you call it?" (the name).
You could put every function you ever write into one package but it is likely that this would quickly become difficult to maintain especially if this resulted in a large number of dependencies. Instead it is better to group your functions into thematically similar activities. For example the {forcats} package contains functions for working with categorical data and factors and the {stringr} package contains functions for working with strings and regular expressions.
Some packages may contain generalized functions (on a particular theme) that have a broad spectrum of applications e.g. {psutils}. Others may contain very specialized functions that are only used as part of one process e.g. {pssf}.
It is also worth considering whether your functions might fit within an existing package rather than starting a new one.
Possibly the hardest part of creating a package is choosing a name for it. This should:
You can read more in the R Packages section "Name your package".
R packages have a standard structure. The following components must be included (either because they are essential package components or because they are essential parts of the development and maintenance process).
Some packages may have other components, a few common ones that you may want to use are listed below:
The default branch of an R package GitHub repo must be reserved for working releases of the package.
Always make your changes on a different branch then merge to the default branch for each release.
You should also add protections to your main branch to shield it from accidental pushes. (We will skip this step in the training for speed but it is very important for production code.)
Create a branch called dev in RStudio where we will begin building the package.
There are several R packages that contain tools to help ensure your package is set up in the correct format and aid development by automating common tasks. The two we will be using today are {devtools} and {usethis}.
Using install.packages(), install the {devtools} and {usethis} packages. If you are using R < 4.4.0 on the AP please review appendix A3 first.
The following {usethis} function will structure your current working directory as an R package (you will need to overwrite what is already there when prompted):
usethis::create_package(getwd())
This will create several of the files and folders discussed at the start of the package structure section.
Run usethis::create_package(getwd()). You will be asked if you want to overwrite the existing .Rproj file. You do!
Licensing code is essential as it sets out how others can use it. You can read more about licensing here. The work product of civil servants falls under Crown copyright and usually requires an Open Government Licence but for open source software we have the option to use other open source licences. The MIT licence is the MoJ preferred choice and can be added to your package using:
usethis::use_mit_license("Crown Copyright (Ministry of Justice)")
This will add two text files to the top level of your project, LICENSE and LICENSE.md. It will also update the relevant section in the DESCRIPTION file and update the .Rbuildignore file.
The DESCRIPTION file contains important metadata about the package; it is a text file that you can open and edit in RStudio. An example of an amended DESCRIPTION file is provided here. The formatting is important. Each line consists of a field name and a value, separated by a colon. Where values span multiple lines, they need to be indented. In particular:
If you use the native pipe (|>) in your package you would need to specify R (>= 4.1.0) in the Depends field.
Package authors are supplied as a vector of persons i.e. c(person(...), person(...)). In addition to a given name, family name, and an email, each person should have a role specified. More information can be found by running ?person but the four most common roles are detailed below (multiple roles should be combined with c()):
person("Crown Copyright (Ministry of Justice)", role = "cph")
Semantic Versioning is a version control paradigm which uses a major.minor.patch system to communicate what type of changes occur between versions. A "major change" will increment the major number, a "minor change" will increment the minor number and a "patch change" will increment the patch number. The type of version change is linked to the type of code changes you make. The full Semantic Versioning specification is worth reading and learning (especially points 2-8) but a basic summary for now:
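In the Semantic Versioning specification, a patch release contains only backwards-compatible bug fixes, a minor release adds backwards-compatible functionality and a major release contains breaking changes. As a rough illustration, base R's package_version() parses version strings and compares them component by component; the version numbers below are made up for this sketch.

```r
# Hypothetical version numbers, for illustration only
current <- package_version("0.1.0")
patch   <- package_version("0.1.1")  # backwards-compatible bug fix
minor   <- package_version("0.2.0")  # new backwards-compatible functionality
major   <- package_version("1.0.0")  # breaking change

# Each comparison below evaluates to TRUE
patch > current
minor > patch
major > minor

# Versions are compared numerically by component, not as plain strings
package_version("0.10.0") > package_version("0.9.0")
```

Comparing parsed versions rather than raw strings avoids the trap where "0.10.0" sorts before "0.9.0" alphabetically.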
The Imports and Suggests fields are used to manage the dependencies of your package and its development process. You want to be as permissive as possible when specifying minimum or maximum versions of packages listed in Imports and Suggests, to increase the compatibility of your package with others. If you know that your code relies on functionality added in a particular version of a package you must specify that minimum version; otherwise don't specify one.
Any package that your code relies upon for core functionality should be listed in the "Imports" section. The "Suggests" section is for packages that are used in the development process or give extra optional functionality.
There is a tool in {usethis} for adding packages to the DESCRIPTION file. It will check if the package is installed before adding it so it is useful for catching spelling mistakes!
By default, packages are added as Imports e.g. to add {dplyr} as an import: usethis::use_package("dplyr"). You can use the type argument to add them to Suggests instead e.g. to add {devtools} as a suggested package: usethis::use_package("devtools", type = "Suggests").
usethis::use_package("R", type = "Depends", min_version = "4.1.0")
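Because packages listed in Suggests are not guaranteed to be installed for every user, code that relies on them usually checks first. A minimal sketch of that pattern using base R's requireNamespace() is given below; check_suggested() is a hypothetical helper name used for illustration and is not part of {usethis} or {devtools}.

```r
# Hypothetical helper: stop with a clear message if an optional
# (Suggests) package is not installed
check_suggested <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(
      "Package '", pkg, "' is required for this feature. Please install it.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# {stats} ships with R, so this check passes silently
check_suggested("stats")
```

A function offering optional functionality could call a check like this at the top of its body before using the suggested package with package::function() syntax.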
Packages require that the right files and the right information are in the right places. A small
mistake can prevent the package from functioning as intended. Many package features can be checked
using the function devtools::check(). It runs a series of checks that examine (among other things)
package structure, metadata, code structure, and documentation. More information about the
individual checks is available here. Any issues that are identified will be labeled as "errors",
"warnings" or "notes". Errors and warnings must be fixed. Occasionally it is acceptable to leave a
"note" but usually these should be fixed too.
Run devtools::check() - there should be no errors, warnings or notes.
A training course on writing functions in R is available here but for speed in this course we will skip over function development.
We are going to include two functions in our example package, one that builds a tabulation of data and another that fetches some data from S3 before building the tabulation. The functions omit things like data validation and error handling that you should include in real production code.
In a package, functions must be saved in .R files in the R/ folder. You can have multiple functions in a single script (suggestions about how to organise your functions are available here) but we will use one function per file for this exercise.
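# Tabulate the number of each crime by month for a single publication year
wrangle_data <- function(df, pub_year) {
  df |>
    dplyr::filter(.data$year == pub_year) |>
    dplyr::mutate(
      # Convert month to a factor so the columns appear in calendar order
      month_fct = forcats::fct(.data$month, month.name)
    ) |>
    # .drop = FALSE keeps months with no records so they appear as zero counts
    dplyr::group_by(.data$crime, .data$month_fct, .drop = FALSE) |>
    dplyr::count() |>
    tidyr::pivot_wider(names_from = "month_fct", values_from = "n", values_fill = 0)
}
# Read the parquet file at path and tabulate it for the given year
assemble_crime_data <- function(path, year) {
  path |>
    arrow::read_parquet() |>
    wrangle_data(pub_year = year)
}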
Run devtools::check() - you will get a warning about undeclared imports and a note about an "undefined global function or variable". We will deal with these in the next section.
While the format of code inside a package is very similar to "normal R code", it is vital to
properly reference functions that you are using from other packages. You must never use library(),
require() or source() calls inside a package; instead you should use package::function() syntax.
More information on why this is the case is available here. In some instances it is better to
import a function from the relevant namespace (more on this later).
Because packages like {dplyr} use "tidy evaluation" we need to make some changes to the code when
including it within packages (more information here). In the wrangle data function we get around
the use of unquoted column names by including the .data "pronoun". For example, outside of a
package context iris |> dplyr::filter(Species == "setosa") is valid syntax and Species will be
interpreted as the name of a column in the data frame iris via "tidy evaluation". In a package
context however, the package checks will treat it as an object name (and probably the name of an
object without a definition) and report an undefined global variable.
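As a minimal sketch of the .data pronoun inside a function (filter_species() is a hypothetical helper used only for illustration, and the example assumes {dplyr} is installed):

```r
# Hypothetical helper, for illustration only
filter_species <- function(df, species) {
  # .data$Species refers explicitly to the column in df;
  # species is the ordinary function argument
  df |>
    dplyr::filter(.data$Species == species)
}

setosa <- filter_species(iris, "setosa")
nrow(setosa)
```

Writing .data$Species makes it unambiguous that Species is a column rather than a free-standing object.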
Note the package::function() syntax in the functions.
Note the .data pronoun in the wrangle data function.
Run devtools::check() - you will still be getting the note about .data - we will deal with this in the next section.
Documentation is really important so users know how to use the package, and package managers and developers can quickly get up to speed. It should therefore be embedded within the package in such a way that it is easily available to all users.
We can include "roxygen comments" with our functions to provide documentation that can be
automatically knitted into help files. Roxygen comments are denoted by a hash and a single
quotation mark followed by a space (#' ). Comments can then be labeled with a tag which is a string
starting with @ e.g. @title would be the tag for the help file's title.
A set of roxygen comments for the assemble crime data function is given below.
#' @title Assemble Crime Data
#' @description Fetch crime data from a specified path and tabulate ready for publication.
#' @param path A string. The path or S3 URI to the parquet file containing the data.
#' @param year The year of the publication.
#' @export
#' @examples
#' assemble_crime_data(
#'   "s3://alpha-r-training/r-package-training/synthetic-crime-data.parquet",
#'   year = 2000
#' )
As a minimum, for each function exported for users of your package you should include:
@title - the title for the help file
@description - a description of what your function does
@param - one for each argument in your function (note that the name of the parameter comes after the tag, followed by another space before the text describing the parameter)
@examples - sufficient examples for users to get started with your function (most people will probably look at the examples before reading the text!)
There is a special tag @export which indicates that the function should be added to the NAMESPACE of your package. This means it will be accessible to users of your package and using the @export tag will also trigger the generation of a help file. Any functions that are for internal package use only should not be tagged with @export.
There is another special tag @importFrom that can be used to import functions and methods etc from
the NAMESPACE of other packages. The use of this should be reserved for things like operators and
functions that are always nested inside other functions (for example aes() from {ggplot2}) and
pronouns where the use of :: syntax is either invalid or makes the code hard to read.
Once we have added our roxygen comments we can use devtools::document() to generate the help
files. These will be saved in the man/ folder. You will also see that the function is now listed
in the NAMESPACE file. (Note that devtools::document() is also run as part of devtools::check().)
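By way of illustration, a set of roxygen comments for the wrangle data function might look something like the sketch below. The wording of the title, description and parameters is an assumption made for this example; the function body is the one given earlier, and running the block assumes {dplyr}, {forcats} and {tidyr} are installed.

```r
#' @title Wrangle Data
#' @description Tabulate the number of each crime by month for a single year.
#' @param df A data frame with the columns year, month and crime.
#' @param pub_year The year of the publication.
#' @importFrom dplyr .data
#' @export
#' @examples
#' df <- data.frame(crime = "foo", year = 2000L, month = "January")
#' wrangle_data(df, pub_year = 2000)
wrangle_data <- function(df, pub_year) {
  df |>
    dplyr::filter(.data$year == pub_year) |>
    dplyr::mutate(
      month_fct = forcats::fct(.data$month, month.name)
    ) |>
    dplyr::group_by(.data$crime, .data$month_fct, .drop = FALSE) |>
    dplyr::count() |>
    tidyr::pivot_wider(names_from = "month_fct", values_from = "n", values_fill = 0)
}
```

The @importFrom line is what tells roxygen to add an import of .data to the NAMESPACE when devtools::document() is run.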
Run devtools::document() - you will now see a file in man/ and a change to the NAMESPACE
Run devtools::load_all() followed by ?assemble_crime_data to view the help file generated from the roxygen comments
Add roxygen comments to the wrangle data function and run devtools::document() - you will see another file in man/ and another function added to the NAMESPACE
Add the following to the roxygen comments for the wrangle data function: #' @importFrom dplyr .data
Run devtools::document() - you will see a new line in your NAMESPACE file that makes dplyr's .data available for use in your package. This syntax should also be used for things like operators
Run devtools::check()
Commit and push the man/ files and the NAMESPACE file.
You have written (in this case been given) some code but how do you know that it is actually doing
what you intended? You might use devtools::load_all() to load your package and then try the
functions to see if they give the expected output. This works but every time you need to test your
functions (e.g. if any changes are made to your code base or if there are changes in your
dependencies) you will need to re-create the inputs to the function and re-write the code. This
quickly makes testing a very time consuming process.
We can instead formalize this testing process (and automate the running of it) using the {testthat}
R package. When we run the function usethis::use_testthat() it will:
Add testthat (>= 3.0.0) to the Suggests field in the DESCRIPTION file.
Create a tests/ folder, inside of which is a testthat/ folder, where your R test scripts should be placed, and a testthat.R which helps in automating the testing.
Run usethis::use_testthat() to set up the testing infrastructure.
With the function script open, run usethis::use_test(). This will open a new script which is saved in tests/testthat/. The script will have the same name as the function script but will have a test- prefix. An example test will be given.
The {testthat} tests contain two elements, the name of the test and one or more expectations. A test will fail if at least one expectation is not met or if there is an unexpected error.
You can have multiple tests for a single function so the name of the test is important for identifying which test failed (when it fails). The test name should therefore contain information about what you are testing i.e. the function name and what specific behavior you are testing. Each test should always have a unique name within a package to avoid wasting time debugging the wrong test!
Expectations are a series of functions that check for the presence or absence of specific values or properties in function outputs or their side effects.
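To make the two elements concrete, here is a minimal sketch of a pair of tests. add_one() is a hypothetical function invented for this example; the library() call is only needed when running the sketch outside a package's tests/ folder.

```r
library(testthat)

# Hypothetical function, for illustration only
add_one <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  x + 1
}

# The first argument is the test name, the braces hold the expectations
test_that("add_one increments numeric input", {
  expect_equal(add_one(1), 2)
  expect_type(add_one(1L), "double")
})

test_that("add_one errors on non-numeric input", {
  expect_error(add_one("a"), "must be numeric")
})
```

Each test name states the function and the behaviour being checked, so a failure points straight at what has gone wrong.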
Some tests for the assemble crime data function are given below. We are checking that when a valid path (and year) is supplied we get a data frame and no warnings are generated. We are not worried about testing the content of the data frame here as that is controlled by the wrangle data function. We will cover that with the tests for that function.
Due to the absence of bespoke error handling/input checking in the function, and time constraints
when running the training, we are largely ignoring the year argument in the assemble crime data
function. Furthermore, for "real" production code it would probably be safer/simpler to have
separate functions for "getting a data frame into R" and "doing stuff to the data frame" rather
than just relying on one that combines both elements. Structuring it like this for the training is
useful for conveying particular points in the training.
Additionally, we are checking that when an invalid path is used we get an error.
test_that("assemble_crime_data works with valid path", {
  uri <- "s3://alpha-r-training/r-package-training/synthetic-crime-data.parquet"
  assemble_crime_data(uri, year = 2000) |> expect_s3_class("data.frame")
  assemble_crime_data(uri, year = 2001) |> expect_no_warning()
})
test_that("assemble_crime_data fails with invalid path", {
  assemble_crime_data("foo", year = 2001) |> expect_error()
})
Run devtools::load_all().
Run devtools::test() - you will get feedback as the tests run about how many have failed, resulted in a warning, or passed.
Test coverage is a metric that can be useful in assessing the adequacy of tests. The {covr} package can be used to examine test coverage. It builds the package and runs the tests in a modified environment counting how many times each line of package code is run by the tests. You should aim to have every line covered by tests but don't rely on coverage alone when assessing the adequacy of tests. When we run the test coverage of our package we will get 100% (the wrangle data function is called by the assemble crime data function) but we are not (yet) properly testing the intended behaviour of the wrangle data function.
Test coverage can be particularly useful where you have if() statements in your code to help you
ensure that all the various conditions that can arise have been covered. For example, if the
assemble crime data function did something special when the year was set to 2002 those lines
would not be covered by our existing tests and this would be revealed by examining the test coverage.
if (year == 2002) {
  message("Happy 2002!")
}
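A test that exercises such a branch might look like the sketch below. announce_year() is a hypothetical stand-in for a function containing the if() statement above; expect_message() and expect_no_message() are {testthat} expectations (the latter assumes testthat >= 3.1.5).

```r
library(testthat)

# Hypothetical stand-in for a function with a conditional branch
announce_year <- function(year) {
  if (year == 2002) {
    message("Happy 2002!")
  }
  invisible(year)
}

test_that("announce_year signals a message only for 2002", {
  # Covers the lines inside the if() statement
  expect_message(announce_year(2002), "Happy 2002!")
  # Covers the path where the condition is FALSE
  expect_no_message(announce_year(2001))
})
```

With both expectations in place the lines inside the if() block are run by the tests and would show as covered.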
Run devtools::test_coverage() - the first time you run this you might be prompted to install the packages {covr} and {DT}.
In order to properly test the wrangle data function we probably want to ensure that the following expectations are met in the output data frame:
It has thirteen columns (one for crime and twelve for the months)
It filters on pub_year correctly
We probably don't want to use "real" data when writing tests. By checking specific things like values, number of rows, number of columns etc in the outputs there is a risk of revealing unpublished information. Real data may also be subject to change (potentially causing tests to fail incorrectly). Additionally, real data is likely to be quite large (slowing down the testing process) and contain a lot of noise i.e. elements that are not relevant for testing a specific function.
We will use the following data frame to test the wrangle data function. It contains only the three
columns used by the function and two rows. The values for crime are dummy values i.e. not the same
as the values used in the "real" data but that difference is not important for testing whether the
function works.
testing_df <- data.frame(
  crime = c("foo", "bar"),
  year = 2000:2001,
  month = "January"
)
Create a test for the wrangle data function: define the testing_df data frame in the test and then add expectations to test the four points listed above.
Run devtools::check() - this will also run the tests alongside the other checks.
The README acts as a "quick-start guide" for users of your package. It should include:
You can use a simple markdown README or dynamically generate one using R Markdown, which
lets you embed code chunks and provides several other extensions useful for writing
technical reports. The latter may be preferable if you want to demonstrate what some of
your code does. You can add a README with either usethis::use_readme_md() or
usethis::use_readme_rmd() depending on the type you want.
In the README, change the installation instructions to renv::install("git@github.com:moj-analytical-services/PACKAGE.git") (you will need to replace "PACKAGE" with the name of your package). You can also remove the line about installing a "development" version.
Run devtools::check() - if all the checks pass commit and push the README.
The NEWS markdown file functions as a change-log for your package. It must be updated every time you make changes to your package.
Add a NEWS file (using usethis::use_news_md()).
Run devtools::check() - if all the checks pass commit and push the NEWS file.
Congratulations, you have successfully produced a working package in R! Open a pull request and
merge it to the main branch.
GitHub Releases are a great way to manage the versions of your package. Every time you release an updated version of your package, include a GitHub release. This way if you ever need an older version of your package it is very easy to install using the GitHub Release Tag.
Merge the dev branch into main (delete the dev branch once it is merged)
Create a GitHub release; the tag should match the package version, e.g. for version 0.1.0 the tag will be v0.1.0. After typing the tag you will need to click on "Create new tag: ... on publish".
To install a package from a public GitHub repo using renv you just need the owner and the repo:
renv::install("moj-analytical-services/mojchart")
The easiest way to install a package from an internal or private GitHub repo is with the following (SSH URL) syntax:
renv::install("git@github.com:moj-analytical-services/mojchart.git")
Note: If your package has any Imports that are from internal or private repos you will need to also use this syntax in the Remotes field. For example the {psutils} package has {verify} as an import which is another internal package available from this SSH remote.
With renv >= 0.15.0 you can also include @ref on the end of the URL, where the "ref" is a
branch name, commit or GitHub tag e.g.
renv::install("git@github.com:moj-analytical-services/verify.git@v0.0.19")
You have released your package and have received some feedback from a user - "it would be better if the year was also included in the date column headings".
Create a new dev branch (if you first need to remove the existing one, run git branch -d dev in the terminal)
Run renv::install(). This function has special behaviour in the presence of a DESCRIPTION file - it will install the packages listed there. This behaviour is bugged in some versions of {renv}. If you get an error message, run renv::install("renv@0.15.4"), restart R (Ctrl+Shift+F10) then try again.
Run devtools::check(). This is to see if any changes in your package's dependencies have broken anything (the effectiveness of this will depend on the quality of your code and testing). Address any dependency related issues before making further changes.
Add the following line to dplyr::mutate() in wrangle_data():
month_fct = forcats::fct_relabel(.data$month_fct, ~ paste(.x, pub_year))
Run devtools::load_all() and devtools::test()
Run devtools::check()
Merge to main and generate a new GitHub release
Continuous integration is about automating software workflows. An automated workflow can be
set up so that when you or someone else pushes changes to github.com, tests are run to
ascertain whether there are any problems. These checks should include the unit tests you've
developed and also the R CMD tests (over 50 individual checks for common problems) carried
out when you run devtools::check().
Before setting up this automation, you should have fixed any problems identified by running the R CMD tests - see Section 7 - Checking your package.
To set up continuous integration using GitHub Actions:
usethis::use_github_actions()
This automatically puts a status badge in your README.
You can read further about automating checking in the R Packages Automated Checking chapter.
test_that("wrangle_data works", {
  testing_df <- data.frame(
    crime = c("foo", "bar"),
    year = 2000:2001,
    month = "January"
  )
  out_df_1 <- testing_df |> wrangle_data(pub_year = 2000)
  out_df_1 |> ncol() |> expect_equal(13)
  out_df_1 |> names() |> tail(12) |> expect_equal(month.name)
  out_df_1$crime |> expect_equal("foo")
  out_df_2 <- testing_df |> wrangle_data(pub_year = 2001)
  out_df_2$crime |> expect_equal("bar")
})
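If wrangle_data() is changed so that the year is included in the month headings (as in the exercise on responding to user feedback), the expectation on the column names needs updating too. The sketch below assumes the extra fct_relabel() line is added inside dplyr::mutate() directly after month_fct is created; it is one possible version rather than a definitive answer, and the library() call is only needed when running it outside a package's tests/ folder.

```r
library(testthat)

# wrangle_data() with the year appended to each month heading
# (assumed placement of the fct_relabel() line)
wrangle_data <- function(df, pub_year) {
  df |>
    dplyr::filter(.data$year == pub_year) |>
    dplyr::mutate(
      month_fct = forcats::fct(.data$month, month.name),
      month_fct = forcats::fct_relabel(.data$month_fct, ~ paste(.x, pub_year))
    ) |>
    dplyr::group_by(.data$crime, .data$month_fct, .drop = FALSE) |>
    dplyr::count() |>
    tidyr::pivot_wider(names_from = "month_fct", values_from = "n", values_fill = 0)
}

test_that("wrangle_data includes the year in the month headings", {
  testing_df <- data.frame(
    crime = c("foo", "bar"),
    year = 2000:2001,
    month = "January"
  )
  out_df <- testing_df |> wrangle_data(pub_year = 2000)
  # Headings are now the month name followed by the publication year
  out_df |> names() |> tail(12) |> expect_equal(paste(month.name, 2000))
})
```

The earlier expectation comparing the headings with month.name would fail after this change, which is exactly the kind of regression the tests are there to catch.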
Most R packages you install come from CRAN (The Comprehensive R Archive Network) which stores them on a series of mirrored servers that act as package repositories. For R versions before 4.4.0, the Analytical Platform is set up to use a fixed R package repository by default. Depending on the version of R on the Analytical Platform you are using, this may be fairly old. Run options("repos") in the console and look at the date at the end to see which version you are using. To access the latest versions of packages you can use the following to update where you install from (this will reset when R is restarted).
options(repos = "https://packagemanager.rstudio.com/all/__linux__/focal/latest")