resampling-stats / resampling-with

Source repository for third edition of "Resampling: The New Statistics"
https://resampling-stats.github.io

Resampling statistics, third edition

Material for the third edition of Resampling: The New Statistics, an update of the second edition by Julian L. Simon.

The new edition is by Julian L. Simon with Matthew Brett and Stéfan van der Walt.

The latest version will always be at the book website.

We release the material in this repository under a CC-BY-ND license, unless otherwise specified. See LICENSE.md in this directory for details.

The source text that we build into the book is in the source directory.

There are source (Markdown) versions of chapters from the second edition in the unported directory. As we fill out the third edition, we move these files into the source directory, and edit them there.

Setup for editing and proofing

Create and switch to virtual environment

Typically, you will want to install Python build dependencies in a virtual environment. You can place such an environment anywhere you like. This is how to create it in ~/envs:

mkdir -p ~/envs
python -m venv ~/envs/resampling-with
source ~/envs/resampling-with/bin/activate

Install build dependencies

export PIP_INSTALL_CMD="pip install"
make build-init

You will also need rsvg-convert, inkscape, and pandoc. On macOS, those can be installed with:

brew install librsvg inkscape pandoc

On Fedora, with:

sudo dnf install R-rsvg inkscape pandoc

See the Pandoc installation guide for suggestions to install rsvg-convert on other platforms.

Make sure that your rmarkdown package is recent enough to work with your pandoc version. Pandoc 2.11 and later use --citeproc rather than --filter pandoc-citeproc. An rmarkdown version older than 2.5 does not know this, and raises an error during the book build; see the RMarkdown release notes. Check your version with library(rmarkdown); sessionInfo() and upgrade with install.packages('rmarkdown').
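The version rule above can be checked mechanically. The following is a minimal Python sketch, not part of the repository's build tooling; the function names parse_version and citeproc_mismatch are hypothetical, and the version strings are illustrative. It compares versions as tuples of integers, because comparing them as strings or decimals would wrongly rank 2.9 above 2.11.

```python
import re


def parse_version(text):
    """Return the first dotted version number found in `text` as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def citeproc_mismatch(pandoc_text, rmarkdown_text):
    """True when pandoc expects --citeproc but rmarkdown predates that change."""
    pandoc_uses_citeproc = parse_version(pandoc_text) >= (2, 11)
    rmarkdown_too_old = parse_version(rmarkdown_text) < (2, 5)
    return pandoc_uses_citeproc and rmarkdown_too_old


# Version strings here are illustrative, not read from a real installation.
print(citeproc_mismatch("pandoc 2.11.4", "2.3"))  # True: upgrade rmarkdown
print(citeproc_mismatch("pandoc 2.9.2", "2.3"))   # False: older pandoc still uses the filter
print(citeproc_mismatch("pandoc 3.1.9", "2.25"))  # False: both are recent enough
```

To feed it real values, pass the first line of `pandoc --version` and the output of `Rscript -e 'cat(as.character(packageVersion("rmarkdown")))'`.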

Quarto

We use Quarto as the build machinery for the website and for PDF.

See the Quarto installation instructions. Afterwards, install the matching R package.

Rscript -e 'install.packages("quarto")'

If R complains that the CRAN mirror is not set, add the following to ~/.Rprofile and try again:

local({r <- getOption("repos")
       r["CRAN"] <- "http://cran.r-project.org"
       options(repos=r)})

The process may fail if it cannot find curl and openssl development headers. The error message explains how to install those headers on the various systems. For example, on Fedora it'd be:

sudo dnf install libcurl-devel openssl-devel

Finally, check the installation:

quarto check install
quarto check knitr

Quarto uses various Pandoc markdown extensions, as do we (Div and Span elements for custom inline elements and blocks).

Writing and updating the text

Follow the build instructions above.

Ensure that your virtual environment is activated:

source ~/envs/resampling-with/bin/activate

Make sure you can build the whole book in your current environment with:

make clean && make python-book
make clean && make r-book

from the top-level repository directory. If this doesn't work, open an issue on GitHub.

Be careful, and note the make clean steps above: caching can trip up the build. In general, try make clean if you run into puzzling build problems, such as data frames reported as undefined when they are clearly defined, or imports reported as missing when they are present.

After you've confirmed you can build both the Python and the R edition, you may want to work on only one of these editions — say the Python book, and clean up the R book later (or the other way round).

Matthew and Peter know R reasonably well — we can help with R cleanup.

Starting work on a new chapter

See the ./source/_quarto.yml.template file for a list of the chapters currently in the book build.

Let's say you want to start work on one of the chapters, and you've seen this in the _quarto.yml.template file:

    - reliability_average.Rmd

The procedure is:

Before you start:

Editing:

Initial port checklist

Notebooks

Make a notebook section with e.g.

::: {.notebook name="ambulances" title="Ambulances"}

Stuff

```{python}
# A Python cell - it only gets run in the Python edition.
k = 1
```

```{r}
# An R cell - it only gets run in the R edition
k <- 1
```

The value of `k` is `r get_var('k')`.

:::

Content that should go only in the output notebooks

The notebooks (above) get written out as separate documents. By default, they may lack some context, if the reader can't see the preceding text in the chapter. So, sometimes it's useful to put an introductory paragraph at the top of the notebook to give the context, but where that paragraph does not appear in the main text, like this:

::: {.notebook name="another_notebook" title="Another Notebook"}

::: nb-only
This appears only in the output notebook file, and not in the main text.
:::

```{python}
# A Python cell.
k = 1
```

```{r}
# An R cell.
k <- 1
```
:::

Edition-specific content

You have inline and block markup to put content into just the R edition, or just the Python edition:

::: r

This only appears in the R edition
:::

::: python

This only appears in the Python edition
:::

This content appears [only in the R edition]{.r}[just in the Python
book]{.python}.

If you find yourself doing that often, you can define a version-dependent variable in the text_variables.yml file. Use it with e.g.

Here I have text that depends on the version of the book — {{< var my_var >}}
— as determined by the `text_variables.yml` file.

Citations

Citations are in Pandoc format, as implemented in Quarto's citations.

Check whether the reference is already in source/simon_refs.bib. If it is not, add it, following the reference naming standard in that file (e.g. @article{christensen2005fisher,). Cite with e.g. This is a terrible idea [@christensen2005fisher] or As Christensen notes [-@christensen2005fisher], or There are many good ways to do this [see @knuth1984, pp. 33-35; also @wickham2015, chap. 1]. See the Quarto link above for other examples.
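Checking for an existing key before adding a reference can be scripted. This is a minimal Python sketch, assuming each entry opens with `@type{key,` at the start of a line, as in the example above. The helper bib_keys is hypothetical, not something the repository provides, and the sample entry body is a placeholder rather than real bibliographic data.

```python
import re

# Matches the opening of an entry such as "@article{christensen2005fisher,".
ENTRY_START = re.compile(r"^\s*@\w+\s*\{\s*([^,\s{}]+)\s*,", re.MULTILINE)


def bib_keys(bib_text):
    """Return the set of citation keys defined in BibTeX source text."""
    return set(ENTRY_START.findall(bib_text))


# Placeholder entry for illustration; the fields are not real bibliographic data.
sample = """
@article{christensen2005fisher,
  title = {Placeholder title},
}
"""

keys = bib_keys(sample)
print("christensen2005fisher" in keys)  # True: already present, just cite it
print("knuth1984" in keys)              # False: needs adding
```

For the real file, read the text with `pathlib.Path("source/simon_refs.bib").read_text()` from the top-level repository directory and pass it to the same function.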

Footnotes

See Quarto footnotes.

Examples (from that page):

Here is a footnote reference,[^1] and another.[^longnote]

[^1]: Here is the footnote.

[^longnote]: Here's one with multiple blocks.

    Subsequent paragraphs are indented to show that they
    belong to the previous footnote.

Here is an inline note.^[Inlines notes are easier to write,
since you don't have to pick an identifier and move down to
type the note.]

Notice that you'll need full 4-space indentation to keep the footnote paragraphs within the footnote.

Writing notes for the reader (callout)

See Quarto callouts

:::{.callout-note}
## A title for the note

Some text
:::

Also .callout-warning, -tip, -important, -caution.

Comments to your fellow authors

HTML comments:

<!---
More here on something and something else
-->

Note the triple dash in the first line, as in <!---.

Cross-references

See Cross-references in Quarto. Summary for a section reference: add {#sec-name-for-your-ref} to the target section title, then reference it with Please see section @sec-name-for-your-ref for details.

Tables

Basic tables

| Col1 | Col2 | Col3 |
|------|------|------|
| A    | B    | C    |
| E    | F    | G    |
| A    | G    | G    |

: My Caption {#tbl-letters}

See @tbl-letters.

Pipe tables

These support per-column alignment:

| Default | Left | Right | Center |
|---------|:-----|------:|:------:|
| 12      | 12   |    12 |   12   |
| 123     | 123  |   123 |  123   |
| 1       | 1    |     1 |   1    |

: A pipe table {#tbl-pipe}

See @tbl-pipe.

See Quarto tables

Grid tables

These are the most flexible, but are fiddly to edit because they care about spacing. Use these for headers, footers, fusing cells, etc.

See: Pandoc grid tables and inference_ideas.Rmd for examples.

There is a good summary of Markdown tables on this page.

Loading .Rmd files as Jupyter notebooks

Install Jupyter:

pip install jupyter

Start JupyterLab in the source directory, right-click on an .Rmd file, and choose "Open as notebook". If this option does not appear, ensure that Jupytext is installed.

SVG figures

SVG figures need to be converted to PNG for the HTML build and PDF for the PDF build. We automate this with the build system. To use the automation:

include_svg('diagrams/ships_gold_silver.svg')

Here ships_gold_silver.svg is the file in diagrams. The remaining settings belong in the header of the code chunk that contains this call: fig-ships-gold-silver is the reference label for the figure, opts.label='svg_fig' sets the correct display options for the figure, and fig.cap gives the caption.

You can then refer to the figure elsewhere (for this example) as @fig-ships-gold-silver.

Search for include_svg in the source directory for other examples.
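For a sense of what the automated conversion involves, here is a minimal Python sketch that builds rsvg-convert commands for the two output formats. It is an illustration only: the helper name svg_convert_command and the output locations are assumptions, and the repository's build system may invoke different tools or options.

```python
from pathlib import PurePosixPath

# Output format wanted by each build, as described above.
BUILD_FORMATS = {"html": "png", "pdf": "pdf"}


def svg_convert_command(svg_path, build):
    """Return an rsvg-convert argument list for the given build type."""
    fmt = BUILD_FORMATS[build]
    # Assumed output location: next to the source file, with the new suffix.
    out_path = PurePosixPath(svg_path).with_suffix("." + fmt)
    return ["rsvg-convert", "-f", fmt, "-o", str(out_path), str(svg_path)]


for build in BUILD_FORMATS:
    print(svg_convert_command("diagrams/ships_gold_silver.svg", build))
```

Passing one of these lists to `subprocess.run(cmd, check=True)` would perform the conversion, provided rsvg-convert is on the PATH.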

Support code

Sometimes we generate figures and tables on the fly using code that should not appear in the final book. For examples, see more_sampling_tools.Rmd.

The code could be Python or R.

By default, the build system executes Python code chunks only in the Python edition, and R code chunks in the R edition. For supporting code, to be run in both editions, you need to tell Quarto (in fact, Knitr) to always execute the code (regardless of whether we are currently building the R or Python edition); use the option eval=TRUE.

Nearly always, you will want to hide the source code for those chunks. Use the chunk option echo=FALSE to do that.

Sometimes (sometimes not) you will want to suppress the output. You can add results=FALSE to the options to suppress the output, or use include=FALSE to suppress the code and output (equivalent to echo=FALSE, results=FALSE).

Here's an example, from intro.Rmd:

```{python, eval=TRUE, echo=FALSE}
import os.path as op
import numpy as np
import pandas as pd
lake = pd.read_csv(op.join('data', 'lough_erne.csv'))
yearly_srp = lake.loc[:, ['Year', 'SRP']].copy()
```

```{r, label="tbl-yearly-srp", eval=TRUE, echo=FALSE}
ketable(py$yearly_srp,
        caption = "Soluble Reactive Phosphorus in Lough Erne")
```

Notice in the case above, that there is no output from the first chunk (so we don't need to suppress it with results=FALSE or include=FALSE). We do want the output results from the second chunk.

See the Knitr chunk options documentation for more detail.

You can use the kableExtra::column_spec options to tune table formatting — see resampling_method.Rmd for an example.

More setup for Jupyter

For the Jupyter notebook, you might want to enable the R magics, to allow you to run both the R code and the Python code, in the same notebook, without any extra cells.

First, install rpy2 into your virtualenv:

pip install rpy2

Find your IPython default configuration directory. From Jupyter:

get_ipython().profile_dir.startup_dir

Then make a file in that directory with a name such as 03_rpy2.ipy, with contents including:

try:
    import rpy2
except ImportError:
    pass
else:
    %load_ext rpy2.ipython

The .ipy extension is to allow the % magic commands. Now you should be able to work with the %%R cell magics.

Useful links

Notes for concepts in other discussions in the book

See the notes repository for more discussions of various concepts in the book, and how we are thinking about them.

Chapter mappings

These are the mappings between the files in the ./source directory, and the original chapters from the second edition website. You can find basic Markdown ports of the original second edition PDF chapters in the ./unported directory of the repository.

See also ./source/_quarto.yml.template for files making up chapters in current built book.

| Third edition file | Second edition file(s) | Third edition chapter title |
|--------------------|------------------------|-----------------------------|
| preface_third.Rmd | N/A | Preface to the third edition |
| preface_second.Rmd | 01-Preface | Preface to the second edition |
| intro.Rmd | 02-Intro, 04-Afternote-2 | Introduction |
| monty_hall.Rmd | N/A | N/A |
| dramatizing_resampling.Rmd | 03-Afternote-1 | N/A |
| resampling_method.Rmd | 05-Chap-1 | The resampling method |
| about_technology.Rmd | N/A | Introducing Python ... |
| resampling_with_code.Rmd | N/A | Resampling with code |
| resampling_with_code2.Rmd | N/A | More resampling with code |
| what_is_probability.Rmd | 06-Chap-2, 07-Chap-3 | What is probability? |
| probability_theory_1a.Rmd | 08-Chap-4 | NA |
| probability_theory_1b.Rmd | 09-Chap-5 | NA |
| probability_theory_2_compound.Rmd | 10-Chap-6 | NA |
| probability_theory_3.Rmd | 11-Chap-7 | NA |
| probability_theory_4_finite.Rmd | 12-Chap-8 | NA |
| sampling_variability.Rmd | 13-Chap-9 | NA |
| monte_carlo.Rmd | 14-Chap-10 | NA |
| inference_ideas.Rmd | 15-Chap-11 | NA |
| inference_intro.Rmd | 16-Chap-12 | NA |
| point_estimation.Rmd | 17-Chap-13 | NA |
| framing_questions.Rmd | 18-Chap-14 | NA |
| testing_counts_1.Rmd | 19-Chap-15 | NA |
| significance.Rmd | 20-Chap-16 | NA |
| testing_counts_2.Rmd | 21-Chap-17 | NA |
| testing_measured.Rmd | 22-Chap-18 | NA |
| testing_procedures.Rmd | 23-Chap-19 | NA |
| confidence_1.Rmd | 24-Chap-20 | NA |
| confidence_2.Rmd | 25-Chap-21 | NA |
| reliability_average.Rmd | 26-Chap-22 | NA |
| correlation_causation.Rmd | 27-Chap-23 | NA |
| how_big_sample.Rmd | 28-Chap-24 | NA |
| bayes_simulation.Rmd | 29-Chap-25 | NA |
| exercise_solutions.Rmd | 30-Exercise-sol | NA |
| acknowlegements.Rmd | acknow | NA |
| technical_note.Rmd | Technical | NA |

Initial text for this table generated using:

grep ed2_fname *.Rmd | grep -v _main | sed 's/:.*ed2_fname//' | sort -t ':' -k 2 -