ropensci / paleobioDB

R interface to the Paleobiology Database
https://docs.ropensci.org/paleobioDB
42 stars 17 forks source link
r r-package rstats

paleobioDB

CRAN
status rstudio mirror
downloads R-CMD-check

paleobioDB is a package for downloading, visualizing and processing data from the Paleobiology Database.

Installation

Install the latest release from CRAN:

install.packages("paleobioDB")

Alternatively, you can install the development version of paleobioDB from GitHub with:

# install.packages("devtools")
devtools::install_github("ropensci/paleobioDB")

General overview

paleobioDB has 19 functions to wrap most endpoints of the PaleobioDB API, plus 8 functions to visualize and process the fossil data. The API documentation for the Paleobiology Database can be found here.

Download fossil occurrences from PaleobioDB

pbdb_occurrences

Here is an example of how to download all fossil occurrences that belong to the family Canidae in the Quaternary:

library(paleobioDB)

canidae <- pbdb_occurrences(
  base_name = "canidae",
  interval = "Quaternary",
  show = c("coords", "classext"),
  vocab = "pbdb",
  limit = "all"
)

dim(canidae)
#> [1] 1384   28

head(canidae, 3)
#>   occurrence_no record_type collection_no   identified_name identified_rank
#> 1        150070         occ         13293          Cuon sp.           genus
#> 2        192926         occ         19617    Canis edwardii         species
#> 3        192927         occ         19617 Canis armbrusteri         species
#>   identified_no     accepted_name accepted_rank accepted_no     early_interval
#> 1         41204              Cuon         genus       41204 Middle Pleistocene
#> 2         44838    Canis edwardii       species       44838       Irvingtonian
#> 3         44827 Canis armbrusteri       species       44827       Irvingtonian
#>      late_interval max_ma min_ma reference_no       lng      lat   phylum
#> 1 Late Pleistocene  0.781 0.0117         4412  111.5667 22.76667 Chordata
#> 2             <NA>  1.800 0.3000         2673 -112.4000 35.70000 Chordata
#> 3             <NA>  1.800 0.3000        52058 -112.4000 35.70000 Chordata
#>   phylum_no    class class_no     order order_no  family family_no genus
#> 1     33815 Mammalia    36651 Carnivora    36905 Canidae     41189  Cuon
#> 2     33815 Mammalia    36651 Carnivora    36905 Canidae     41189 Canis
#> 3     33815 Mammalia    36651 Carnivora    36905 Canidae     41189 Canis
#>   genus_no reid_no difference
#> 1    41204    <NA>       <NA>
#> 2    41198    8376       <NA>
#> 3    41198   30222       <NA>

Note that if the plotting and analysis functions of this package are going to be used (as demonstrated in the sections below), it is necessary to specify the parameter show = c("coords", "classext") in the pbdb_occurrences() function. This returns taxonomic and geographic information for the occurrences that is required by these functions.

Caution with the raw data

Beware of synonyms and errors, they could twist your estimations about species richness, evolutionary and extinction rates, etc. paleobioDB users should be critical about the raw data downloaded from the database and filter the data before analyzing it.

For instance, when using base_name for downloading information with the function pbdb_occurrences(), check out the synonyms and errors that could appear in accepted_name, genus, etc. If they are not corrected or eliminated, they will increase the richness of genera.

Map the fossil records

pbdb_map

Plots a map showing fossil occurrences and invisibly returns a data frame with the number of occurrences per coordinate.

(pbdb_map(canidae))

#>                               lng        lat Occur
#> -70.583885.-52.415833  -70.583885 -52.415833     1
#> -69.583336.-52.166668  -69.583336 -52.166668     1
#> -70.060555.-52.054722  -70.060555 -52.054722     1
#> -70.174446.-51.74778   -70.174446 -51.747780     1
#> -72.599998.-51.051388  -72.599998 -51.051388     1
....
#> 22.033333.46.966667     22.033333  46.966667    11
#> 11.25.45.416668         11.250000  45.416668    12
#> -117.33.900002        -117.000000  33.900002    13
#> 5.395.43.686111          5.395000  43.686111    14
#> -105.699997.39.299999 -105.699997  39.299999    15
#> 27.7208.-26.016701      27.720800 -26.016701    18

pbdb_map_occur

Returns a map and a raster object with the sampling effort (number of fossil records per cell). The user can change the resolution of the cells.

pbdb_map_occur(canidae, res = 5)

#> class       : SpatRaster 
#> dimensions  : 34, 74, 1  (nrow, ncol, nlyr)
#> resolution  : 5, 5  (x, y)
#> extent      : -180, 190, -85.19218, 84.80782  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 
#> source(s)   : memory
#> name        : sum 
#> min value   :   1 
#> max value   :  67

pbdb_map_richness

Returns a map and a raster object with the number of different species, genera, family, etc. per cell. As with pbdb_map_occur(), the user can change the resolution of the cells.

pbdb_map_richness(canidae, res = 5, rank = "species")

#> class       : SpatRaster 
#> dimensions  : 34, 74, 1  (nrow, ncol, nlyr)
#> resolution  : 5, 5  (x, y)
#> extent      : -180, 190, -85.19218, 84.80782  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 
#> source(s)   : memory
#> name        : sum 
#> min value   :   1 
#> max value   :  14

If you do not need the plot and you are only interested in obtaining a richness raster for some other purposes, you could use the argument do_plot = FALSE. For instance, this returns the same raster object as above (a SpatRaster object from the terra package) without plotting it:

pbdb_map_richness(canidae, res = 5, rank = "species", do_plot = FALSE)
#> class       : SpatRaster 
#> dimensions  : 34, 74, 1  (nrow, ncol, nlyr)
#> resolution  : 5, 5  (x, y)
#> extent      : -180, 190, -85.19218, 84.80782  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 
#> source(s)   : memory
#> name        : sum 
#> min value   :   1 
#> max value   :  14

The do_plot argument is available in all the functions in the package that produce a plot and return an object. This means that the plot is optional in all the other plotting functions that are described here. Check their documentation for more details.

Explore your fossil data

pbdb_temp_range

Returns a data frame and a plot with the time span of the species, genera, families, etc. in your query. Make sure that enough vertical space is provided in the graphics device used to do the plotting if there are many taxa of the specified rank in your data set.

pbdb_temp_range(canidae, rank = "species")

#>                             max    min
#> Canis lepophagus          4.900 0.0120
#> Canis gezi                4.000 0.7810
#> Canis chihliensis         3.600 0.7810
#> Canis palmidens           3.600 0.7810
#> Eucyon minor              3.600 0.7810
....
#> Urocyon littoralis        0.300 0.0000
#> Pseudalopex sechurae      0.126 0.0117
#> Vulpes macrotis           0.126 0.0117
#> Cubacyon transversidens   0.126 0.0000
#> Lycalopex griseus         0.126 0.0000
#> Speothos pacivorus        0.126 0.0000

pbdb_richness

Returns a data frame and a plot with the number of species (or genera, families, etc.) across time. You should set the temporal extent and the temporal resolution for the steps.

pbdb_richness(canidae, rank = "species", temporal_extent = c(0, 5), res = 0.5)

#>    temporal_intervals richness
#> 1               0-0.5       74
#> 2               0.5-1       87
#> 3               1-1.5       78
#> 4               1.5-2       81
#> 5               2-2.5       71
#> 6               2.5-3       71
#> 7               3-3.5       18
#> 8               3.5-4       15
#> 9               4-4.5        1
#> 10              4.5-5        1

pbdb_orig_ext

Returns a data frame and a plot with the number of appearances and dissapearances of taxa between consecutive time intervals in the data you provide. These time intervals are defined by the temporal extent (temporal_extent) and resolution (res) arguments. This is another way of visualizing the same information that is shown in the pbdb_temp_range() plot. orig_ext = 1 plots new appearances:

pbdb_orig_ext(
  canidae,
  rank = "species",
  orig_ext = 1, temporal_extent = c(0, 5), res = 0.5
)

#>                new ext
#> 0.5-1 to 0-0.5   6  19
#> 1-1.5 to 0.5-1   9   0
#> 1.5-2 to 1-1.5   2   5
#> 2-2.5 to 1.5-2  10   0
#> 2.5-3 to 2-2.5   0   0
#> 3-3.5 to 2.5-3  53   0
#> 3.5-4 to 3-3.5   3   0
#> 4-4.5 to 3.5-4  14   0
#> 4.5-5 to 4-4.5   0   0

And orig_ext = 2 plots disappearances of taxa between time intervals in the provided data frame.

pbdb_orig_ext(
  canidae,
  rank = "species",
  orig_ext = 2, temporal_extent = c(0, 5), res = 0.5
)

#>                new ext
#> 0.5-1 to 0-0.5   6  19
#> 1-1.5 to 0.5-1   9   0
#> 1.5-2 to 1-1.5   2   5
#> 2-2.5 to 1.5-2  10   0
#> 2.5-3 to 2-2.5   0   0
#> 3-3.5 to 2.5-3  53   0
#> 3.5-4 to 3-3.5   3   0
#> 4-4.5 to 3.5-4  14   0
#> 4.5-5 to 4-4.5   0   0

pbdb_subtaxa

Returns a plot and a data frame with the number of species, genera, families, etc. in your dataset.

pbdb_subtaxa(canidae)

#>   species genera families orders classes phyla
#> 1      98     27        1      1       1     1

pbdb_temporal_resolution

Returns a plot and a list with a summary of the temporal resolution of the fossil records.

pbdb_temporal_resolution(canidae)

#> $summary
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.0117  0.1143  0.6550  0.9617  1.8070  4.8880 
#> 
#> $temporal_resolution
#>    [1] 0.7693 1.5000 1.5000 1.5000 0.7820 0.7820 0.7820 0.7820 1.5000 1.5000
....
#> [1331] 1.8070 1.8070 1.8070 1.8070 1.8070 2.8190 2.8190 2.8190 2.8190 2.8190
#> [1341] 0.1260 2.5763 1.8070 0.4190 0.4190 0.4190 0.4190 0.4190 3.2190 0.7690
#> [1351] 0.0117 0.0117 0.0117 2.5880 2.5763 2.5763 0.7690 0.7690 0.7690 0.1143
#> [1361] 0.4190 0.1143 0.7690 2.5763 0.1143 0.7690 0.1143 2.5763 0.7690 0.6550
#> [1371] 0.6550 1.8070 2.8190 1.5000 2.5763 2.5763 2.5763 0.1143 2.5763 2.5763
#> [1381] 2.5763 2.5763 0.0117 0.0117

Docker

We include a Dockerfile to ease working on the package as it fulfills all its system dependencies.

How to load the package with Docker:

  1. Install Docker. Reference here: https://docs.docker.com/get-started/

  2. Build the docker image. From the root folder of this repository, type:

docker build -t rpbdb Docker

This command will create a docker image in your system based on some of the rocker/tidyverse images. You can see the new image with docker image ls.

  1. Start a container for this image. Type the following command picking some <password> of your choice.
docker run -d --name="rpbdb_rstudio" --rm -p 8787:8787 \
       -e PASSWORD=<password> -v $PWD:/home/rstudio rpbdb

This will start a container with access to your current folder where all the code of the package is. Inside the container, the code will be located in /home/rstudio. It also exposes the port 8787 of the container so you may access the RStudio web application which is bundled in the Rocker base image.

  1. Then, you can either:

    • Navigate to http://localhost:8787. Enter with username “rstudio” and the password you used in the command above.

    • Or you can enter the container via console with:

    docker exec -ti -u rstudio -w /home/rstudio rpbdb_rstudio R

Either from RStudio or from within the container you can install the package from source with:

install.packages(".", repos = NULL, type = "source")
  1. When you are finished, you can stop the container:
docker container stop rpbdb_rstudio

Meta

Please report any issues or bugs.

License: GPL-2

#> To cite package 'paleobioDB' in publications use:
#> 
#>   Varela S, González-Hernández J, Sgarbi LF, Marshall C, Uhen MD,
#>   Peters S, McClennen M (2015). "paleobioDB: an R package for
#>   downloading, visualizing and processing data from the Paleobiology
#>   Database." _Ecography_, *38*(4), 419-425. doi:10.1111/ecog.01154
#>   <https://doi.org/10.1111/ecog.01154>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {paleobioDB: an R package for downloading, visualizing and processing data from the Paleobiology Database.},
#>     author = {Sara Varela and Javier González-Hernández and Luciano F. Sgarbi and Charles Marshall and Mark D. Uhen and Shanan Peters and Michael McClennen},
#>     journal = {Ecography},
#>     year = {2015},
#>     volume = {38},
#>     number = {4},
#>     pages = {419-425},
#>     doi = {10.1111/ecog.01154},
#>   }

This package is part of the rOpenSci project.