satijalab / seurat-data

Dataset distribution for Seurat
GNU General Public License v3.0
129 stars 46 forks source link
datasets single-cell single-cell-genomics

SeuratData

SeuratData is a mechanism for distributing datasets in the form of Seurat objects using R's internal package and data management systems. It represents an easy way for users to get access to datasets that are used in the Seurat vignettes.

Installation

Installation of SeuratData can be accomplished through devtools

devtools::install_github('satijalab/seurat-data')

Getting Started

When loading SeuratData, a list of all available datasets will be displayed (this is similar to other metapackages like tidyverse along with the version of Seurat used to create each dataset). This message can be suppressed with suppressPackageStartupMessages

> library(SeuratData)
── Installed datasets ───────────────────────────────────────────────────────────── SeuratData v0.1.0 ──
✔ cbmc   3.0.0                                           ✔ panc8  3.0.0
✔ ifnb   3.0.0                                           ✔ pbmc3k 3.0.0

───────────────────────────────────────────────── Key ──────────────────────────────────────────────────
✔ Dataset loaded successfully

To see a manifest of all available datasets, use AvailableData; this manifest will update as new datasets are uploaded to our data repository.

> AvailableData()
                     Dataset Version                                                        Summary species            system ncells                                                            tech         notes Installed InstalledVersion
cbmc.SeuratData         cbmc   3.0.0                   scRNAseq and 13-antibody sequencing of CBMCs   human CBMC (cord blood)   8617                                                        CITE-seq          <NA>      TRUE            3.0.0
hcabm40k.SeuratData hcabm40k   3.0.0 40,000 Cells From the Human Cell Atlas ICA Bone Marrow Dataset   human       bone marrow  40000                                                          10x v2          <NA>     FALSE            3.0.0
ifnb.SeuratData         ifnb   3.0.0                              IFNB-Stimulated and Control PBMCs   human              PBMC  13999                                                          10x v1          <NA>      TRUE            3.0.0
panc8.SeuratData       panc8   3.0.0               Eight Pancreas Datasets Across Five Technologies   human Pancreatic Islets  14892                SMARTSeq2, Fluidigm C1, CelSeq, CelSeq2, inDrops          <NA>      TRUE            3.0.0
pbmc3k.SeuratData     pbmc3k   3.0.0                                     3k PBMCs from 10X Genomics   human              PBMC   2700                                                          10x v1          <NA>      TRUE            3.0.0
pbmcsca.SeuratData   pbmcsca   3.0.0           Broad Institute PBMC Systematic Comparative Analysis   human              PBMC  31021 10x v2, 10x v3, SMARTSeq2, Seq-Well, inDrops, Drop-seq, CelSeq2 HCA benchmark     FALSE            3.0.0

Installation of datasets can be done with InstallData; this function will accept either a dataset name (eg. pbmc3k) or the corresponding package name (eg. pbmc3k.SeuratData). InstallData will automatically attach the installed dataset package so one can immediately load and use the dataset.

> InstallData("pbmc3k")

Loading a dataset is done using the data function

> data("pbmc3k")
> pbmc3k
An object of class Seurat
13714 features across 2700 samples within 1 assay
Active assay: RNA (13714 features)

Dataset documentation and information

All datasets provided have help pages built for them. These pages are accessed using the standard help function

> ?pbmc3k
> ?ifnb

A full command list for the steps taken to generate each dataset is present in the examples section of these help pages.

Packages will also often have citation information bundled with the package. Citation information can be accessed by passing the package name, not the dataset name, to the citation function

> citation('cbmc.SeuratData')

To cite the CBMC dataset, please use:

  Stoeckius et al. Simultaneous epitope and transcriptome measurement in
  single cells. Nature Methods (2017)

A BibTeX entry for LaTeX users is

  @Article{,
    author = {Marlon Stoeckius and Christoph Hafemeister and William Stephenson and Brian Houck-Loomis and Pratip K Chattopadhyay and Harold Swerdlow and Rahul Satija and Peter Smibert},
    title = {Simultaneous epitope and transcriptome measurement in single cells},
    journal = {Nature Methods},
    year = {2017},
    doi = {10.1038/nmeth.4380},
    url = {https://www.nature.com/articles/nmeth.4380},
  }

Rationale and Implementation

We created SeuratData in order to distribute datasets for Seurat vignettes in as painless and reproducible a way as possible. We also wanted to give users the flexibility to selectively install and load datasets of interest, to minimize disk storage and memory use.

To accomplish this, we opted to distribute datasets through individual R packages. Under the hood, SeuratData uses and extends standard R functions, such as install.packages for dataset installation, available.packages for dataset listing, and data for dataset loading.

SeuratData therefore serves as a more specific package manager (similar to a metapackage) for R. We provide wrappers around R's package management functions, extend them to provide relevant metadata about each dataset, and set default settings (for example, the repository where data is stored) to facilitate easy installation.