Institute for Ophthalmic Research, University of Tübingen, Germany
October 25, 2018
Abstract
Single-cell transcriptomics yields ever-growing data sets containing RNA expression levels for
thousands of genes from up to hundreds of thousands of cells. Common data analysis pipelines
include a dimensionality reduction step for visualising the data in two dimensions, most frequently
performed using t-distributed stochastic neighbour embedding (t-SNE). It excels at revealing local
structure in high-dimensional data, but naive applications often suffer from severe shortcomings, e.g.
the global structure of the data is not represented accurately. Here we describe how to circumvent
such pitfalls and present a protocol for successful exploratory data analysis using t-SNE. The protocol
includes PCA initialisation, multi-scale similarity kernels, exaggeration, and downsampling-based
initialisation for very large data sets. We use published single-cell RNA-seq data sets to demonstrate
that this protocol yields superior results compared to the naive application of t-SNE.
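As an illustration of the PCA initialisation step named in the protocol, the following is a minimal sketch rather than the authors' implementation. It assumes that the starting layout is taken from the first two principal components and rescaled to a small spread; the function name `pca_initialisation`, the `target_std` value, and the random toy data are all hypothetical choices made for this example.

```python
import numpy as np


def pca_initialisation(X, n_components=2, target_std=1e-4):
    """Return a PCA-based starting layout for t-SNE (illustrative sketch).

    X is a (cells x genes) matrix. target_std is an assumed value that
    keeps the initial layout compact; it is not taken from the text above.
    """
    # Centre each gene so that the SVD yields principal components.
    X_centred = X - X.mean(axis=0)
    # Thin SVD: the columns of U * S are the principal component scores.
    U, S, _ = np.linalg.svd(X_centred, full_matrices=False)
    Y = U[:, :n_components] * S[:n_components]
    # Rescale so that the first component has the requested spread.
    return Y / Y[:, 0].std() * target_std


# Toy data standing in for an expression matrix (200 cells, 50 genes).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
init = pca_initialisation(X)
```

An array like `init` can then be passed to a t-SNE implementation that accepts a user-supplied starting layout, for example through the `init` argument of scikit-learn's `TSNE`.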
The art of using t-SNE for single-cell transcriptomics: