shahcompbio / signals

Single Cell Genomes with Allele Specificity
Other
8 stars 1 forks source link
bioinformatics rna-seq single-cell

signals

R build status Codecov test coverage Docker DOI R-CMD-check

signals (single cell genomes with allele specificity) is a tool to estimate allele and haplotype specific copy number states in single cells with low coverage (0.01-0.1X). signals phases alleles based on losses and gains across all cells and then assigns allele specific states for each bin in each cell using a hidden markov model. Documentation is available here.

You can read more about signals in our paper.

Installation

You can install signals with the following command

devtools::install_github("shahcompbio/signals")

A docker image is available here.

Input data

signals was developed to work with Direct Library Preperation + (DLP+) data. A high throughput single cell whole genome sequencing workflow, described in Laks et al.. As such it works using the output of the the mondiran pipeline developed to process this type of data, available here. Despite being developed with this type of data and pipeline in mind, it should work well with other single cell whole genome technologies. We have run it successfully with 10X CNV data for example. The required inputs are total copy number estimates in bins across the genome and haplotype block counts per cell (SNPs can also work). See the test datasets provided with the package for example inputs. If you have a different type of technology and would like some advice or help running signals please open an issue. We describe in more detail the necessary input below.

DLP+ data

You will need the HMM copy results table (CNbins) with the following columns: chr, start,end, cell_id, state, copy. state is the inferred total copy number state. copy values are GC-correceted, ploidy corrected normalized read counts (this is what HMMcopy uses to infer the total copy number states). You will also need the cell specific haplotype counts (allele_data) as outputted by the inferhaps and couthaps sub commands in the pipeline. This includes the following columns: chr, start,end, cell_id, hap_label, allele_id, readcount. allele_id identifies the 2 alleles. hap_label is a label given to the haplotype block, these labels are consistent across cells, and in combination with chr gives each block a unique ID. These blocks are computed by finding SNP's that can be confidently phased together. See the section of "phasing uncertainty" in SHAPEIT here. This notion of phasing uncertainty is also used in remixt. Note that this notion of haplotype block is different to other tools such as CHISEL where it is assumed that the the phasing remains consistent over seem specified length such as 50kb. Finally the readcount columns gives the sum of the number of reads for each allele in each block.

Other technologies

Other technologies and software should also be compatible with signals. For example, we have used 10X data successfully. If you have single cell bam files or fastq files see the detailed documentation for running our single cell pipeline here. Alternatively, we provide a lightweight snakemake pipeline with the key steps here. Also included there are some scripts to demultiplex 10X CNV bams.

If you total copy number calls from other software such as 10X cellranger or scope, these may also work but not something we have tried. Feel free to open an issue if you need some advice.

Data conventions

Some of the plotting tools assume that the cell_id's conform to the following naming conventions

{sample}-{library}-{R*}-{C*}

Here R and C refer the rows and columns of the chip used in the DLP protocol. If you're using another technology and your cells are named differently, we would reccomend renaming your cell's for easy compatibility. For example, if you have 10X data where cell_id's are barcodes that look like CCGTACTTCACGGTAT-1 something like this would work

new_cell_id <- paste("mysample", "mylibrary", "CCGTACTTCACGGTAT-1", sep = "-")

It is imortant to have 4 string's seperated by "-", but the unique cell identifier should be one of the last 2 strings for the heatmap labeling to format nicely.

Example

Here we'll show an example of running signals with a small dataset of 250 cells from the DLP platform. First we need to do some data wrangling to convert the allele_data table from long to wide format.

library(signals)
haplotypes<- format_haplotypes_dlp(haplotypes, CNbins)

Then we can call haplotype specific state per cell:

hscn <- callHaplotypeSpecificCN(CNbins, haplotypes)

Or alternatively allele specific states:

ascn <- callAlleleSpecificCN(CNbins, haplotypes)

See the vignette for more information on the differences between these two outputs.

After performing this inference, to QC the results it is useful to plot a different representation of the BAF. Here we plot the BAF as a function of the inferred states. The black lines indicate where we should see the mean BAF based on the state.

plotBAFperstate(hscn, maxstate = 10)

After having ensured the results make sense, you can plot a heatmap of the states across all cells with the following.

plotHeatmap(hscn, plotcol = "state_BAF")

This will cluster the cells using umap and hdbscan.

Utilities

signals includes a number of other utilities for analysing single cell (haplotype-specific) copy number including the following:

Even if you haven't used signals for allele specific copy number calling, if you format your dataframe to match hscn then most functions should work. Even if you just have total copy number, many of the utilities should also work. Please see the vignettes for more information.

Outputs

The main output is a dataframe with that is similar to the CNbins input file with the following additional columns:

state_phase has the following states: