signals (single cell genomes with allele specificity) is a tool to estimate allele and haplotype specific copy number states in single cells with low coverage (0.01-0.1X). signals phases alleles based on losses and gains across all cells and then assigns allele specific states for each bin in each cell using a hidden markov model. Documentation is available here.
You can read more about signals in our paper.
You can install signals with the following command
devtools::install_github("shahcompbio/signals")
A docker image is available here.
signals
was developed to work with Direct Library Preperation + (DLP+) data. A high throughput single cell whole genome sequencing workflow, described in Laks et al.. As such it works using the output of the the mondiran pipeline developed to process this type of data, available here. Despite being developed with this type of data and pipeline in mind, it should work well with other single cell whole genome technologies. We have run it successfully with 10X CNV data for example. The required inputs are total copy number estimates in bins across the genome and haplotype block counts per cell (SNPs can also work). See the test datasets provided with the package for example inputs. If you have a different type of technology and would like some advice or help running signals please open an issue. We describe in more detail the necessary input below.
You will need the HMM copy results table (CNbins
) with the following columns: chr
, start
,end
, cell_id
, state
, copy
. state
is the inferred total copy number state. copy
values are GC-correceted, ploidy corrected normalized read counts (this is what HMMcopy uses to infer the total copy number states). You will also need the cell specific haplotype counts (allele_data
) as outputted by the inferhaps
and couthaps
sub commands in the pipeline. This includes the following columns: chr
, start
,end
, cell_id
, hap_label
, allele_id
, readcount
. allele_id
identifies the 2 alleles. hap_label
is a label given to the haplotype block, these labels are consistent across cells, and in combination with chr
gives each block a unique ID. These blocks are computed by finding SNP's that can be confidently phased together. See the section of "phasing uncertainty" in SHAPEIT here. This notion of phasing uncertainty is also used in remixt. Note that this notion of haplotype block is different to other tools such as CHISEL where it is assumed that the the phasing remains consistent over seem specified length such as 50kb. Finally the readcount
columns gives the sum of the number of reads for each allele in each block.
Other technologies and software should also be compatible with signals. For example, we have used 10X data successfully. If you have single cell bam files or fastq files see the detailed documentation for running our single cell pipeline here. Alternatively, we provide a lightweight snakemake pipeline with the key steps here. Also included there are some scripts to demultiplex 10X CNV bams.
If you total copy number calls from other software such as 10X cellranger or scope, these may also work but not something we have tried. Feel free to open an issue if you need some advice.
Some of the plotting tools assume that the cell_id
's conform to the following naming conventions
{sample}-{library}-{R*}-{C*}
Here R and C refer the rows and columns of the chip used in the DLP protocol. If you're using another technology and your cells are named differently, we would reccomend renaming your cell's for easy compatibility. For example, if you have 10X data where cell_id's are barcodes that look like CCGTACTTCACGGTAT-1
something like this would work
new_cell_id <- paste("mysample", "mylibrary", "CCGTACTTCACGGTAT-1", sep = "-")
It is imortant to have 4 string's seperated by "-", but the unique cell identifier should be one of the last 2 strings for the heatmap labeling to format nicely.
Here we'll show an example of running signals with a small dataset of 250 cells from the DLP platform. First we need to do some data wrangling to convert the allele_data
table from long to wide format.
library(signals)
haplotypes<- format_haplotypes_dlp(haplotypes, CNbins)
Then we can call haplotype specific state per cell:
hscn <- callHaplotypeSpecificCN(CNbins, haplotypes)
Or alternatively allele specific states:
ascn <- callAlleleSpecificCN(CNbins, haplotypes)
See the vignette for more information on the differences between these two outputs.
After performing this inference, to QC the results it is useful to plot a different representation of the BAF. Here we plot the BAF as a function of the inferred states. The black lines indicate where we should see the mean BAF based on the state.
plotBAFperstate(hscn, maxstate = 10)
After having ensured the results make sense, you can plot a heatmap of the states across all cells with the following.
plotHeatmap(hscn, plotcol = "state_BAF")
This will cluster the cells using umap and hdbscan.
signals
includes a number of other utilities for analysing single cell (haplotype-specific) copy number including the following:
Even if you haven't used signals
for allele specific copy number calling, if you format your dataframe to match hscn
then most functions should work. Even if you just have total copy number, many of the utilities should also work. Please see the vignettes for more information.
The main output is a dataframe with that is similar to the CNbins input file with the following additional columns:
A
A allele copy numberB
B allele copy numberstate_AS_phased
A|Bstate_min
Minor allele copy numberLOH
=LOH if bin is LOH, NO otherwisestate_phase
Discretized haplotype specific states (see below)phase
Whether the A allele or B allele is dominantalleleA
Counts for the A allelealleleB
Counts for the B alleletotalcounts
Total number of countsBAF
B-allele frequency (alleleB / totalcounts)state_phase
has the following states:
A-Hom
A allele is homozygous, ie LOH of B-AlleleB-Hom
B allele is homozygous, ie LOH of A-AlleleA-Gained
A allele is gained but B not lost (A > B)B-Gained
B allele is gained but A not lost (B > A)Balanced
A == B