hesselberthlab / scrunchy

R toolkit for the analysis of single-cell functional heterogeneity
https://scrunchy.hesselberthlab.org/
Other
2 stars 3 forks source link

Identify cell-specific repair events #2

Open jayhesselberth opened 5 years ago

jayhesselberth commented 5 years ago
mandylr commented 5 years ago

If you plot the no cell bulk data it looks like this.

bulk_haircut_plot_NO_cells.pdf

If you remove points from the cell data that are points that fall above the lower 95% confidence interval for the no cell data, it looks like this. So most of these background points go away, but not all. I bet there's a better way to identify and remove these points. I can send you R code in a bit.

bulk_haircut_plot_remove_noise_points.pdf

Both plots are in the Images to share 20181017 folder.

jayhesselberth commented 5 years ago

It sounds like you only used data from the background data to identify sites (I think this is what I suggested).

A different approach would be to compare counts site-by-site between the cell/no-cell data and simply ask if the two sets of counts are different by a (non-parametric?) test.

We really need to get this data into a form suitable for using reprex to share code and plots; I uploaded data and a load script into data-raw/ but the generated Seurat object is too big to include as data in the package.

kriemo commented 5 years ago

Here is a preprint that discusses a method to remove background (free) mRNA profiles from cell associated mRNA profiles. We could probably use a similar approach.

https://www.biorxiv.org/content/early/2018/04/20/303727

jayhesselberth commented 5 years ago

One option would be to start with the 10x barcode whitelist.

mandylr commented 5 years ago

I used the SoupX package to remove the background data. It worked pretty well, but there may be a better way to do this. We could use a similar strategy, but could simplify the process since our background signal is much higher than you expect from free mRNA contamination.

library(Seurat)
#> Loading required package: ggplot2
#> Loading required package: cowplot
#> 
#> Attaching package: 'cowplot'
#> The following object is masked from 'package:ggplot2':
#> 
#>     ggsave
#> Loading required package: Matrix
library(SoupX)
library(tidyverse)
library(Matrix)

dataDirs = c("/Users/mandyricher/hesselberthlab/projects/10x_haircut/20181214/data/for_SoupX/cellranger3")
scl = load10X(dataDirs)
#> Loading data for 10X channel Channel1 from /Users/mandyricher/hesselberthlab/projects/10x_haircut/20181214/data/for_SoupX/cellranger3

#Inferring non-expressed repair positions
scl = inferNonExpressedGenes(scl)
#> Inferring non-expressed genes for channel Channel1
tstGenes = rownames(scl$channels$Channel1$nonExpressedGenes)[seq(20)]

## Estimating the contamination fraction using 20 markers
scl = calculateContaminationFraction(scl, "Channel1", list(IG = tstGenes))

#Calculating contamination for each cell using the top 20 sites picked by functions above
scl = interpolateCellContamination(scl, "Channel1", useGlobal = TRUE)

#Rhos are the measure of background pre site
head(scl$channels$Channel1$rhos)
#> Channel1___AAACCTGGTACCATCA Channel1___AAACCTGGTATGAATG 
#>                   0.5880831                   0.5880831 
#> Channel1___AAACCTGTCAACACGT Channel1___AAACGGGAGAGCAATT 
#>                   0.5880831                   0.5880831 
#> Channel1___AAACGGGAGTACGCGA Channel1___AAACGGGAGTTCGCAT 
#>                   0.5880831                   0.5880831

#Correcting data by removing counts using adjustCounts 
#This just adjusts the counts by subtracting "soup" signal
scl = adjustCounts(scl)

#Correcting data by removing counts using the strainCells
#This changes the scale of the data and removes fewer sites
scl = strainCells(scl)

cntSoggy = Matrix::rowSums(scl$toc > 0)
cntStrained = Matrix::rowSums(scl$strainedExp > 0)
cntAdjusted = Matrix::rowSums(scl$atoc > 0)

#Make plots of most zeroed positions in the data
z = as.data.frame(cntSoggy - cntAdjusted) 
x = as.data.frame(cntSoggy - cntStrained) %>%
        mutate(gene = rownames(.),
               sample = "strained") %>%
        as_data_frame()

zeroed = z %>% 
        mutate(gene = rownames(.),
               sample = "adjusted") %>% 
        as_data_frame() %>%
        full_join(x, by = c("gene", "sample", "cntSoggy - cntAdjusted" = "cntSoggy - cntStrained")) %>%
        separate(gene, into = c("hairpin", "position")) %>%
        mutate(position = as.integer(position)) %>%
        dplyr::rename(count = `cntSoggy - cntAdjusted`)

zeroed %>% ggplot(aes(x = position, y = count, color = sample)) + 
        geom_line() + 
        facet_wrap(~hairpin) + 
        geom_vline(xintercept = 44)

Most removed sites from data


# Make plots of 'bulk' signal after background sites are removed
s = Matrix::rowSums(scl$strainedExp) %>% as.data.frame()
a = Matrix::rowSums(scl$atoc) %>% as.data.frame()
r = Matrix::rowSums(scl$toc) %>% as.data.frame()

strained = s %>%
        mutate(gene = rownames(s),
               sample = "strained") %>%
        as_data_frame() %>%
        dplyr::rename(count = ".")

adjusted = a %>%
        mutate(gene = rownames(a),
               sample = "adjusted") %>%
        as_data_frame() %>%
        dplyr::rename(count = ".")

bulk = r %>% 
        mutate(gene = rownames(r),
               sample = 'raw') %>%
        as_data_frame() %>%
        dplyr::rename(count = ".") %>%
        full_join(strained) %>%
        full_join(adjusted) %>%
        separate(gene, into = c("hairpin", "position")) %>%
        mutate(position = as.integer(position))
#> Joining, by = c("count", "gene", "sample")
#> Joining, by = c("count", "gene", "sample")

bulk %>% ggplot(aes(x = position, y = count, color = sample)) + 
        geom_line() + 
        facet_wrap(~hairpin, scales = "free")

Bulk repair signal after background correction

The strained sample is on a different scale so it doesn't show up on these plots

Created on 2018-12-29 by the reprex package (v0.2.1)