Open jayhesselberth opened 5 years ago
If you plot the no cell bulk data it looks like this.
bulk_haircut_plot_NO_cells.pdf
If you remove points from the cell data that are points that fall above the lower 95% confidence interval for the no cell data, it looks like this. So most of these background points go away, but not all. I bet there's a better way to identify and remove these points. I can send you R code in a bit.
bulk_haircut_plot_remove_noise_points.pdf
Both plots are in the Images to share 20181017 folder.
It sounds like you only used data from the background data to identify sites (I think this is what I suggested).
A different approach would be to compare counts site-by-site between the cell/no-cell data and simply ask if the two sets of counts are different by a (non-parametric?) test.
We really need to get this data into a form suitable for using reprex to share code and plots; I uploaded data and a load script into data-raw/
but the generated Seurat object is too big to include as data in the package.
Here is a preprint that discusses a method to remove background (free) mRNA profiles from cell associated mRNA profiles. We could probably use a similar approach.
One option would be to start with the 10x barcode whitelist.
I used the SoupX package to remove the background data. It worked pretty well, but there may be a better way to do this. We could use a similar strategy, but could simplify the process since our background signal is much higher than you expect from free mRNA contamination.
library(Seurat)
#> Loading required package: ggplot2
#> Loading required package: cowplot
#>
#> Attaching package: 'cowplot'
#> The following object is masked from 'package:ggplot2':
#>
#> ggsave
#> Loading required package: Matrix
library(SoupX)
library(tidyverse)
library(Matrix)
dataDirs = c("/Users/mandyricher/hesselberthlab/projects/10x_haircut/20181214/data/for_SoupX/cellranger3")
scl = load10X(dataDirs)
#> Loading data for 10X channel Channel1 from /Users/mandyricher/hesselberthlab/projects/10x_haircut/20181214/data/for_SoupX/cellranger3
#Inferring non-expressed repair positions
scl = inferNonExpressedGenes(scl)
#> Inferring non-expressed genes for channel Channel1
tstGenes = rownames(scl$channels$Channel1$nonExpressedGenes)[seq(20)]
## Estimating the contamination fraction using 20 markers
scl = calculateContaminationFraction(scl, "Channel1", list(IG = tstGenes))
#Calculating contamination for each cell using the top 20 sites picked by functions above
scl = interpolateCellContamination(scl, "Channel1", useGlobal = TRUE)
#Rhos are the measure of background pre site
head(scl$channels$Channel1$rhos)
#> Channel1___AAACCTGGTACCATCA Channel1___AAACCTGGTATGAATG
#> 0.5880831 0.5880831
#> Channel1___AAACCTGTCAACACGT Channel1___AAACGGGAGAGCAATT
#> 0.5880831 0.5880831
#> Channel1___AAACGGGAGTACGCGA Channel1___AAACGGGAGTTCGCAT
#> 0.5880831 0.5880831
#Correcting data by removing counts using adjustCounts
#This just adjusts the counts by subtracting "soup" signal
scl = adjustCounts(scl)
#Correcting data by removing counts using the strainCells
#This changes the scale of the data and removes fewer sites
scl = strainCells(scl)
cntSoggy = Matrix::rowSums(scl$toc > 0)
cntStrained = Matrix::rowSums(scl$strainedExp > 0)
cntAdjusted = Matrix::rowSums(scl$atoc > 0)
#Make plots of most zeroed positions in the data
z = as.data.frame(cntSoggy - cntAdjusted)
x = as.data.frame(cntSoggy - cntStrained) %>%
mutate(gene = rownames(.),
sample = "strained") %>%
as_data_frame()
zeroed = z %>%
mutate(gene = rownames(.),
sample = "adjusted") %>%
as_data_frame() %>%
full_join(x, by = c("gene", "sample", "cntSoggy - cntAdjusted" = "cntSoggy - cntStrained")) %>%
separate(gene, into = c("hairpin", "position")) %>%
mutate(position = as.integer(position)) %>%
dplyr::rename(count = `cntSoggy - cntAdjusted`)
zeroed %>% ggplot(aes(x = position, y = count, color = sample)) +
geom_line() +
facet_wrap(~hairpin) +
geom_vline(xintercept = 44)
# Make plots of 'bulk' signal after background sites are removed
s = Matrix::rowSums(scl$strainedExp) %>% as.data.frame()
a = Matrix::rowSums(scl$atoc) %>% as.data.frame()
r = Matrix::rowSums(scl$toc) %>% as.data.frame()
strained = s %>%
mutate(gene = rownames(s),
sample = "strained") %>%
as_data_frame() %>%
dplyr::rename(count = ".")
adjusted = a %>%
mutate(gene = rownames(a),
sample = "adjusted") %>%
as_data_frame() %>%
dplyr::rename(count = ".")
bulk = r %>%
mutate(gene = rownames(r),
sample = 'raw') %>%
as_data_frame() %>%
dplyr::rename(count = ".") %>%
full_join(strained) %>%
full_join(adjusted) %>%
separate(gene, into = c("hairpin", "position")) %>%
mutate(position = as.integer(position))
#> Joining, by = c("count", "gene", "sample")
#> Joining, by = c("count", "gene", "sample")
bulk %>% ggplot(aes(x = position, y = count, color = sample)) +
geom_line() +
facet_wrap(~hairpin, scales = "free")
The strained sample is on a different scale so it doesn't show up on these plots
Created on 2018-12-29 by the reprex package (v0.2.1)