niaid / dsb

Normalize CITEseq Data
Other
63 stars 13 forks source link

Finding the right cutoffs manually #20

Closed ColeKeenum closed 3 years ago

ColeKeenum commented 3 years ago

Hello,

Thanks for your work on this package, I think that this is a serious contribution to helping resolve a lot of issues that people are having with CITE-Seq and related single-cell techniques these days. I thought that other people might have had a similar issue as below, or that others will encounter this in the future.

I am trying to analyze a CITE-Seq dataset from mouse lung tissue that we generated with 10X. However, I cannot follow the procedure listed in your vignette because I do not see a third population of cells to the right-side of my histogram for the ADT data (shown below). Therefore, I am unsure how to identify a proper background for the ADT raw data. Could anyone provide guidance on how to define a background for proper analysis? Or have I performed improper sequencing?

raw_histogram

This is the same image, with a cutoff on the a-axis of 2.5:

zoomed_raw_histogram

I made these plots with the following code, adapted from the vignette:

prot <- readRDS("prot_merged.rds") #raw merged count matrices
rna <- readRDS("rna_merged.rds") #raw merged count matrices

rna_size = log10(Matrix::colSums(rna))
prot_size = log10(Matrix::colSums(prot))
ngene = Matrix::colSums(rna > 0)
mtgene = grep(pattern = "^mt-", rownames(rna), value = TRUE)
propmt = Matrix::colSums(rna[mtgene, ]) / Matrix::colSums(rna)
md = as.data.frame(cbind(propmt, rna_size, ngene, prot_size))
md$bc = rownames(md)

p1 = ggplot(md[md$rna_size > 0, ], aes(x = rna_size)) + geom_histogram(fill = "dodgerblue") + ggtitle("RNA library size \n distribution")
p2 = ggplot(md[md$prot_size> 0, ], aes(x = prot_size)) + geom_histogram(fill = "firebrick2") + ggtitle("Protein library size \n distribution")
cowplot::plot_grid(p1, p2, nrow = 1)
ggsave("raw_histogram.png", width = 10, height = 5)

p1 = ggplot(md[md$rna_size > 2.5, ], aes(x = rna_size)) + geom_histogram(fill = "dodgerblue") + ggtitle("RNA library size \n distribution")
p2 = ggplot(md[md$prot_size> 2.5, ], aes(x = prot_size)) + geom_histogram(fill = "firebrick2") + ggtitle("Protein library size \n distribution")
cowplot::plot_grid(p1, p2, nrow = 1)
ggsave("zoomed_raw_histogram.png", width = 10, height = 5)
MattPM commented 3 years ago

hi @ColeKeenum, thanks for using the method, those plots look good. It is a bit easier to estimate knowing the number of cells are you expecting to recover in your experiment?

** (ignore previous edit, I was looking at mRNA library size) If you use an x-axis cutoff from 1.5-5 (on the protein library plot), does the number of cells in the rightmost population match the number of cells you expected to recover?

If you are unsure about thresholds, you can also first use the filtered Cell Ranger output which contains only cells based on the 10X cell calling algorithm. Then, load the raw output, remove the cells defined by the filtered output from the raw output, and use the remaining filtered matrix as your background (I'd also remove the drops from the background matrix with prot lib size < 0.25)

ColeKeenum commented 3 years ago

I used your method based on the Cell Ranger output. I believe that there is a very high expressing "cell" population that is interfering with my clustering. This population seems to express essentially all ADTs in my staining panel, and I will run an outlier analysis to see if these can be removed.

Otherwise, Seurat V4 WNN UMAP projection seems to work well with the dsb normalization method. I will post here my results.

ColeKeenum commented 3 years ago

@MattPM, I was able to do a clustering with WNN Seurat V4 using ADT reads normalized with your method. However, I have noticed that some antibodies in my panel are poor quality. For example, my CD3 antibody is poor, but my CD4 and CD8a antibodies seem to work well. See below for a plot showing expression of ADTs (green) vs. gene (blue).

ForGithub_04062021

Do you think that some of my expression levels are too high? You will see that some antibodies have expression values up to 40 or 60, while others are at maximum less than 10.

You recommended above using a prot lib size < 0.25, but based on my histograms above, I am thinking of using a cutoff of 0.5 or 1.0 instead because these high expression values seem to suggest to me that I am underestimating my background droplet signal.

messersc commented 3 years ago

Hi Cole and Matt,

I do not want to hijack the thread, but the tips here have been very helpful for me so far. Using the information which barcodes cellranger thinks are good helped a lot when defining populations, as our data also showed 2 peaks instead of 3.

For example see this plot:

image

Especially the last plot can be used to make the cutoffs. I used all blue points (barcodes in the filtered_bc files from cellranger) as positives and the dark blue points to the left of those as the background while ignoring everything smaller ~2 prot_size. This seems to give us good results.

Regarding Cole's last question: The CD3-ADT does indeed look suspicious. Although for the rest, I have to say that the correlation between RNA and protein looks very good. But maybe you can try to find a better cutoff if you plot rna_size vs prot_size for your data.

Kind regards, Clemens

ColeKeenum commented 3 years ago

@messersc Thanks for the suggestion!

I had been using a ngene cutoff of 80, and a prot_size of >1.5 to define the low-quality droplets: background_drops = md[md$prot_size > 1.5 & md$ngene < 80, ]$bc

The results are similar to the ones that I showed above... but cleaner, so you were right.

Also, the CD3 antibody issue has been confirmed by Biolegend to be seen by other groups, including my own. I hope that this helps somebody out there!

MattPM commented 3 years ago

Thanks for the helpful discussions @messersc and @ColeKeenum. Since the cell ranger cell calling algorithm has improved from a couple years ago and most people are using Cell Ranger, I may update the documentation to highlight / recommend using the method above where the filtered output is used to define the cells as the default. One should still run quality control on cells after selecting them such as shown in this paper: https://pubmed.ncbi.nlm.nih.gov/31217225/ exact cutoffs are dataset and technology-specific and the way the documentation for dsb is written is very simple and designed to be generalizable / inclusive of other alignment tools like cite-seq-count.

One thing to be aware of if using the filtered output is you want the expect-cells argument from Cell Ranger to match the number of cells you expect to recover based on your experiment. This depends on the experiment and whether you are multiplexing samples / superloading the instrument.

expect-cells: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count

see the section "Calling Cell Barcodes" https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/algorithms/overview

Too low a value here could cause the cell calling algorithm to underestimate the cells in the filtered output (as an extreme example if using the default value of 3000 on an experiment where you sample multiplexed and super-loaded 40,000 cells the algorithm will likely under call cells). That's why the dsb documentation recommended just using the raw output and thresholding however as you have illustrated above, these thresholds can be experiment / panel specific and not obvious. Recommending using the filtered output with the caveat about the expect-cells argument could make it easier for most users.

MattPM commented 3 years ago

Thanks again for the discussion, as these questions have come up from multiple users in the past, I overhauled the documentation to make things more clear, please check it out in the main readme.

Since most users are using Cell Ranger, I made the default preprocessing pipeline use the cells called by Cell Ranger's droplet filtering method (the filtered output), which the EmptyDrops method (Lun et al). Background is then estimated from the raw and both background and cell containing matrices are further QCd prior to dsb.

ColeKeenum commented 3 years ago

Thank you so much for the updated information! I will be using the denoising feature and also checking to remove any unused proteins (I have already been doing this). Based on the updated documentation, I did not realize that PCA may not be necessary for clustering the DSB values. I will try this on my dataset and report back.