niaid / dsb

Normalize CITEseq Data

cellranger Vs CITE-seq-count #5

Closed sorjuela closed 4 years ago

sorjuela commented 4 years ago

Hi @MattPM ,

I'm just coming back to your package, and I noticed in your vignette you have:

# read in HTO data (output from citeseq count)
hto_data = Read10X('citeseq_count/HTO_umi_40k/', gene.column=1)

# read in ADT data (output from citeseq count)
adt_data = Read10X('citeseq_count/ADT_umi_40k/', gene.column=1)

# read in mRNA data (output from 10x cellranger)
rna_data = Read10X(data.dir = "cellranger_count/raw_feature_bc_matrix/")

Can't I just use all the ADT+HTO+RNA outputs from cellranger? Or is there a reason why I shouldn't align ADT with cellranger?

MattPM commented 4 years ago

Hi, the short answer is yes: the ADT+HTO output from cellranger should work and would make the workflow simpler, but I have not tested it. It is still a good idea to use the raw (unfiltered) output from cellranger so that you retain the background signal. A couple of years ago 10X did not yet support CITE-seq, and we have stuck with the CITE-seq-Count pipeline since then because it works well.

One thing to note: there are more total possible cell barcode sequences for the 10X V3 chemistry than for V2 (1.6M vs ~700k), so there is a larger number of theoretical empty droplets from which to estimate the ADT background signal. The dsb normalization works well with both versions (we have tested this). How heavily the background droplets are filtered, and the sequencing depth, can influence the absolute magnitude of the dsb normalized values, since normalized values are defined as the number of standard deviations from the background mean. A good place to start is to filter out the barcodes that have no reads from the raw output, then estimate the background distribution from what remains. Let me know how the normalization works; I'm happy to assist.
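To make the "filter zero-read barcodes, then measure cells against the background" idea concrete, here is a minimal base R sketch on simulated counts. It is not the dsb package's implementation: the `log1p` transform, the total-count cutoff of 100, and every object name are assumptions chosen only for illustration.

```r
# Simulated raw ADT matrix: proteins in rows, barcodes in columns.
set.seed(1)
n_prot <- 3
adt_raw <- cbind(
  matrix(0L, n_prot, 50),                             # barcodes with no reads
  matrix(rpois(n_prot * 200, lambda = 5), n_prot),    # ambient-only droplets
  matrix(rpois(n_prot * 100, lambda = 200), n_prot)   # cell-containing droplets
)
rownames(adt_raw) <- paste0("prot", seq_len(n_prot))
colnames(adt_raw) <- paste0("bc", seq_len(ncol(adt_raw)))

# Step 1: drop barcodes that have no ADT reads at all.
adt_nonzero <- adt_raw[, colSums(adt_raw) > 0, drop = FALSE]

# Step 2: split droplets into background and cells.
# The cutoff of 100 total counts is hypothetical and only suits this toy data.
is_background <- colSums(adt_nonzero) < 100
log_bg    <- log1p(adt_nonzero[, is_background, drop = FALSE])
log_cells <- log1p(adt_nonzero[, !is_background, drop = FALSE])

# Step 3: per-protein background mean and standard deviation.
bg_mean <- rowMeans(log_bg)
bg_sd   <- apply(log_bg, 1, sd)

# Step 4: express each cell value as standard deviations from the background mean.
# A length-nrow vector recycles down the columns, so each row uses its own protein's stats.
cells_norm <- (log_cells - bg_mean) / bg_sd
```

Because the background mean and standard deviation sit in the numerator and denominator, keeping more or fewer low-count barcodes in `log_bg` shifts the scale of `cells_norm`, which is the sensitivity described above.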

sorjuela commented 4 years ago

Hi @MattPM , thanks for the answer! Also regarding the vignette, in the section "How to get empty drops without cell hashing or sample demultiplexing?" I define the negative object, and then I extract the ADT count matrix right? I can select my negative object by running emptyDrops on the RNA, or use the code you have: umi = SeuratObject$nUMI. (1) This last one I'm not sure where it comes from. (2) The assumption is that if I select barcodes with background RNA, these will also have background ADT?

MattPM commented 4 years ago

Hi @sorjuela

I see how that part of the vignette might not be so clear, thanks.

"I define the negative object, and then I extract the ADT count matrix right?"

That's correct.

The idea I'm trying to get across in that part of the vignette is that if someone doesn't have hashing / multiseq / other sample barcoding data (which provides a simple way to define a "negative" droplet), one can use the mRNA data alone, as a measurement orthogonal to the protein data, to assess whether a droplet contains an actual cell.

Often fixed thresholds are used for QC, say removing any droplets containing fewer than 500 unique genes or fewer than 1500 UMIs. You wouldn't want to load the raw output from cellranger, take the droplets with < 500 genes, and use all of them as the negative object, since droplets with e.g. 490 unique genes are probably decent cells rather than background. In the vignette I just give an example where background is defined as droplets more than 5 standard deviations below the mean in terms of mRNA. In reality one would want to look at

hist(seurat_object$nUMI); # same for sce object

to set a cutoff that separates droplets that were removed from the data but are probably cells from the true background "soup" of ambient mRNA and protein. EmptyDrops would theoretically work too.
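The mRNA-based definition of negative droplets can be sketched in base R as follows. This is an illustration on simulated data, not the vignette's code: the names `umi`, `adt_raw`, `passes_qc`, and `cutoff_log10` are hypothetical, and the cutoff (5 standard deviations below the mean log10 UMI of droplets that pass a fixed 1500-UMI QC threshold) is just one way to read the rule described above.

```r
# Simulated per-droplet mRNA UMI totals: many ambient droplets plus some cells.
set.seed(42)
n_bg <- 2000
n_cell <- 500
umi <- c(round(10^rnorm(n_bg, mean = 1.5, sd = 0.3)),    # ambient "soup"
         round(10^rnorm(n_cell, mean = 3.5, sd = 0.2)))  # real cells
names(umi) <- paste0("bc", seq_along(umi))

# Simulated raw ADT matrix that shares the same barcodes (columns).
adt_raw <- matrix(rpois(2 * length(umi), lambda = 5), nrow = 2,
                  dimnames = list(c("prot1", "prot2"), names(umi)))

# Inspect the distribution before committing to any cutoff.
log_umi <- log10(umi + 1)
hist(log_umi, breaks = 50, main = "mRNA UMIs per droplet",
     xlab = "log10(UMI + 1)")

# Droplets that pass a fixed QC threshold are treated as cells.
passes_qc <- umi >= 1500

# Hypothetical background cutoff: 5 s.d. below the mean of the QC-passing droplets.
cutoff_log10 <- mean(log_umi[passes_qc]) - 5 * sd(log_umi[passes_qc])

# Negative droplets fall below the cutoff. Droplets between the cutoff and the
# QC threshold are left out of both sets, since they may be low-quality cells.
is_negative  <- log_umi < cutoff_log10
is_ambiguous <- !is_negative & !passes_qc

# Extract the ADT counts for negative droplets and for cells.
adt_negative <- adt_raw[, is_negative, drop = FALSE]
adt_cells    <- adt_raw[, passes_qc, drop = FALSE]
```

Leaving the in-between droplets out of the negative set follows the point made above: a droplet just under the QC threshold is more likely a poor-quality cell than ambient background.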

sorjuela commented 4 years ago

Ok, thanks for your answer. I think I will try out different ways to define the negative-cells and see how that works out.