niaid / dsb

Normalize CITEseq Data
Other
63 stars 13 forks source link

CiteSeqCount on empty droplet whitelist #29

Closed EspressoKris closed 2 years ago

EspressoKris commented 3 years ago

Hi,

I am trying perform dsb normalisation on my ADTs. However, I am encountering an issue when it comes to perform CiteSeqCount on the whitelists of the empty droplets (excluded filtered barcodes from CellRanger). Basically, the analysis is stuck at testing cell barcode collapsing thresholds. This happens on samples that have 2.5 to 3 million barcodes in their empty droplets.

Any suggestion? I may be doing something wrong.

Thanks

MattPM commented 3 years ago

Hi, so this a bit more of a CITEseq-count question upstream of dsb but I'm happy to help out.

Can you display the command you used for CiteSeqcount?

So I understand, you ran Cell Ranger on the RNA data, from that used the filtered output to get cell containing droplet barcodes, then tried to run CiteSeqCount using a whitelist consisting of what you think are empty drops only without those cells? If so, I don't think that will work–the simplest way to use CITE-seq count is to not use a whitelist of barcodes: just specify a high number for the -cells parameter: try an order of magnitude higher than the number of droplets you expected to recover (or 10 x the number of cells estimated by Cell Ranger). You don't need to provide a whitelist. Basically tell CiteSeqCount that you expect a ton of cells and it will align the top ranked barcodes retaining all the cells and the major background signal. https://hoohm.github.io/CITE-seq-Count/Running-the-script/ -- see the -cells argument.

Also, what antibodies are you using? You may be able to align your ADTs with Cell Ranger which would be simpler (that is basically shown in the vignette).

EspressoKris commented 3 years ago

Sorry for the late response. You are right. I have been trying to run CiteSeqCount on a whitelist of barcodes which I expect to come from the empty beads fraction. The code I used is the following: CITE-seq-Count -R1 Sample_combined_R1_001.fastq.gz -R2 Sample_combined_R2_001.fastq.gz -t Tags.csv -cbf 1 -cbl 16 -umif 17 -umil 28 -cells Number_of_cells_in_whitelist -wl Empty_whitelist_Sample.csv -T 96 I originally thought this would be ideal as the Raw barcodes from CellRanger do include the filtered ones, which would artificially increase the noise.

So given that the number of 'good' cells range between 3-10k, should I just set -cells to around 1M (10-100x the number of filtered barcodes)? I will give it a try and see what happens.

For this experiment I used BioLegend TotalSeq-A antibodies. I tried to map the ADTs using CellRanger but the efficiency is much smaller than CiteSeqCount, which provide optimal results (which also make sense). Thanks

gt7901b commented 3 years ago

@MattPM A related question. What do you think is the optimal negative/empty_droplet number to use for dsb? negative from the HTODemux output is much smaller than the difference between raw and filtered feature bc ,i.e. background = setdiff(colnames(raw$Gene Expression), stained_cells). what do you think is the good number of background_drops? so that to use proper cutoffs? background_drops = md[md$prot_size > 1.5 & md$prot_size < 3 & md$ngene < 100, ]$bc negative_mtx_rawprot = as.matrix(prot[ , background_drops])

Thanks

MattPM commented 3 years ago

So given that the number of 'good' cells range between 3-10k, should I just set -cells to around 1M (10-100x the number of filtered barcodes)? I will give it a try and see what happens.

@kristiangu I would go with perhaps 100,000 instead of 1M. That will automatically align the top 1e5 barcodes and should retain the major background signal if you loaded 10K cells. For a robustness check you could try 1M on a single lane of 10X to test how different the output is but that will be overkill for the number of cells you expect to recover. 1M will contain a lot of barcodes that were not detected. The increased barcode length in the recent versions of 10x kits (compared to the older V2 kits for example) will have more "hypothetical" droplets that are barcodes that went completely undetected in the experiment. Increased length of barcode = more hypothetical droplets

MattPM commented 3 years ago

@gt7901b

If you want to take a deep dive into the differences between using hashing negative vs negatives defined from library size distribution from the aligned data and how using each impacts the actual normalized counts, take a look at the Supplementary Note of the preprint–it did not impact the normalized counts in the data we tested.

As this thread is alluding to, what the 'right' number of background droplets to use will depend on the experiment and the number of cells you expect to recover. If you are using Cell Ranger, I would use 10X the number of recovered cells from the raw output subset to not include cells or those with high mRNA as shown in the vignette on CRAN.

The exact cutoffs for mRNA content or library size for the empty drops depend on your data however the thresholds we show as a guide (e.g. less than 80 unique mRNA to be considered an empty drop) are a good stating point and work on most datasets.

@grothja @kristiangu Note that as long as you retain the major background peak, the normalized values will be very interpretable even if there is bin modality in the total library size distribution across droplets.

gt7901b commented 3 years ago

@MattPM Thanks for the insight. When I use HTODemux negative as empty droplets from filtered feature barcode matrix, I have ~1K vs 8K positive cells. Then I tried a hybrid approach, still use HTODemux singlet as positive from filtered feature barcode matrix, but use negative_mtx_rawprot processed from raw feature barcode matrix, now I got ~72K empty droplets. So the dsb function will be like cell_protein_matrix = pos_adt_matrix, empty_drop_matrix = negative_mtx_rawprot, It took longer to run and with Warning message in x - mu_u: “longer object length is not a multiple of shorter object length”. but compared to ~1K empty droplets. 72K empty droplets gave me greater range, for Antibody X, the range increased from [0,2] to [0,20]. Do you think it is a valid?

Another question is regarding the dead/dying cells, do you remove dead cells identified by RNA-seq data before dsb normalization or it does not matter? Thanks for your time

MattPM commented 3 years ago

@gt7901b The function should not take that much longer with different sized background matrix. The warning you're reporting sounds like your matrices are not the same dimensions--check if your background and cell matrix have the same number of rows? (same number of proteins in the same positions) all.equal(nrow(pos_adt_matrix), nrow(negative_mtx_rawprot))

^That is important so you want to look into what is causing that error, if you can't find it I can help if you send your data.

The range you're reporting ~0-20 is typical for dsb values.

Overall, the hybrid approach you're describing sounds better to me.

FYI Re: "When I use HTODemux negative as empty droplets from filtered feature barcode matrix" This isn't listed in any of Seurat's tutorials, but if you run HTODemux on only the filtered barcodes, these are highly enriched for cells only. The function can improve somewhat if you use more drops (i.e. the top 50-80k barcodes based on total protein counts for each 10X lane) because the function is trying to find empty / negative drops with low HTO staining (by setting K = 1+ nHTO in the k medoids function). If you only run HTODemux on what CellRanger thinks is a cell, then you will have a tiny number of negatives detected by HTODemux (like the 1000 you are reporting--this is too few negatives I think). However, what you did with your hybrid approach is definitely fine, all you're trying to do is get the major background population which should be really clear if you plot the total log protein counts across all droplets (from the raw matrix) with non-zero ADT counts as in the tutorial.

gt7901b commented 3 years ago

many thanks, @MattPM . “longer object length is not a multiple of shorter object length” is caused by difference of one Ab. Because one of the negative control Ab has zero counts in negative_mtx_rawprot (there are two negative ctrl Abs in all), I remove that Ab from negative_mtx_rawprot. despite warning, the dsb still finished normalization. Is that fine?

MattPM commented 3 years ago

@gt7901b No the simplest thing to do if there are proteins without data, even if control within the background drop matrix you must remove them from both the cell and background matrix. That error means you’re correcting the mean and variance from the wrong antibodies so it’s important.

Thanks very much for bringing this to my attention, it’s a corner case but I will make the function stop if the two input matrices do not have matched rows; that’s quite important.

gt7901b commented 3 years ago

@MattPM thanks for the advice. I fixed that and it works very fast. What about dead/dying cells? Will removing dead cells before dsb affect the result?

MattPM commented 3 years ago

@gt7901b It’s better to remove low quality / dead cells before running the function, that’s basically the idea in the vignette; run QC with ADT and mRNA based metrics to get the high quality cells first then normalize.