niaid / dsb

Normalize CITEseq Data

Using emptyDrops on RNA to define my negative matrix #8

Closed sorjuela closed 4 years ago

sorjuela commented 4 years ago

Hi @MattPM, it's me again.

So I was using emptyDrops on the RNA to define my negative matrix, under the assumption that if a droplet is considered empty in RNA, the level of ADT in this droplet would be noise, right?

I wanted to show you what this looks like for me. The plot shows total ADT counts on x and total RNA counts on y. Before plotting, I removed barcodes with total ADT counts < 10. Then I ran emptyDrops on the RNA, which gives 3 categories: green (NA FDR) is all barcodes with <100 total counts, red is barcodes that are considered empty, and blue is barcodes with a cell in them. My problem with this is that the filtering is fine for RNA (green and red points would be removed), but if I remove those same cells in ADT I'm losing cells with a good total count (I think). I was wondering if this is normal in the data you have seen, and whether this peak of cells with low RNA and high-ish ADT should in fact be considered background.

[Image: RNAvsADT_nUMI]
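The three categories described above can be sketched in a few lines. This is a minimal illustration in Python rather than the actual emptyDrops call: `classify_barcodes` and `negative_matrix` are hypothetical helper names, the FDR cutoff of 0.01 is an assumption (the thread does not state one), and the ADT matrix is assumed to have proteins as rows and barcodes as columns.

```python
import numpy as np

def classify_barcodes(fdr, fdr_cutoff=0.01):
    """Label each barcode from an emptyDrops-style FDR vector.

    NaN FDR  -> "untested" (too few RNA counts to be tested; green in the plot)
    FDR <= c -> "cell"     (blue in the plot)
    otherwise -> "empty"   (red in the plot)
    The cutoff is an assumption, not a value taken from the thread.
    """
    fdr = np.asarray(fdr, dtype=float)
    labels = np.full(fdr.shape, "empty", dtype="U8")
    labels[np.isnan(fdr)] = "untested"
    labels[~np.isnan(fdr) & (fdr <= fdr_cutoff)] = "cell"
    return labels

def negative_matrix(adt_counts, labels, min_adt_total=10):
    """Keep ADT columns for non-cell barcodes with at least min_adt_total ADT counts."""
    adt_counts = np.asarray(adt_counts)
    keep = (labels != "cell") & (adt_counts.sum(axis=0) >= min_adt_total)
    return adt_counts[:, keep]
```

Whether the "untested" barcodes, the "empty" barcodes, or both go into the negative matrix is exactly the judgment call being asked about here; the sketch keeps both.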

Thanks for your help, Stephany

MattPM commented 4 years ago

> but if I remove those same cells in ADT I'm losing cells with a good total count (I think).

Can you explain which part of the plot are the cells you are concerned with losing?

In a given library, there are usually a couple of peaks in the UMI distribution for proteins and mRNA, and the two may not even be highly correlated across all cells, depending on the cell type (dendritic cells have a ton of RNA UMIs, for example). The negative / empty drops can also vary in ADT, since different antibodies contribute different amounts of background. I wouldn't conclude much from looking at this alone. Do you happen to know where you are at for ADT sequencing saturation?

It might be worth doing one round of dsb normalization with a conservative estimate of background (you could even just use the green cells above) to get a feel for how it performs on the data compared to, e.g., CLR, and to see whether cells that are negative for a protein are centered at 0.
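To make the "centered at 0" check concrete, here is a rough sketch of the background-rescaling idea: log-transform the counts, then center and scale each protein by the mean and standard deviation of the same protein in the empty drops. It is written in Python for illustration and is not the dsb package's own function; `ambient_rescale` is a hypothetical name, the pseudocount of 10 is an assumption, and the package's additional per-cell denoising step is left out. Matrices are assumed to be proteins by barcodes.

```python
import numpy as np

def ambient_rescale(cell_adt, empty_adt, pseudocount=10.0):
    """Rescale log ADT counts by per-protein background statistics.

    cell_adt, empty_adt: arrays of shape (n_proteins, n_barcodes).
    Returns values in units of background standard deviations, so a
    protein a cell does not express should land near 0.
    """
    cell_log = np.log(np.asarray(cell_adt, dtype=float) + pseudocount)
    empty_log = np.log(np.asarray(empty_adt, dtype=float) + pseudocount)
    # Per-protein mean and sample standard deviation in the empty drops
    mu = empty_log.mean(axis=1, keepdims=True)
    sd = empty_log.std(axis=1, ddof=1, keepdims=True)
    return (cell_log - mu) / sd
```

Plotting the rescaled values per protein across clusters then shows directly whether the negative populations sit around 0.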

sorjuela commented 4 years ago

Hi @MattPM,

> Can you explain which part of the plot are the cells you are concerned with losing?

I'm not so much concerned with losing cells as with using this as background: the orange peak at log10(ADT) > 2 in the green area (log10(RNA) < 2), because it has the same ADT counts as the other peak at log10(RNA) > 2.

> Do you happen to know where you are at for ADT sequencing saturation?

No not really.

Thank you so much for your help!

MattPM commented 4 years ago

I think that is the background. There is almost no RNA in those drops. Take a look at the raw data from the CITE-seq Seurat vignette if you want to get a feel for some other data (look at B cell and T cell lineage proteins in monocytes, for example: the counts are nowhere near zero, and that is the background we are trying to normalize out). I'd try normalizing, clustering (maybe with proteins if that makes sense, not sure what cells you have), and seeing what the distributions of all proteins look like across the clusters.

Again, it is hard to tell globally just by looking at nUMI plots. If you want to QC more, you can take the drops that are theoretically negative and, for each protein, calculate the 25th, 50th, and 75th percentiles and compare the empty drops with the cells, but I don't think that is necessary.
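The per-protein percentile comparison could look something like the following. This is a small Python sketch with hypothetical function names, again assuming proteins-by-barcodes count matrices.

```python
import numpy as np

def protein_quartiles(adt_counts):
    """Return an (n_proteins, 3) array of the 25th, 50th, and 75th percentiles."""
    counts = np.asarray(adt_counts, dtype=float)
    return np.percentile(counts, [25, 50, 75], axis=1).T

def compare_quartiles(empty_adt, cell_adt, protein_names):
    """Map each protein name to its (empty-drop quartiles, cell quartiles)."""
    q_empty = protein_quartiles(empty_adt)
    q_cell = protein_quartiles(cell_adt)
    return {
        name: (q_empty[i], q_cell[i])
        for i, name in enumerate(protein_names)
    }
```

For a protein expressed on only some of the cells, the cell quartiles should pull away from the empty-drop quartiles at the upper end; for a protein that is absent, the two sets should look similar.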

FWIW, you may not be very close to saturation on the protein sequencing. These are the sums (nUMI ADT) of raw protein data for some T cells (no negatives, just singlet cells) from a library closer to saturation:

[Image: hist_sum]

As you increase the sequencing depth you get better estimation of proteins on your cells of course, but also the background. The normalization should still work.