satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

Can we do cell filtering using sample-specific threshold (for example using percentiles)? #3396

Closed yilingmialiu closed 4 years ago

yilingmialiu commented 4 years ago

Hi:

I'm always struggling to find a less subjective way to do cell filtering. Say, if I have 14 samples in an experiment, can I set sample-specific thresholds for quality metrics like nFeature_RNA, mitochondrial %, etc.? Also, any thoughts on the idea of setting thresholds based on the percentiles of each sample instead of a universal threshold across the experiment?
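To make the percentile idea concrete, what I have in mind is something like this (a rough sketch in base R, assuming a merged object sobj where orig.ident identifies which sample each cell came from):

## Per-sample 1st/99th percentiles of nCount_RNA
tapply(sobj$nCount_RNA, sobj$orig.ident, quantile, probs = c(0.01, 0.99))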

Thanks, Mia

akramdi commented 4 years ago

Hi Mia,

I've been thinking about sample-specific thresholds for a while. Here's how I proceed:

Sometimes you may come across a dataset with very good coverage; in that case, a distribution-based lower threshold may remove good-quality cells. To avoid this, I don't set a lower threshold at all if the coverage distribution already starts at 1000 UMIs or more.

Here's my code:


minCov <- 1000 # if a sample already has good coverage (min nCount_RNA >= minCov), don't set a lower threshold for nCount
if (min(sobj$nCount_RNA) >= minCov) {
  countLOW <- min(sobj$nCount_RNA)
} else {
  countLOW <- quantile(sobj$nCount_RNA, prob = 0.01)
}
countHIGH <- quantile(sobj$nCount_RNA, prob = 0.99)
featureLOW <- quantile(sobj$nFeature_RNA, prob = 0.01)

## Subset to cells passing all thresholds
sobj <- subset(sobj, subset = nFeature_RNA > featureLOW & nCount_RNA > countLOW & nCount_RNA < countHIGH & percent.mt < 20)

Hope it helps! Amira

yilingmialiu commented 4 years ago

Thank you so much for your kind advice @akramdi. Sample-specific filtering according to nFeature_RNA is what I'm doing now.

benjytan88 commented 4 years ago

Hi Mia,

I have been using sample-specific thresholds since the beginning.

For perc.mito, I'm a bit more stringent and used 10% as my cutoff. Anything above that is removed.

For nCounts and nFeatures, I adopted a statistical approach where I accept all cells that lie within the mean +/- 2 SD of the distribution. Here's my code:

## Get filtering parameters
count.max <- round(mean(so$nCount_RNA) + 2 * sd(so$nCount_RNA), digits = -2)
count.min <- round(mean(so$nCount_RNA) - 2 * sd(so$nCount_RNA), digits = -2)
feat.max <- round(mean(so$nFeature_RNA) + 2 * sd(so$nFeature_RNA), digits = -2)
feat.min <- round(mean(so$nFeature_RNA) - 2 * sd(so$nFeature_RNA), digits = -2)

## Clamp minimum parameters to 0 if negative
count.min <- max(count.min, 0)
feat.min <- max(feat.min, 0)

## Filter cells
newSO <- subset(so, subset = nFeature_RNA > feat.min & nFeature_RNA < feat.max & nCount_RNA < count.max & nCount_RNA > count.min & perc.mito < 10)

Hope it helps! Cheers!

yilingmialiu commented 4 years ago

@nicodemus88 Thank you so much for the reply. Sorry, maybe this is a dumb question, but did you use a global count.max, count.min, feat.max, feat.min for all samples? That is, is so an object containing all the samples you have?

I did use a very relaxed threshold for percent.mito, and I also tried to filter out cells with very high ribosomal expression, since that's always the problem child that keeps popping up in PCs and DGE results.
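(For reference, the percent.mito and ribo_perc columns used in the code below can be computed with Seurat's PercentageFeatureSet; the patterns here assume human gene symbols:)

sc$percent.mito <- PercentageFeatureSet(sc, pattern = "^MT-")
sc$ribo_perc <- PercentageFeatureSet(sc, pattern = "^RP[SL]")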

Below is the code I ended up using, based on advice from @akramdi:

## Filter out cells with high mitochondrial and ribosomal percentage
sc <- subset(sc, subset = ribo_perc <= 60)
sc <- subset(sc, subset = percent.mito <= 25)

## Filter out low/high counts with a sample-specific threshold
sc$cell_filtering <- 1
sc_meta <- sc@meta.data
sc_meta$orig.ident <- as.character(sc_meta$orig.ident)
sc_meta_new <- NULL

for (sample_id in unique(sc_meta$orig.ident)) {
  sc_meta_sub <- sc_meta[sc_meta$orig.ident == sample_id, ]
  minCov <- 1000 # if a sample has good coverage (>= minCov), don't set a lower threshold for nCount
  if (min(sc_meta_sub$nCount_RNA) >= minCov) {
    nCount_RNA_low <- min(sc_meta_sub$nCount_RNA)
  } else {
    nCount_RNA_low <- as.numeric(quantile(sc_meta_sub$nCount_RNA, prob = 0.05))
  }
  print(paste0("lower nCount threshold: ", nCount_RNA_low))
  nCount_RNA_high <- quantile(sc_meta_sub$nCount_RNA, 0.95)

  sc_meta_sub$cell_filtering[sc_meta_sub$nCount_RNA < nCount_RNA_low] <- 2
  sc_meta_sub$cell_filtering[sc_meta_sub$nCount_RNA > nCount_RNA_high] <- 2
  sc_meta_new <- rbind(sc_meta_new, sc_meta_sub)
}

## Sanity check: the loop reorders rows by sample, so match by cell barcode
## rather than relying on row order
identical(rownames(sc_meta_new), rownames(sc@meta.data))

sc$cell_filtering <- sc_meta_new[rownames(sc@meta.data), "cell_filtering"]

sc <- subset(sc, cell_filtering == 1)

Any thoughts on this approach?

benjytan88 commented 4 years ago

@Kittenlovefish No, I did not use a global value for all samples. What I did was load each sample individually and save them to a list, then loop the QC steps over the list. I only merge & integrate my samples after QC.
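A rough sketch of that workflow (the paths are placeholders, and filterCells is a hypothetical wrapper around the QC code above):

library(Seurat)

sample.dirs <- c(s1 = "path/to/sample1", s2 = "path/to/sample2")
so.list <- lapply(sample.dirs, function(d) CreateSeuratObject(counts = Read10X(d)))

## Run the per-sample QC on each object before combining
so.list <- lapply(so.list, filterCells)

## Merge only after QC
so <- merge(so.list[[1]], y = so.list[-1], add.cell.ids = names(so.list))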

I noticed you're running it on only 1 Seurat object, so I assume all your samples are merged into one object, is that right? Anyway, I don't see anything wrong with your approach. If the result looks OK to you (i.e. cells which you consider to be outliers and poor quality are removed), then it's fine. There's no hard rule saying which QC method is superior.

Cheers!

yilingmialiu commented 4 years ago

@nicodemus88 Yes, exactly. I merged them into one object at the very beginning. Just out of curiosity: when you merge & integrate your samples after QC, do you use the Seurat integration method (the FindIntegrationAnchors and IntegrateData functions) or simply put them together (my approach is similar to the latter)?

benjytan88 commented 4 years ago

@Kittenlovefish I merged them using Seurat, then integrated them using Harmony. There are many batch-correction / integration methods out there, but how well each of them works depends on the dataset. In my case, I find that Harmony integrates my data better than Seurat.
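The merge-then-Harmony route looks roughly like this (a sketch, assuming samples are identified by orig.ident and the merged object has already been normalized, with variable features found and data scaled):

library(harmony)

so <- RunPCA(so) # Harmony corrects the PCA embedding
so <- RunHarmony(so, group.by.vars = "orig.ident")
so <- RunUMAP(so, reduction = "harmony", dims = 1:30)
so <- FindNeighbors(so, reduction = "harmony", dims = 1:30)
so <- FindClusters(so)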

Refer to this paper for a benchmark of batch-correction methods: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1850-9

samuel-marsh commented 4 years ago

Hi @nicodemus88, @Kittenlovefish, and all,

Just a couple of things to add, but I think a lot of the relevant points have been covered here, and it really comes down to dataset-specific factors and, ultimately, personal preference. I tend to prefer setting cutoff values the same across all samples in a dataset: differences in sampling, treatment, genotype, etc. can cause changes that may be biologically relevant, so thresholding different samples differently may exclude certain cells that could be important.

First, in terms of the mito threshold, I would be cautious with stringent or "typical" values for mito cutoffs. It may be the case that a specific treatment, genotype, disease, etc. causes changes in mito % between samples that are actually relevant information. So I tend to be fairly liberal and base my cutoff on what the distribution looks like for each dataset as a whole, across all samples.
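A quick way to eyeball that distribution in Seurat (assuming the metric is stored as percent.mt in the metadata):

VlnPlot(sobj, features = "percent.mt", group.by = "orig.ident", pt.size = 0)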

Second, I think you have to be careful with statistics-based approaches (SD or MAD) for finding cutoff values, depending on the dataset. If you are working with a dataset of one cell type or very similar cell types, this approach may be fine, but it does not work well for datasets with cell types that have large variance between each other. Here is an example: snRNA-seq of human brain, 10X V3 data, using Seurat for import, filtering, and plotting, and LIGER for analysis and clustering. I calculated the SD as in the code above when the data was first imported, and then plotted that value on my final analysis data, which was filtered with my own thresholds.

[figure: nFeature_RNA distributions with the +2 SD cutoff overlaid, grouped by sample and by cell type]

The two plots above are grouped either by sample or by cell type. You can see that while the cutoff may seem OK (although still a little low) when grouped by sample, when you look at where that cutoff falls grouped by cell type, it is an entirely different story. The cutoff actually falls nearly in the middle of the excitatory neuron distribution; if I had thresholded there at the beginning, I would have lost a ton of those cells and some inhibitory neurons, although all the other cell types in the dataset would have been fine. The same obviously applies to low cutoffs as well. In addition to cell-type differences, this could also come into play with changes that occur as a result of treatment, genotype, disease, etc.
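You can reproduce this kind of check with violin plots and a horizontal line at the computed cutoff (a sketch; cell_type is a hypothetical metadata column holding your annotations):

cutoff <- mean(sobj$nFeature_RNA) + 2 * sd(sobj$nFeature_RNA)

## Same cutoff overlaid on per-sample vs. per-cell-type distributions
VlnPlot(sobj, features = "nFeature_RNA", group.by = "orig.ident", pt.size = 0) +
  ggplot2::geom_hline(yintercept = cutoff)
VlnPlot(sobj, features = "nFeature_RNA", group.by = "cell_type", pt.size = 0) +
  ggplot2::geom_hline(yintercept = cutoff)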

I think this is mainly one of those things where there isn't a one-size-fits-all answer, unfortunately, and a combination of subjective and objective decisions made in a dataset-specific way is probably best.

Hope this helps! Best, Sam

satijalab commented 4 years ago

I'll agree with Sam that while sample-specific thresholding is fine, it's not easy to think of a fool-proof automated way to set these thresholds. It's one of those thorny issues that unfortunately involves care and exploration of each individual dataset. Automated solutions (like +/- 2 SD) usually return reasonable levels, but there are potential caveats.