swolock / scrublet

Detect doublets in single-cell RNA-seq data
MIT License
131 stars 74 forks source link

input data with pearson residuals workflow #49

Open brainfo opened 1 year ago

brainfo commented 1 year ago

Hi, nice work! If I would to use pearson residuals to normalize and find hvgs, I'm wondering when use other external apis such as scrublet. currently in the doc:

Works best if the input is a raw (unnormalized) counts matrix from a single sample or a collection of similar samples from the same experiment. This function is a wrapper around functions that pre-process using Scanpy and directly call functions of Scrublet(). You may also undertake your own preprocessing, simulate doublets with scanpy.external.pp.scrublet_simulate_doublets(), and run the core scrublet function scanpy.external.pp.scrublet.scrublet().

Usually I did scrublet on separate samples' raw counts using the default wrapper (of log normalization, pca embedding and hvg finder). How do you suggest me to do? Should I try using scrublet after normal filtering and pearson residuals' norm and hvg steps?

btw, the pearson residuals do normalization considering the batches. Using scrublet means I need to use pearson residuals separately on each batch of dataset before concat them?

Best,

jlause commented 1 year ago

Hi!

I‘ve implemented the Pearson residuals in scanpy, but I don’t know scrublet so well. After quickly glancing over the paper I thought that in principle Scrublet should be able to operate on any PCA space to generate simulated doublets and then build up the kNN graph for doublet detection.

to preprocess from raw counts to PCA with Pearson residuals, you could use the wrapper at scanpy.experimental.pp.recipe_pearson_residuals, which will normalize, select HVGs and run PCA

However, I am not sure how to get the scrublet implementation here to use the PCA space directly - not sure if one can feed scanpy.external.pp.scrublet_simulate_doublets() with just PCA coordinates and continue from there... from the extended methods in the preprint it seems as if that should be possible: the simulated doublets are defined in terms of PCA coordinates there..

This is all I can say for now, happy to dig deeper next week if needed, but Maybe some scrublet maintainers can help here better than I can :)

best, jan

brainfo commented 1 year ago

With one single batch test data from 10x (We loaded 18,000 nuclei):

  1. Running Scrublet ###no log transform###
    filtered out 12596 genes that are detected in less than 3 cells
    normalizing counts per cell
    finished (0:00:00)
    extracting highly variable genes
    finished (0:00:00)
    --> added ###btw, these are not added in the end so need other ways to read out the values###
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
    normalizing counts per cell
    finished (0:00:00)
    normalizing counts per cell
    finished (0:00:00)
    Embedding transcriptomes using PCA... ###30 PCA###
    Detected doublet rate = 16.4%
    Estimated detectable doublet fraction = 47.5%
    Overall doublet rate:
    Expected   = 5.0%
    Estimated  = 34.5%
    Scrublet finished (0:00:04)

    simulated and observed doublet scores: image

  2. filtered out 12596 genes that are detected in less than 3 cells -> sc.experimental.pp.highly_variable_genes(top 2k) on raw layer -> sc.experimental.pp.normalize_pearson_residual to p_res layer -> sc.external.pp.scrublet_simulate_doublets on p_res layer -> sc.external.pp.scrublet using returned simulated adata:

    Running Scrublet
    Embedding transcriptomes using PCA... ###30 PCA###
    Detected doublet rate = 0.1%
    Estimated detectable doublet fraction = 82.0%
    Overall doublet rate:
    Expected   = 5.0%
    Estimated  = 0.2%
    Scrublet finished (0:00:10) 

    image This dataset has 15 columns that have less than 1k counts and whether to remove them doesn't affect the doublet detecting results much.

I think scrublet is sensitive to embedding and in your benckmarking, pearson residuals with standard PCA could present small, distinct subpopulations better, thus (may) increase the detected neotypic doublets. But that's not the case here.

btw, my colleague who's using r, said the doublets finder works well after SCTransform in r. I will test this on this sample next week probably.

Or just something wrong with my preprocessing. Or I just don't need this since default scrublet works on my data.