Any interest in getting SCTransform up on Scanpy?

atarashansky commented 3 years ago

[ ] Additional function parameters / changed functionality / changed defaults?
[ ] New analysis tool: A simple analysis tool you have been using and are missing in sc.tools?
[ ] New plotting function: A kind of plot you would like to seein sc.pl?
[ ] External tools: Do you know an existing package that should go into sc.external.*?
[x] Other?

I recently ported SCTransform from R into python. Any interest in getting it onto Scanpy?

The original paper is here. It's a variance-stabilizing transformation that overcomes some key drawbacks of previous, similar methods (e.g. overfitting caused by building regression models from individual genes as opposed to groups of similar genes). It also eliminates the need for pseudocounts, log transformations, or library size normalization.

My code is here.

Implementation notes (from the SCTransformPy README):

Poisson regression is done using the statsmodels package and parallelized with multiprocessing.
Improved Sheather & Jones bandwidth calculation is implemented by the KDEpy package.
Estimating the negative binomial dispersion factor, theta, using MLE was translated from the theta.ml function in R.
Pearson residuals are automatically clipped to be non-negative. This ensures that sparsity structure can be preserved in the data. Practically, the results do not change much when allowing for dense, negative values.

Anecdotally, it produces very similar results to the R implementation, though the code itself is still a little rough around the edges. I also have to do more formal quantitative benchmarking to ensure results are similar to those of the original package.

I thought I'd gauge interest here prior to working on making it scanpy-ready.

ivirshup commented 3 years ago

Definitely love to have more contributions! I believe @LuckyMD and @giovp are quite keen on having this in the library.

Excited to see your benchmarks!

Side note: I'd definitely recommend looking into using joblib instead of multiprocessing for parallelization. It's a bit more simple to use, is much better about not oversubscribing your resources, and copying less data.

atarashansky commented 3 years ago

Thanks for the suggestion, @ivirshup! I'll take a look into that.

Would this be something that goes into sce.externals?

ivirshup commented 3 years ago

I think there's a pretty strong desire on the team to see something like this in the main API, if you'd like to contribute this there.

atarashansky commented 3 years ago

Gotcha! I'll prioritize getting the benchmarks up and then I'll need some guidance on how to organize it to fit in scanpy's codebase. Thanks!

LuckyMD commented 3 years ago

agreed, this would be super cool!

giovp commented 3 years ago

totally agree! would be really cool to have!

fidelram commented 3 years ago

I agree that this should not belong to 'external' but to the main API.

Also, I would not be initially concerned about performance. Having the tool first is more important at the moment. We can later see how can be optimized.

atarashansky commented 3 years ago

I looked at a few genes and it looks like I'm getting pretty similar residuals compared to the R implementation. Something weird is going on with the genes in the last couple images so I'm currently trying to figure that before generating more thorough benchmarks. (I clip negative values to zero to preserve sparsity structure)

atarashansky commented 3 years ago

Think I fixed it.

Similarity of cluster assignments based on the R and python corrected expressions is 0.9 (sklearn.metrics.v_measure_score).

Correlations between the top corrected gene expressions for the top 3000 varying genes between python and R:

Although the correlations are all close to 1.0, if you look at the scatter plot of all non-zero gene expressions, there seems to be some systematic difference in the linear shifting/scaling. Is this worth figuring out?

giovp commented 3 years ago

Hi @atarashansky , looks really good! Have couple of questions and comments:

Is it the dataset in the original publication? Could you try it in a couple of other datasets (like the 3k ?). If you can make a notebook on how to run your implementation I could help with that.
I would also plot the slope and intercept estimate against mean exp, as in original pubblication.

Re parallelization, @ivirshup already mentioned joblib. For reference, I want to share here a framework developed by @michalk8 that uses joblib for parallelizing a function over a collection of elements.

It is already used in 2 scanpy sisters packages: https://github.com/theislab/cellrank and https://github.com/theislab/scvelo , as well as a third one that is coming out next week.

Implementation in theislab/scvelo https://github.com/theislab/scvelo/blob/1659cc8e00a45fcf87cd80a7013aae5531744613/scvelo/tools/dynamical_model.py#L1172

Implementation in theislab/cellrank https://github.com/theislab/cellrank/blob/master/cellrank/ul/_parallelize.py

I feel this settings is very similar to the one above, so maybe useful to check it out.

Let me know what you think !

atarashansky commented 3 years ago

Here are the fit parameters vs the geometric mean expression applied to the 33k pbmc dataset. Looks pretty similar!

Thanks for pointing me towards the joblib parallelization. I’ll work on applying it here.
Notebook incoming! Stay tuned.

atarashansky commented 3 years ago

Here’s the notebook.

giovp commented 3 years ago

@atarashansky thanks a lot, looks really good indeed! I'll have time to have a look on Friday and will get back to you then! and thanks for the notebooks!

giovp commented 3 years ago

@atarashansky sorry for getting back to you late! I played around with the implementation, thanks a lot for the notebooks!

few questions before opening the PR:

do you think it's worth it to allow users to add other variables (beside log_umi) ?
do you think it would be useful to add other models other than poisson?
there are other outputs provided by R implementation other than pearson residuals. Do you think it's worth to include them?
testing: how do you think it should be best tested? we thought about saving results from original implementation in R and test against those (as it's done for others seurat re-implementation like highly variable genes)

looking forward to hear what you think! thank you!

gokceneraslan commented 3 years ago

This looks super cool! One minor thing is the bandwidth adjustment. Ideally this shouldn't require manual tweaking. Is this the case?

When I was playing around with Nebulosa (https://github.com/powellgenomicslab/Nebulosa) to reimplement it in python, I noticed that both bandwidth algorithms (scott and silverman) in scipy's gaussian_kde are really horrible. Bandwidths estimated by the ks package in R (https://cran.r-project.org/web/packages/ks/ks.pdf, which is what Nebulosa uses) were always superior.

atarashansky commented 3 years ago

Sorry for the late reply, the notifications for this thread got sent to my spam folder.

@giovp

I think so! It’s not difficult to extend it to more latent variables. We could allow them to specify any column(s) in the obs DataFrame.
Hmm, I think statsmodels can do regression on lots of different models, but from the source paper it sounds like using Poisson was simplest/fastest and did not affect the results too much when compared to negative binomial regression. I think parameter estimation for other models might be a bit more involved.
I think that would be pretty straightforward. What outputs are you referring to, specifically?
I’ve been testing by computing correlations between the genes from the python and R implementations. You could also compare rank-ordering of cells by variance. Another approach might be to compare the output of downstream analysis methods (like clustering) to see if the results are similar, and compare to the output of unprocessed data as a negative control.

atarashansky commented 3 years ago

@gokceneraslan

SCTransform does automatic bandwidth selection using the scott algorithm, which I am implementing with gaussian_kde. Although, I could change this to use bandwidth selection using KDEpy’s implementation, since I use their ISJ bandwidth algorithm for another step (the bandwidth algorithms used at different steps is following the methods write-up in the original paper).

gokceneraslan commented 3 years ago

I don't know how "easy" the density estimation problem here is, but I would avoid scott's whenever I can (e.g. https://aakinshin.net/posts/kde-bw/).

giovp commented 3 years ago

hi @atarashansky , sorryf or late reply again. I'm not sure about kde issue raised by @gokceneraslan , do you think it's worth to look into it? I think they use the same in original implementation though right?

For the rest, I think the two main points that could be addressed before PR are:

using same nomenclature for inferred params as original implementation (so not like _1 that is done now) and also returning same set of params. This would make it easier for users who are already familiar, and also for us for benchmarking.
allow the creation of a design matrix so that users could pass any covariate (I'd say batch is the most important and probably used in general).

with these two points addressed, I think it's good to start a PR, I'd be happy to review and also help in setting up tests/benchmarks (which like you mentioned, would be probably best to just test against the result of original implementation).

Let me know for anything else, exciting for this!

jlause commented 3 years ago

Hi @atarashansky and everyone following this interesting discussion!

I just found this issue after posting quite a related PR yesterday (#1715) that came out of a discussion from the end of last year (berenslab/umi-normalization#1), and wanted to leave a note about that relation here:

In my PR, I implement normalization by analytic Pearson residuals based on a NB offset model, which is an improved/simplified version of the scTransform model that does not need regularization by smoothing anymore... This brings some theoretical advantages and we found it works well in practice (details in this preprint with @dkobak ).

One of the differences remaining between the two is how the overdispersion theta is treated (scTransform: fitted per gene, analytical residuals: fixed to one theta for all genes based on negative controls). I think fixing theta like that makes a lot of sense, but also thought about adding a function that learns a global theta from the data. With some modifications that could be another fruitful use-case of your theta.ml python implementation @atarashansky.

Also, I'm curious about the clipping of the Pearson residuals to [0, sqrt(n/30)] in your method. We also find that clipping is an important step for obtaining sensible analytical residuals, and I recently though a bit about motivating different cutoffs - so I'd be interested to learn what is behind your choice of sqrt(n/30)!

Looking forward to you thoughts on this :) Jan.

atarashansky commented 3 years ago

@jlause Interesting work! It would indeed be nice to avoid having to learn bandwidths altogether.

What would be the procedure for learning global theta from the data? Would you just flatten the expression matrix into one vector?

With regards to the clipping, I turned my brain off and copied the Seurat implementation as much as possible. sqrt(n/30) was the default parameter used by the SCTransform wrapper in Seurat. I also removed negative values to preserve sparsity structure of the data. Sorry I couldn't provide any insight about this!

dkobak commented 3 years ago

Oh, scTransform uses sqrt(n/30) by default? Interesting. If I remember correctly, the paper says they use sqrt(n)...

atarashansky commented 3 years ago

If I remember correctly, the SCTransform vst method uses sqrt(n) by default but the SCTransform wrapper in Seurat uses sqrt(n/30).

dkobak commented 3 years ago

What would be the procedure for learning global theta from the data? Would you just flatten the expression matrix into one vector?

Regarding this -- yes, that's what we thought to do. Do you think it makes sense? However, for computational efficiency, I would threshold the genes by expression and take e.g. 1000 or even 100 genes with the highest average expression (for the purpose of estimating theta). Lowly-expressed genes don't really constrain theta much, it's highly-expressed ones that do.

hurleyLi commented 3 years ago

Hi guys, I've been following this thread and it's been quiet recently :) wondering if there's any updates on incorporating ScTransform on Scanpy. Thanks!! 🙏🏼

Moloch0 commented 1 year ago

Found it after I wrote it.😄 Here is my code for those who want to try 'R' SCT.

import anndata2ri
from rpy2.robjects.packages import importr
from rpy2.robjects import r, pandas2ri
import numpy as np

anndata2ri.activate()
pandas2ri.activate()

def run_sctransform(adata, layer=None, **kwargs):
    if layer:
        mat = adata.layers[layer]
    else:
        mat = adata.X

    # Set names for the input matrix
    cell_names = adata.obs_names
    gene_names = adata.var_names
    r.assign('mat', mat.T)
    r.assign('cell_names', cell_names)
    r.assign('gene_names', gene_names)
    r('colnames(mat) <- cell_names')
    r('rownames(mat) <- gene_names')

    seurat = importr('Seurat')
    r('seurat_obj <- CreateSeuratObject(mat)')

    # Run
    for k, v in kwargs.items():
        r.assign(k, v)
    kwargs_str = ', '.join([f'{k}={k}' for k in kwargs.keys()])
    r(f'seurat_obj <- SCTransform(seurat_obj,vst.flavor="v2", {kwargs_str})')

    # Extract the SCT data and add it as a new layer in the original anndata object
    sct_data = np.asarray(r['as.matrix'](r('seurat_obj@assays$SCT@data')))
    adata.layers['SCT_data'] = sct_data.T
    sct_data = np.asarray(r['as.matrix'](r('seurat_obj@assays$SCT@counts')))
    adata.layers['SCT_counts'] = sct_data.T
    return adata

adata.layers["data"] = adata.X.copy()

adata = run_sctransform(adata, layer="counts")

R[write to console]: Running SCTransform on assay: RNA
R[write to console]: Place corrected count matrix in counts slot
R[write to console]: Set default assay to SCT

adata
    layers: 'counts', 'data', 'SCT_data', 'SCT_counts'

Please use this code and your data with caution.

cpantea commented 1 year ago

Thank you everybody, and particularly @Moloch0, for contributing to this discussion. Your run_sctransform function is exactly what I needed for my analyses.

A little note in case anyone else might run into the same issue: my original AnnData contained a small set of genes/variables present in very few cells, which were filtered out during SCTransform normalisation. This prevented the SCT layers from being added to adata due to dimension mismatch.

To address this, I added a simple subsetting script to the function between normalisation and layer addition, as follows:

    #[...]
    r(f'seurat_obj <- SCTransform(seurat_obj,vst.flavor="v2", {kwargs_str})')

    # Prevent partial SCT output because of default min.genes messing up layer addition
    r('diffDash <- setdiff(rownames(seurat_obj), rownames(mat))')
    r('diffDash <- gsub("-", "_", diffDash)')
    r('diffScore <- setdiff(rownames(mat), rownames(seurat_obj))')
    filtout_genes = svconvert(r('setdiff(diffScore, diffDash)'))
    filtout_indicator = np.in1d(adata.var_names, filtout_genes)
    adata = adata[:, ~filtout_indicator]

    # Extract the SCT data and add it as a new layer in the original anndata object
    #[...]

Hope that comes in handy for anyone else facing this issue!

scverse / scanpy

Any interest in getting SCTransform up on Scanpy? #1643