Open atarashansky opened 3 years ago
Definitely love to have more contributions! I believe @LuckyMD and @giovp are quite keen on having this in the library.
Excited to see your benchmarks!
Side note: I'd definitely recommend looking into using joblib
instead of multiprocessing
for parallelization. It's a bit more simple to use, is much better about not oversubscribing your resources, and copying less data.
Thanks for the suggestion, @ivirshup! I'll take a look into that.
Would this be something that goes into sce.externals
?
I think there's a pretty strong desire on the team to see something like this in the main API, if you'd like to contribute this there.
Gotcha! I'll prioritize getting the benchmarks up and then I'll need some guidance on how to organize it to fit in scanpy's codebase. Thanks!
agreed, this would be super cool!
totally agree! would be really cool to have!
I agree that this should not belong to 'external' but to the main API.
Also, I would not be initially concerned about performance. Having the tool first is more important at the moment. We can later see how can be optimized.
I looked at a few genes and it looks like I'm getting pretty similar residuals compared to the R implementation. Something weird is going on with the genes in the last couple images so I'm currently trying to figure that before generating more thorough benchmarks. (I clip negative values to zero to preserve sparsity structure)
Think I fixed it.
Similarity of cluster assignments based on the R and python corrected expressions is 0.9 (sklearn.metrics.v_measure_score
).
Correlations between the top corrected gene expressions for the top 3000 varying genes between python and R:
Although the correlations are all close to 1.0, if you look at the scatter plot of all non-zero gene expressions, there seems to be some systematic difference in the linear shifting/scaling. Is this worth figuring out?
Hi @atarashansky , looks really good! Have couple of questions and comments:
Re parallelization, @ivirshup already mentioned joblib. For reference, I want to share here a framework developed by @michalk8 that uses joblib for parallelizing a function over a collection of elements.
It is already used in 2 scanpy sisters packages: https://github.com/theislab/cellrank and https://github.com/theislab/scvelo , as well as a third one that is coming out next week.
Implementation in theislab/scvelo https://github.com/theislab/scvelo/blob/1659cc8e00a45fcf87cd80a7013aae5531744613/scvelo/tools/dynamical_model.py#L1172
Implementation in theislab/cellrank https://github.com/theislab/cellrank/blob/master/cellrank/ul/_parallelize.py
I feel this settings is very similar to the one above, so maybe useful to check it out.
Let me know what you think !
Thanks for pointing me towards the joblib parallelization. I’ll work on applying it here.
Notebook incoming! Stay tuned.
Here’s the notebook.
@atarashansky thanks a lot, looks really good indeed! I'll have time to have a look on Friday and will get back to you then! and thanks for the notebooks!
@atarashansky sorry for getting back to you late! I played around with the implementation, thanks a lot for the notebooks!
few questions before opening the PR:
looking forward to hear what you think! thank you!
This looks super cool! One minor thing is the bandwidth adjustment. Ideally this shouldn't require manual tweaking. Is this the case?
When I was playing around with Nebulosa (https://github.com/powellgenomicslab/Nebulosa) to reimplement it in python, I noticed that both bandwidth algorithms (scott
and silverman
) in scipy's gaussian_kde are really horrible. Bandwidths estimated by the ks
package in R (https://cran.r-project.org/web/packages/ks/ks.pdf, which is what Nebulosa uses) were always superior.
Sorry for the late reply, the notifications for this thread got sent to my spam folder.
@giovp
obs
DataFrame.statsmodels
can do regression on lots of different models, but from the source paper it sounds like using Poisson was simplest/fastest and did not affect the results too much when compared to negative binomial regression. I think parameter estimation for other models might be a bit more involved.@gokceneraslan
SCTransform does automatic bandwidth selection using the scott algorithm, which I am implementing with gaussian_kde
. Although, I could change this to use bandwidth selection using KDEpy
’s implementation, since I use their ISJ bandwidth algorithm for another step (the bandwidth algorithms used at different steps is following the methods write-up in the original paper).
I don't know how "easy" the density estimation problem here is, but I would avoid scott's whenever I can (e.g. https://aakinshin.net/posts/kde-bw/).
hi @atarashansky , sorryf or late reply again. I'm not sure about kde issue raised by @gokceneraslan , do you think it's worth to look into it? I think they use the same in original implementation though right?
For the rest, I think the two main points that could be addressed before PR are:
_1
that is done now) and also returning same set of params. This would make it easier for users who are already familiar, and also for us for benchmarking.with these two points addressed, I think it's good to start a PR, I'd be happy to review and also help in setting up tests/benchmarks (which like you mentioned, would be probably best to just test against the result of original implementation).
Let me know for anything else, exciting for this!
Hi @atarashansky and everyone following this interesting discussion!
I just found this issue after posting quite a related PR yesterday (#1715) that came out of a discussion from the end of last year (berenslab/umi-normalization#1), and wanted to leave a note about that relation here:
In my PR, I implement normalization by analytic Pearson residuals based on a NB offset model, which is an improved/simplified version of the scTransform model that does not need regularization by smoothing anymore... This brings some theoretical advantages and we found it works well in practice (details in this preprint with @dkobak ).
One of the differences remaining between the two is how the overdispersion theta
is treated (scTransform: fitted per gene, analytical residuals: fixed to one theta
for all genes based on negative controls). I think fixing theta
like that makes a lot of sense, but also thought about adding a function that learns a global theta
from the data. With some modifications that could be another fruitful use-case of your theta.ml
python implementation @atarashansky.
Also, I'm curious about the clipping of the Pearson residuals to [0, sqrt(n/30)]
in your method. We also find that clipping is an important step for obtaining sensible analytical residuals, and I recently though a bit about motivating different cutoffs - so I'd be interested to learn what is behind your choice of sqrt(n/30)
!
Looking forward to you thoughts on this :) Jan.
@jlause Interesting work! It would indeed be nice to avoid having to learn bandwidths altogether.
What would be the procedure for learning global theta from the data? Would you just flatten the expression matrix into one vector?
With regards to the clipping, I turned my brain off and copied the Seurat implementation as much as possible. sqrt(n/30)
was the default parameter used by the SCTransform wrapper in Seurat. I also removed negative values to preserve sparsity structure of the data. Sorry I couldn't provide any insight about this!
Oh, scTransform uses sqrt(n/30)
by default? Interesting. If I remember correctly, the paper says they use sqrt(n)
...
If I remember correctly, the SCTransform vst
method uses sqrt(n)
by default but the SCTransform
wrapper in Seurat uses sqrt(n/30)
.
What would be the procedure for learning global theta from the data? Would you just flatten the expression matrix into one vector?
Regarding this -- yes, that's what we thought to do. Do you think it makes sense? However, for computational efficiency, I would threshold the genes by expression and take e.g. 1000 or even 100 genes with the highest average expression (for the purpose of estimating theta). Lowly-expressed genes don't really constrain theta much, it's highly-expressed ones that do.
Hi guys, I've been following this thread and it's been quiet recently :) wondering if there's any updates on incorporating ScTransform on Scanpy. Thanks!! 🙏🏼
Found it after I wrote it.😄 Here is my code for those who want to try 'R' SCT.
import anndata2ri
from rpy2.robjects.packages import importr
from rpy2.robjects import r, pandas2ri
import numpy as np
anndata2ri.activate()
pandas2ri.activate()
def run_sctransform(adata, layer=None, **kwargs):
if layer:
mat = adata.layers[layer]
else:
mat = adata.X
# Set names for the input matrix
cell_names = adata.obs_names
gene_names = adata.var_names
r.assign('mat', mat.T)
r.assign('cell_names', cell_names)
r.assign('gene_names', gene_names)
r('colnames(mat) <- cell_names')
r('rownames(mat) <- gene_names')
seurat = importr('Seurat')
r('seurat_obj <- CreateSeuratObject(mat)')
# Run
for k, v in kwargs.items():
r.assign(k, v)
kwargs_str = ', '.join([f'{k}={k}' for k in kwargs.keys()])
r(f'seurat_obj <- SCTransform(seurat_obj,vst.flavor="v2", {kwargs_str})')
# Extract the SCT data and add it as a new layer in the original anndata object
sct_data = np.asarray(r['as.matrix'](r('seurat_obj@assays$SCT@data')))
adata.layers['SCT_data'] = sct_data.T
sct_data = np.asarray(r['as.matrix'](r('seurat_obj@assays$SCT@counts')))
adata.layers['SCT_counts'] = sct_data.T
return adata
adata.layers["data"] = adata.X.copy()
adata = run_sctransform(adata, layer="counts")
R[write to console]: Running SCTransform on assay: RNA
R[write to console]: Place corrected count matrix in counts slot
R[write to console]: Set default assay to SCT
adata
layers: 'counts', 'data', 'SCT_data', 'SCT_counts'
Please use this code and your data with caution.
Thank you everybody, and particularly @Moloch0, for contributing to this discussion. Your run_sctransform
function is exactly what I needed for my analyses.
A little note in case anyone else might run into the same issue: my original AnnData contained a small set of genes/variables present in very few cells, which were filtered out during SCTransform normalisation. This prevented the SCT layers from being added to adata due to dimension mismatch.
To address this, I added a simple subsetting script to the function between normalisation and layer addition, as follows:
#[...]
r(f'seurat_obj <- SCTransform(seurat_obj,vst.flavor="v2", {kwargs_str})')
# Prevent partial SCT output because of default min.genes messing up layer addition
r('diffDash <- setdiff(rownames(seurat_obj), rownames(mat))')
r('diffDash <- gsub("-", "_", diffDash)')
r('diffScore <- setdiff(rownames(mat), rownames(seurat_obj))')
filtout_genes = svconvert(r('setdiff(diffScore, diffDash)'))
filtout_indicator = np.in1d(adata.var_names, filtout_genes)
adata = adata[:, ~filtout_indicator]
# Extract the SCT data and add it as a new layer in the original anndata object
#[...]
Hope that comes in handy for anyone else facing this issue!
sc.tools
?sc.pl
?sc.external.*
?I recently ported SCTransform from R into python. Any interest in getting it onto Scanpy?
The original paper is here. It's a variance-stabilizing transformation that overcomes some key drawbacks of previous, similar methods (e.g. overfitting caused by building regression models from individual genes as opposed to groups of similar genes). It also eliminates the need for pseudocounts, log transformations, or library size normalization.
My code is here.
Implementation notes (from the SCTransformPy README):
statsmodels
package and parallelized withmultiprocessing
.KDEpy
package.theta
, using MLE was translated from thetheta.ml
function in R.Anecdotally, it produces very similar results to the R implementation, though the code itself is still a little rough around the edges. I also have to do more formal quantitative benchmarking to ensure results are similar to those of the original package.
I thought I'd gauge interest here prior to working on making it
scanpy
-ready.