Comparison to linear regression

pavlin-policar commented 4 years ago

Hi, first of all, thanks for the interesting work and useful package. I've been playing around with this in Seurat and have found it immensely useful. However, my primary environment is Python, and this doesn't exist there yet, so I thought I would try a simple linear regression instead of the binomial regression since that's super simple to implement.

Curiously enough, the resulting embeddings I got using regular linear regression were visually very similar to the ones I got through sctransform. Having gone through your paper, you don't seem to mention this but focus on the comparison between log-transform and CPM normalization. Did you do any comparisons with linear regression? I'm curious, are there any really bad pitfalls to using linear regression alone that I'm not seeing (I mean other than the fact that the binomial distribution is much better suited to the data)? I can do the same batch-effect removal by adding a variable indicating batch id. I don't need any of the complicated regularization, and the results I'm getting are fairly similar while running much faster.

I'm sure this was your starting point for sctransform and I'm interested if there's anything inherently wrong with using linear regression. I can scale that to larger data sets, while sctransform runs for hours, linear regression is done in a matter of minutes. I'm just curious to hear your thoughts on this.

ChristophH commented 4 years ago

Hi, For medium-high to highly expressed genes the constant variance assumption does not hold. As a result standard linear regression would leave larger residuals for cells with lots of UMI compared to cells with few UMI. Depending on the downstream analysis steps this might not always be a problem, but I suggest that you take a dataset that you know well and compare sctranform and standard linear regression w.r.t. variance explained by cell UMI count.

Also, by not regularizing you risk regressing out too much. For lowly expressed genes the model parameters (intercept, slope) will have large uncertainty due to the limited number of non-zero observations. We obtain more robust estimates by essentially pooling information across genes.

I agree that sctransform is somewhat slow. Since the future package does not enable multiprocess in RStudio by default anymore, many people are stuck with just one CPU thread. Another way to speed things up is to use fewer cells for the model-building step (n_cells parameter to vst())

pavlin-policar commented 4 years ago

Hi, thanks for the quick reply! Yes, that makes sense, I guess I'll just have to play around with it a bit more.

Another way to speed things up is to use fewer cells for the model-building step...

Here, I suppose you mean that to fit the initial regressions for each gene separately, you don't use all the genes but use a smaller sample of e..g 2000 genes, then you do the information pooling with kernel regression. Is that right?

One more thing that popped into my mind. This approach allows us to elegantly address batch effects, but do you run into any issues when trying to analyze one large data set and a small one at the same time? I mean, if one data set contains e.g. 50k samples and the other contains 1k samples, then it seems to me that the parameter values might be skewed by the larger data set. Any thoughts?

satijalab / sctransform

Comparison to linear regression #43