satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

CLR transform appears incorrect #2624

Closed: scharch closed this issue 4 years ago

scharch commented 4 years ago

As noted in #1268 and #2159, a standard CLR transformation should result in negative values. The log1p function appears to be incorrect here, as it returns log(1 + (x/geomean(x))) when, to the best of my understanding, the pseudocount should actually be included in the ratio, i.e. log((1+x)/geomean(x)). This is important, because the negative binomial used to model the CLR-transformed HTO values won't fit the (manual) distribution shown in #1268. Is there a theoretical reason that the function is set up the way it is, or does it just happen to work?
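To make the distinction concrete, here is a minimal sketch of the two variants being discussed (function names are illustrative, not Seurat's; for simplicity, zeros are handled by taking the geometric mean of 1 + x):

# Toy ADT/HTO counts for one cell
x <- c(0, 1, 5, 50, 500)

# Standard CLR with the pseudocount inside the ratio:
# log((1 + x) / geomean(1 + x)). Values below the geometric mean come out negative.
clr_standard <- function(x) {
  log1p(x) - mean(log1p(x))
}

# The variant at issue: log(1 + x / geomean(x)), pseudocount outside the ratio,
# so the output is always >= 0. (The geomean here is computed on 1 + x;
# Seurat's actual code averages log1p over nonzero entries only.)
clr_outer <- function(x) {
  log1p(x / exp(mean(log1p(x))))
}

round(clr_standard(x), 2)  # mixes negative and positive values
round(clr_outer(x), 2)     # non-negative; small counts pushed toward 0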

satijalab commented 4 years ago

In our normalization methods in Seurat, we typically avoid returning negative values (those are reserved for a future scaling step). This aids with visualization, interpretation, and also (as you state) modeling.

This is why we add the pseudocount where we do, precisely to avoid these negative values. In that sense, I agree that it is a modified CLR transformation.

wmacnair commented 1 year ago

Hi

I have just discovered for myself that the CLR transform used in Seurat is not actually the CLR. I don't think that the response that "we typically avoid returning negative values" is adequate. This implementation is just wrong, or at the very least not the CLR.

The approach currently implemented in Seurat means that the ADTs forming lower proportions all have their values truncated to zero or very nearly zero, and therefore contribute essentially nothing to distances between cells.

This plot shows the correlation between CLR values calculated for an ADT matrix I have been analysing; x-axis = correct CLR, y-axis = Seurat implementation. It is very clear that where there should be a wide range of negative values, these values are crunched to zero. For ADTs with consistently relatively low values, these dimensions therefore don't contribute to the CLR distance, despite containing useful information.

[Figure: seurat_clr_check, scatter of correct CLR (x-axis) against the Seurat implementation (y-axis)]

At the very least, the documentation should be changed to reflect that this is not the CLR.

However, it would be more useful if Seurat changed to use the proper CLR transform, as this should give more meaningful distances.
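As a rough numerical illustration of why the distances differ (toy counts; clr_standard and clr_outer are the same illustrative definitions sketched earlier in this thread, repeated so the snippet stands alone):

# A low-abundance ADT varies across cells against a fixed high-abundance background.
clr_standard <- function(x) log1p(x) - mean(log1p(x))       # pseudocount inside the ratio
clr_outer    <- function(x) log1p(x / exp(mean(log1p(x))))  # pseudocount outside (Seurat-style)

background <- c(500, 800)    # two high-abundance ADTs, identical in every cell
low_counts <- c(0, 2, 4, 8)  # the low-abundance ADT's counts in four cells

round(sapply(low_counts, function(k) clr_outer(c(k, background))[1]), 2)
# roughly 0.00, 0.02, 0.03, 0.05: nearly identical, so this ADT barely affects distances
round(sapply(low_counts, function(k) clr_standard(c(k, background))[1]), 2)
# roughly -4.30, -3.57, -3.23, -2.84: the variation in this ADT is preserved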

The users are not children! Negative numbers have been around for 2000+ years, and we are somewhat used to them now :)

Thanks Will

jackkamm commented 9 months ago

Also finding this thread now after realizing Seurat-CLR is not really a CLR.

In actuality, it is doing a library-size normalization based on the geometric mean, with pseudocounts added to handle zeros. It's closely related to the CLR, and a reasonable thing to do, though the median might be preferable to the geometric mean, as discussed here: https://bioconductor.org/books/3.14/OSCA.advanced/integrating-with-protein-abundance.html#library-size-normalization
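For reference, a rough base-R sketch of what median-based size factors could look like (the matrix layout and names are assumptions for illustration, not Seurat's API or the OSCA code):

# counts: rows = ADTs, columns = cells (assumed layout)
median_size_factors <- function(counts) {
  ref <- rowMeans(counts)  # average profile as the reference pseudo-cell
  keep <- ref > 0
  sf <- apply(counts[keep, , drop = FALSE], 2,
              function(cell) median(cell / ref[keep]))
  # real implementations also guard against cells that are mostly zeros (sf == 0)
  sf / mean(sf)            # centre the factors at 1
}
# normalized <- log1p(t(t(counts) / median_size_factors(counts)))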

However, I think the bigger problem is that Seurat-CLR normalizes samples so that the geometric mean is 1, and then adds a pseudocount of 1. This is quite a large pseudocount relative to the normalized sample, and results in counts below the geometric mean getting "crushed" to zero, as observed by @wmacnair. I think it might be better to scale to "counts per 100" (or 10, or 1000, or a user-defined value) before adding the "outer" pseudocount, i.e. using something like

log1p(100 * x / exp(mean(log1p(x))))

instead of the current normalization which is

clr_function <- function(x) {
  # Denominator: exp of the mean log1p over *nonzero* entries (the sum skips
  # zeros, but is divided by the full length(x)); the outer log1p then keeps
  # every value non-negative.
  return(log1p(x = x / (exp(x = sum(log1p(x = x[x > 0]), na.rm = TRUE) / length(x = x)))))
}
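Putting the two side by side on toy counts (the factor of 100 is the suggestion above, not an existing Seurat option):

x <- c(0, 1, 2, 5, 20, 200)

current  <- clr_function(x)                       # current behaviour, as defined above
proposed <- log1p(100 * x / exp(mean(log1p(x))))  # rescale before the outer pseudocount

round(rbind(current, proposed), 2)
# the small nonzero counts sit close to 0 in `current` but stay separated in `proposed`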
behzadk commented 8 months ago

A possible issue with using the 'un-modified' CLR function:

The negative values assigned to zeros in the dataset will be heavily determined by the sparsity of the sample. As such, those previously-zero values now encode a lot of information about the rest of the sample. In the 'seurat-flavoured' CLR, the zero values remain zero and therefore do not encode information about the rest of the sample.
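A tiny illustration of that point, taking the standard CLR as log((1 + x) / geomean(1 + x)) (toy numbers):

# Two cells that both have a zero for the first ADT but differ elsewhere.
clr_standard <- function(x) log1p(x) - mean(log1p(x))

clr_standard(c(0, 10, 10))[1]      # about -1.60
clr_standard(c(0, 1000, 1000))[1]  # about -4.61: the same zero, but its
                                   # transformed value is set by the rest of the cell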

This can be quite damaging from a differential expression perspective, so I can see why this 'seurat-flavoured' CLR is more suitable.

I would be interested to hear other opinions on this!

jackkamm commented 8 months ago

Yes indeed, transforming zeros to nonzeros can be dangerous, as illustrated by this paper: https://journals.asm.org/doi/10.1128/mbio.01607-23

And there are other benefits to preserving zeros, for visualization and computational efficiency.

On the other hand, there are situations where one might want to transform zeros: 0 out of 100 is weaker evidence of absence than 0 out of 1,000,000, for example. SCTransform and scry's null deviance residuals are two examples of transformations that don't preserve zeros.

When preserving zeros, the size of the pseudocount will affect how much things near zero get squished. I wonder if in some settings Seurat's pseudocount might be too large, causing too much squishing near zero.