nolanlab / citrus

Citrus Development Code
GNU General Public License v3.0

problem with scale in cluster plots #84

Closed gaudilliere closed 8 years ago

gaudilliere commented 8 years ago

The scales of the cluster plots don't match the range of expression of the clustering markers. For example, in a recent plot CD4 expression ranges from -0.99 to +1.63 and HLADR from -1.01 to +1.54.

In previous plots (i.e. the 2013 version!), these markers correctly ranged from about -0.05 to +5 in arcsinh-transformed values.

It almost looks like an arcsinh "ratio" of some sort is being calculated, or the scale arguments simply have an error. Brice

SamGG commented 8 years ago

No idea, but it looks like a scaling, i.e. centering and dividing by some dispersion coefficient or range.

rbruggner commented 8 years ago

When you scale the parameters before clustering, it does indeed center and divide all values to normalize the range of the data. Specifically, it centers the mean of the distribution of observations at zero and divides all values to get a standard deviation of 1. This happens after the data are asinh transformed (if you specified that things should be scaled, too).
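To make the order of operations concrete, here is a minimal Python sketch of "asinh transform, then center and scale". It is not citrus code. The cofactor of 5, the function names, and the use of the sample standard deviation (`ddof=1`) are assumptions made for illustration only.

```python
import numpy as np

def asinh_transform(raw, cofactor=5.0):
    # cofactor=5.0 is an assumed value for illustration, not taken from citrus
    return np.arcsinh(np.asarray(raw, dtype=float) / cofactor)

def standardize(values):
    # Center each channel (column) at zero and divide by its standard deviation.
    # ddof=1 (sample SD) is an assumption about the scaling convention.
    values = np.asarray(values, dtype=float)
    center = values.mean(axis=0)
    spread = values.std(axis=0, ddof=1)
    return (values - center) / spread, center, spread

# Simulated raw intensities for two hypothetical channels
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=3.0, sigma=1.5, size=(1000, 2))

transformed = asinh_transform(raw)                  # step 1: asinh
scaled, center, spread = standardize(transformed)   # step 2: center and scale

print("asinh range per channel: ", transformed.min(axis=0), transformed.max(axis=0))
print("scaled range per channel:", scaled.min(axis=0), scaled.max(axis=0))
```

After the second step every channel has mean 0 and standard deviation 1, so plotted values fall in a narrow band around zero rather than on the asinh scale.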

Those ranges are what you'd expect to see for those values, if you're using the scaling argument. So I don't think there's anything "broken" here, per se.

However, you might want citrus to "unscale" (but not "untransform") the data when plotting the results. This is definitely possible from a code perspective, but I'm hesitant to make it the default. The reason: if citrus "unscaled" the values but did not "untransform" them before plotting, the two transform operations would be treated inconsistently.

One possible option would be to put the scale of the plots in terms of absolute values instead of arcsinh / scaled values.
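The two inverse operations discussed above can be sketched as follows. This is a hedged illustration, not citrus code: `unscale`, `untransform`, and the cofactor of 5 are hypothetical names and values, and the sketch assumes the per-channel center and spread were kept from the scaling step.

```python
import numpy as np

COFACTOR = 5.0  # assumed asinh cofactor, for illustration only

def unscale(scaled, center, spread):
    # Undo centering and scaling: returns values on the asinh scale
    return np.asarray(scaled, dtype=float) * spread + center

def untransform(asinh_values, cofactor=COFACTOR):
    # Undo the asinh transform: returns values on the raw scale
    return np.sinh(np.asarray(asinh_values, dtype=float)) * cofactor

# Forward pass on a tiny made-up data set (rows = cells, columns = channels)
raw = np.array([[0.0, 10.0], [50.0, 200.0], [500.0, 3000.0]])
transformed = np.arcsinh(raw / COFACTOR)
center = transformed.mean(axis=0)
spread = transformed.std(axis=0, ddof=1)
scaled = (transformed - center) / spread

asinh_scale = unscale(scaled, center, spread)  # "unscale" only
raw_scale = untransform(asinh_scale)           # "unscale", then "untransform"
```

Stopping after `unscale` gives plot axes in asinh units; applying `untransform` as well gives axes in raw units.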

gfragiadakis commented 8 years ago

How does scaling the parameters affect the models? (I believe glmnet and other regularized algorithms already scale the features automatically?)

rbruggner commented 8 years ago

The scaling component was mostly put in to alter the behavior of the clustering (i.e. to make sure all the parameters have the same general dynamic range). I don't think it would have a big effect on the regression models outside of the determination of the clusters.

SamGG commented 8 years ago

I share Robert's opinion on the aim of the scaling applied after the asinh transform. It rescales the channels so that they cover the same general dynamic range. Why do this? Because clustering is based on distance (Euclidean distance, if I remember the code and the Supp Methods correctly), and we want every channel to carry the same weight in the distance computation. Is such centering and standardizing useful? It is quick and easy, but there are caveats. I have highlighted some of them in a short simulation: http://rpubs.com/SamGG/ScalingStand. Currently, I prefer customizing the scaling. There was a related discussion on Spade concerning the scaling coefficient of the asinh: https://github.com/nolanlab/spade/issues/119
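The point about channel weight in the Euclidean distance can be illustrated with a small Python sketch. It is not the simulation linked above, and the two simulated channels and the helper `channel_share` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
wide = rng.normal(0.0, 2.0, n)    # hypothetical channel with a wide dynamic range
narrow = rng.normal(0.0, 0.2, n)  # hypothetical channel with a narrow dynamic range
data = np.column_stack([wide, narrow])

def channel_share(values):
    # Fraction of the total squared Euclidean distance, summed over all
    # pairs of cells, that each channel contributes.
    diffs = values[:, None, :] - values[None, :, :]
    per_channel = (diffs ** 2).sum(axis=(0, 1))
    return per_channel / per_channel.sum()

before = channel_share(data)

# Center and scale each channel, then measure the shares again
scaled = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
after = channel_share(scaled)

print("share before scaling:", before)
print("share after scaling: ", after)
```

Before scaling, the wide channel accounts for nearly all of the distance; after standardizing, both channels contribute equally. Whether equal weight is the right choice for a given panel is the open question raised in the comment.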

gfragiadakis commented 8 years ago

Ah ok, it's for the clustering. That makes sense. Thanks!


SamGG commented 8 years ago

Although that scaling may be useful for clustering, I am wondering if there is a GUI option that disables the scaling, i.e. applies only the asinh transform. Currently, I don't see how displaying the data in original/raw/untransformed values (as Robert proposed earlier) would fit the usual perception of a practitioner looking at a bi-channel plot.

rbruggner commented 8 years ago

I don't believe scaling is applied by default. If you do not select "scaling" channels in the clustering setup component of the GUI, it should not scale the data. Let me know if that is not the case.

SamGG commented 8 years ago

Thanks.