vahuynh / GENIE3

Machine learning-based approach for the inference of gene regulatory networks from expression data.

normalization in genie3 #3

Closed ccshao closed 4 years ago

ccshao commented 5 years ago

There is a change in the normalization in GENIE3. In the very early R code I downloaded from your homepage, the input matrix is standardized:

expr.matrix <- apply(expr.matrix, 2, function(x) { (x - mean(x)) / sd(x) } )

However, this step is removed in the newer R code, which only scales the output y:

y <- y / sd(y)

In the paper it is said that standardization is needed to remove bias. Are there any updates on why input standardization is no longer recommended?

Thanks!

vahuynh commented 4 years ago

In each regression problem, only the output needs to be normalised (so that the sum of the input feature importances is equal to one). Whether or not you normalise the inputs does not change the importances.
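As a quick numerical check of this point (a sketch in Python with scikit-learn rather than the R randomForest package used by GENIE3): tree-based splits depend only on the ordering of each input's values, so standardizing the inputs leaves the importances unchanged.

```python
# Sketch (not GENIE3 itself): impurity-based importances from a random
# forest are invariant to standardizing the inputs, because each split
# only uses the ordering of one feature's values.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y = 2 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200)

# Standardize the inputs (the step removed from the newer GENIE3 code)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

rf_raw = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
rf_std = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_std, y)

# Same seed, same orderings -> identical trees up to shifted thresholds
print(np.allclose(rf_raw.feature_importances_, rf_std.feature_importances_))
```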

ccshao commented 4 years ago

Thanks, I'll stick to the old way described in the paper.

OceaneCsn commented 4 years ago

Hi,

I have a related question about normalisation in GENIE3. I have been reading the source code from Bioconductor and noticed some differences from this repository. For instance, in your code, you normalise the response variable before calling the random forests:

y <- y / sd(y)

But there is no such step in the Bioconductor code. Instead, a normalisation step is performed after the random forests: im <- im / sum(im)

Is this supposed to be equivalent to normalising the output? Also, in return(weight.matrix / num.samples) in your code, why is there a division by the number of samples, which is not present in the Bioconductor version of the code?

Thanks a lot!

vahuynh commented 4 years ago

Hi,

The two codes indeed apply different normalizations, but they are in fact equivalent.

Without any normalization (of the importances or of the data), the sum of feature importances derived from a Random Forest model is roughly equal to N*var(y), where N is the number of samples and var(y) is the variance of the output in the learning set. To ensure that feature importances are comparable from one model to another (i.e., from one target gene to another), GENIE3 applies a normalization so that the importances in each model sum up to 1. This can be done in two ways:

- Normalise the output to unit variance before training (y <- y / sd(y)), so that the sum of the raw importances is roughly equal to N, and then divide the importances by the number of samples (return(weight.matrix / num.samples)). This is what the code in this repository does.
- Train on the unnormalised output and divide the importances by their sum afterwards (im <- im / sum(im)). This is what the Bioconductor code does.
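The equivalence can be checked numerically. This is a sketch in Python with scikit-learn rather than R: `compute_feature_importances(normalize=False)` exposes each tree's raw impurity-based importances, and scikit-learn already divides these by the number of samples internally, which plays the role of the `weight.matrix / num.samples` step.

```python
# Sketch (not GENIE3 itself): the two normalizations give (almost) the
# same importance vector.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=300)

def raw_importances(X, y):
    """Raw impurity-based importances, averaged over the trees."""
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    per_tree = [t.tree_.compute_feature_importances(normalize=False)
                for t in rf.estimators_]
    return np.mean(per_tree, axis=0)

# Way 1: scale the output to unit variance before training; the raw
# importances (already divided by the sample count internally) then sum
# to roughly var(y/sd(y)) = 1.
im1 = raw_importances(X, y / y.std())

# Way 2: train on the unscaled output and divide by the sum afterwards,
# as the Bioconductor code does with im <- im / sum(im).
im2 = raw_importances(X, y)
im2 = im2 / im2.sum()

print(im1.sum())                        # roughly 1
print(np.allclose(im1, im2, atol=0.02))
```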

Best, Vân Anh

OceaneCsn commented 4 years ago

Hi,

Thank you very much for your detailed answer, it's all clear to me now.

Best regards, Océane

OceaneCsn commented 3 years ago

Hello again,

I have a new question about normalisation. In the case where the importance metric is "%IncMSE", I can see that the response is not normalised, and I'm really curious about the reason why. Isn't the MSE influenced by characteristics of the response such as its mean or variance?

Thank you very much,

Océane

vahuynh commented 3 years ago

Hi Océane,

In the case of "%IncMSE", the response is not normalized because the permutation-based importances are already normalized (as explained in the documentation of the randomForest R package, the importance is the mean difference over the trees, normalized by the standard deviation of the differences). That being said, I've never done any deep analysis of the permutation importance in the context of GENIE3, and hence there may be better ways to do the normalization.

Best,

Vân Anh