lalvim / PartialLeastSquaresRegressor.jl

Implementation of a Partial Least Squares Regressor
MIT License

PLS2 regressor worse than baseline for uniform random data #12

Closed Kolaru closed 3 years ago

Kolaru commented 3 years ago

I tried to fit uninformative data (random, uniform, and centered) with PLS2 and the regressor was unable to learn the baseline (note that I am using the MLJ interface from #10).

using PLSRegressor
using MLJ

import PLSRegressor: PLS

regressor = PLS(n_factors=1)

X = rand(1000, 5) .- 0.5
y = rand(1000, 2) .- 0.5
plsmachine = MLJ.machine(regressor, MLJ.table(X), MLJ.table(y))
MLJ.fit!(plsmachine)

pred = MLJ.predict(plsmachine)
yhat = MLJ.matrix(pred)

# Error of the model
println(sum((y .- yhat).^2))  # 249.26
# Baseline prediction yhat = 0
println(sum(y.^2))  # 166.33

I would expect the error to be no worse for the PLS2 model here, since by learning all internal parameters to be zero it would always return [0, 0] as output and match the baseline prediction.
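As a quick sanity check of the baseline figure: each entry of y is uniform on [-0.5, 0.5], so its variance is 1/12, and the expected baseline error over the 1000 × 2 targets is

println(1000 * 2 * (1 / 12))  # ≈ 166.7, consistent with the 166.33 above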

The scikit-learn version, on the other hand, works as expected. It doesn't quite learn all parameters to be zero, but the final error matches the baseline's.
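For reference, the scikit-learn comparison can be reproduced from Julia along these lines (a rough sketch via ScikitLearn.jl, assuming a working scikit-learn installation; this is not part of this package):

using ScikitLearn
@sk_import cross_decomposition: PLSRegression

X = rand(1000, 5) .- 0.5
y = rand(1000, 2) .- 0.5

skmodel = PLSRegression(n_components=1)
ScikitLearn.fit!(skmodel, X, y)
yhat_sk = ScikitLearn.predict(skmodel, X)

# Error of the sklearn model: comes out close to the baseline sum(y.^2)
println(sum((y .- yhat_sk).^2))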

lalvim commented 3 years ago

Hi @Kolaru , I will check.

The change to MLJ is still recent and is in the mlj branch. An important thing to note is that the internal normalization of the data no longer happens, and it was often fundamental. One thing to test is to insert the normalization via MLJ, which is done in the unit-testing examples. If it really doesn't work, I'll have to investigate the algorithm further.

Kolaru commented 3 years ago

The change to MLJ is still recent and is in the mlj branch.

Yeah, I wanted to use the nice MLJ workflow so much that I went as far as using a not-yet-merged branch =D. I still reported the issue because, as far as I can see, the underlying algorithm was not changed.

One thing to test is to insert the normalization via MLJ, which is done in the unit-testing examples.

The poor result also appears when using a Standardizer in a pipeline, as is done in the PR tests.

using PLSRegressor
using MLJ

import PLSRegressor: PLS

X = rand(1000, 5) .- 0.5
y = rand(1000, 2) .- 0.5

regressor = PLS(n_factors=1)

model = @pipeline Standardizer regressor target=Standardizer
plsmachine = MLJ.machine(model, MLJ.table(X), MLJ.table(y))
MLJ.fit!(plsmachine)

pred = MLJ.predict(plsmachine)
yhat = MLJ.matrix(pred)

# Error of the model
println(sum((y .- yhat).^2))  # 226.69
# Baseline prediction yhat = 0
println(sum(y.^2))  # 163.09
lalvim commented 3 years ago

@Kolaru, out of curiosity: did you perform this same test with PLS1 and KPLS? Just to know how things stand for the others.

Kolaru commented 3 years ago

I haven't tested PLS1 and KPLS yet, but overnight I realized I misunderstood how PLS works: since the weight matrices are orthogonal (or at least each of their columns has unit norm), the method cannot set all coefficients to zero. Sklearn seems to bypass that and has a way to converge the algorithm without enforcing that the y weights are normalized, hence returning a nearly zero Q matrix and producing a better prediction in this case.
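To illustrate the point with a small sketch (illustrative variable names, not the package's actual code): in a NIPALS-style update the y weights are normalized to unit length, so even when the covariance between the scores and the responses is pure noise, the resulting weight vector cannot shrink to zero.

using LinearAlgebra, Random

Random.seed!(1)
X = rand(1000, 5) .- 0.5
Y = rand(1000, 2) .- 0.5

t = X[:, 1]                 # stand-in for an X score vector
q_raw = Y' * t              # covariance with the responses: tiny for pure noise
q = q_raw / norm(q_raw)     # the normalization step forces unit norm
println(norm(q_raw), " -> ", norm(q))  # small value -> 1.0

# The contribution t * q' to the prediction therefore injects noise instead of
# collapsing to the zero baseline.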

lalvim commented 3 years ago

So, for now, we can close this issue.

I will be working this week to merge the mlj branch soon.

Kolaru commented 3 years ago

Yeah, this issue can be closed. If you want to keep track of the discrepancy with sklearn, or of the possible improvement, I can reformulate it as a new issue.

I will be working this week to merge the mlj branch soon.

Great! :)

lalvim commented 3 years ago

possible improvement

@Kolaru Nice! Please open a new issue regarding this.