Open alexdupre opened 1 week ago
After digging into the issue it looks like the problem is caused by the values of a feature column that are all equal.
In theory, this scenario should have been caught before calling the dotprf
function, here: https://github.com/haifengl/smile/blob/76f79ecf902d066fa005ba86f7d51f6c176df458/core/src/main/java/smile/regression/RidgeRegression.java#L184-L188
In practice, the constant check passed because the calculated standard deviation was NaN
instead of a zero in machine precision.
This snippet, simulating a vector of length 48 with a constant value 62571.43, can easily show the issue:
val c = 62571.43
val m = 48
val column = Array.fill(m)(c)
val sum = column.sum
val mean = sum / m // 62571.43000000003
val sumsq = column.map(v => v * v).sum // 1.8792882490775534E11
val variance = sumsq / m - mean * mean // -4.76837158203125E-7
val sd = Math.sqrt(variance) // NaN
val isZero = MathEx.isZero(sd) // false
The number that get passed to Math.sqrt
is negative instead of zero because of floating point arithmetic and the algortihm used by colSds
, and so the result is NaN. I'm not sure what's the correct way to fix the standard deviation function to work in this scenario, but the algorithm implemented by the commons math library correctly returns a variance of 0 in this case: https://github.com/apache/commons-math/blob/e580cde5f77019bb1b60c195a38de56ca4f5dcb0/commons-math-legacy/src/main/java/org/apache/commons/math4/legacy/stat/descriptive/moment/Variance.java#L399-L425
Thanks for deep dive! I have added safeguard on the colSds computation. Please try the master branch if it fixed the issue.
I haven't tried it, yet, but the fix seems fine for my case (I don't know if there is the possibility that with some data the math error produces a small positive variance that is then not considered a zero)
While I was experimenting with the Ridge Regression I found another unexpected error: if I use a lambda
equal to 0
, to emulate a simple Linear Regression, I get another LAPACK error code in the same potrf
function. Looking at the following code it seems 0 should be a supported value: https://github.com/haifengl/smile/blob/528ab52422a5b104e997ee86cf3b9e02188d3f1b/core/src/main/java/smile/regression/RidgeRegression.java#L167-L171
Describe the bug During the Ridge regression, for some slices of the original training data, the
dpotrf
function returns the error code-4
, even if thelda
argument is greater than then
argument.Expected behavior No error. FWIW, the ridge regression with scikit-learn on the same data works fine.
Actual behavior
Code snippet
Input data bug.csv
Additional context