haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

Unexpected error during Ridge Regression: `LAPACK GETRF error code: -4` #788

Open alexdupre opened 1 week ago

alexdupre commented 1 week ago

Describe the bug
During ridge regression, for some slices of the original training data, the dpotrf function returns error code -4, even though the lda argument is greater than the n argument.

Expected behavior
No error. FWIW, ridge regression with scikit-learn on the same data works fine.

Actual behavior

Iteration: 0
Iteration: 1
Iteration: 2
10:16:43.137 [main] ERROR smile.math.matrix.Matrix - LAPACK GETRF error code: -4
Exception in thread "main" java.lang.ArithmeticException: LAPACK GETRF error code: -4
    at smile.math.matrix.Matrix.cholesky(Matrix.java:1732)
    at smile.regression.RidgeRegression.fit(RidgeRegression.java:206)
    at smile.regression.RidgeRegression.fit(RidgeRegression.java:116)
    at smile.regression.package$.$anonfun$ridge$1(package.scala:115)
    at smile.util.package$time$.apply(package.scala:67)
    at smile.regression.package$.ridge(package.scala:115)
    at Bug$.$anonfun$new$1(Bug.scala:15)


Code snippet

import smile._
import smile.data.formula._
import smile.regression._

import scala.language.postfixOps

object Bug extends App {

  val df = read.csv("bug.csv")
  // Fit a ridge model on sliding 50-row slices; one of the slices triggers the LAPACK error.
  (0 to 3).foreach { i =>
    println(s"Iteration: $i")
    val trainData = df.slice(i, 50 + i)
    ridge("Label" ~, trainData, 1)
  }

}

Input data: bug.csv


alexdupre commented 22 hours ago

After digging into the issue, it looks like the problem is caused by a feature column whose values are all equal.

In theory, this scenario should have been caught before calling the dpotrf function, here: https://github.com/haifengl/smile/blob/76f79ecf902d066fa005ba86f7d51f6c176df458/core/src/main/java/smile/regression/RidgeRegression.java#L184-L188 In practice, the constant-column check passed because the calculated standard deviation was NaN rather than a zero within machine precision.
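To see why a NaN slips past such a check, here is a tiny standalone Java sketch. The isZero below is my own plain |x| < ε test, standing in for what MathEx.isZero presumably does (the epsilon is illustrative, not smile's actual value): every comparison involving NaN evaluates to false, so a NaN standard deviation sails straight through the guard.

```java
public class NaNZeroCheck {
    // Plain |x| < eps zero test, standing in for MathEx.isZero
    // (assumed implementation; the epsilon is illustrative).
    static boolean isZero(double x) {
        return Math.abs(x) < 1E-10;
    }

    public static void main(String[] args) {
        // sqrt of a (slightly) negative variance yields NaN.
        double sd = Math.sqrt(-4.76837158203125E-7);
        System.out.println(Double.isNaN(sd)); // true
        System.out.println(isZero(sd));       // false: NaN fails every comparison
    }
}
```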

This snippet, simulating a vector of length 48 with the constant value 62571.43, readily shows the issue:

  import smile.math.MathEx

  val c = 62571.43
  val m = 48
  val column = Array.fill(m)(c)

  val sum = column.sum
  val mean = sum / m // 62571.43000000003
  val sumsq = column.map(v => v * v).sum // 1.8792882490775534E11
  val variance = sumsq / m - mean * mean // -4.76837158203125E-7

  val sd = Math.sqrt(variance) // NaN
  val isZero = MathEx.isZero(sd) // false

The number that gets passed to Math.sqrt is negative instead of zero because of floating-point arithmetic and the algorithm used by colSds, so the result is NaN. I'm not sure what the correct way to fix the standard deviation function is for this scenario, but the algorithm implemented by the Commons Math library correctly returns a variance of 0 in this case: https://github.com/apache/commons-math/blob/e580cde5f77019bb1b60c195a38de56ca4f5dcb0/commons-math-legacy/src/main/java/org/apache/commons/math4/legacy/stat/descriptive/moment/Variance.java#L399-L425
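For illustration, here is a standalone Java sketch contrasting the naive E[x²] − E[x]² formula (which goes negative on this data) with a two-pass, corrected variance in the spirit of the Commons Math implementation. The final clamp to zero is my own addition as an extra safeguard, not part of Commons Math:

```java
public class TwoPassVariance {
    // Naive textbook formula E[x^2] - E[x]^2: catastrophic cancellation
    // can push the result slightly negative for constant data.
    static double naiveVariance(double[] x) {
        double sum = 0.0, sumsq = 0.0;
        for (double v : x) { sum += v; sumsq += v * v; }
        double mean = sum / x.length;
        return sumsq / x.length - mean * mean;
    }

    // Two-pass variance in the style of Commons Math: compute the mean
    // first, then sum squared deviations, subtracting a correction term
    // for accumulated rounding error. The clamp is my addition.
    static double twoPassVariance(double[] x) {
        int n = x.length;
        double sum = 0.0;
        for (double v : x) sum += v;
        double mean = sum / n;
        double accum = 0.0, accum2 = 0.0;
        for (double v : x) {
            double d = v - mean;
            accum += d * d;
            accum2 += d; // zero in exact arithmetic; captures rounding error
        }
        return Math.max(0.0, (accum - accum2 * accum2 / n) / n);
    }

    public static void main(String[] args) {
        double[] column = new double[48];
        java.util.Arrays.fill(column, 62571.43);
        System.out.println(naiveVariance(column));            // negative
        System.out.println(Math.sqrt(naiveVariance(column))); // NaN
        System.out.println(twoPassVariance(column));          // ~0, never negative
    }
}
```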

haifengl commented 6 hours ago

Thanks for the deep dive! I have added a safeguard to the colSds computation. Please try the master branch to see if it fixes the issue.

alexdupre commented 51 minutes ago

I haven't tried it yet, but the fix seems fine for my case (although I don't know whether, for some data, the floating-point error could produce a small positive variance that is then not treated as zero).

While I was experimenting with ridge regression I found another unexpected error: if I use a lambda equal to 0, to emulate a plain linear regression, I get a different LAPACK error code from the same potrf function. Looking at the following code, it seems 0 should be a supported value: https://github.com/haifengl/smile/blob/528ab52422a5b104e997ee86cf3b9e02188d3f1b/core/src/main/java/smile/regression/RidgeRegression.java#L167-L171
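Not a diagnosis of the actual data, but a minimal sketch of why lambda = 0 is the fragile case: ridge adds lambda to the diagonal of the Gram matrix X'X, so with lambda = 0 a rank-deficient design (e.g. two identical columns) leaves the matrix singular, and Cholesky/dpotrf must fail with a positive error code. Any positive lambda restores positive definiteness. The 2x2 Cholesky below is worked by hand for two identical columns with squared norm s:

```java
public class RidgeLambdaZero {
    // Trailing Cholesky pivot of the 2x2 Gram matrix [[s+l, s], [s, s+l]]
    // formed by two identical columns with squared norm s. dpotrf reports
    // a positive error code exactly when a pivot is not positive.
    static double trailingPivot(double s, double lambda) {
        double a11 = s + lambda, a12 = s, a22 = s + lambda;
        double l11 = Math.sqrt(a11);  // first Cholesky pivot
        double l21 = a12 / l11;
        return a22 - l21 * l21;       // second (trailing) pivot
    }

    public static void main(String[] args) {
        // lambda = 0: singular matrix, trailing pivot is 0 -> Cholesky fails.
        System.out.println(trailingPivot(4.0, 0.0)); // 0.0
        // lambda = 1: the ridge term makes the pivot strictly positive.
        System.out.println(trailingPivot(4.0, 1.0));
    }
}
```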