markrogoyski / math-php

Powerful modern math library for PHP: Features descriptive statistics and regressions; Continuous and discrete probability distributions; Linear algebra with matrices and vectors, Numerical analysis; special mathematical functions; Algebra
MIT License
2.33k stars 240 forks source link

[Feature request] Partial least squares regression #384

Closed Nyholm closed 3 years ago

Nyholm commented 3 years ago

I've looked at this library and I think Im correct when saying that Partial Least Squares (PLS) regression is missing. It is similar to the principal component regression that is implemented.

I need this feature and I was hoping that this project would accept a PR. Im not too confident on the math, so if someone know they can do this quickly, then I would be super happy. If not I'll make sure to struggle my way forward =)

See https://en.wikipedia.org/wiki/Partial_least_squares_regression

Question to the maintainer: Will you accept a PR with this?

Beakerboy commented 3 years ago

We’re happy for any improvements! Make sure you fully test the functions against the results from an independent source. For example, if a textbook has examples with solutions, or you can use R, octave, or numpy with the same data.

Beakerboy commented 3 years ago

Using the process from this document

https://personal.utdallas.edu/~herve/Abdi-PLS-pretty.pdf

I was able to use Excel to manually get the same results as the following R code:

library(mixOmics)

sample_data = as.data.frame(matrix(c(7, 4, 10, 16, 13, 7, 3, 5, 7, 3, 13, 14, 12, 11, 10, 7, 7, 5, 3, 3), nrow = 5, ncol = 4))
colnames(sample_data) <- c("input1","input2","input3","input4")
pred = as.data.frame(matrix(c(14, 10, 8, 2, 6, 7, 7, 5, 4, 2, 8, 6, 5, 7, 4), nrow = 5, ncol = 3))
sample_data
colnames(pred) <- c("output1", "output2", "output3")
pred
pls.model <- pls(sample_data, pred, ncomp=3)
pls.model$loadings

So that seems like a workable algorithm.

pls.model$loadings$X=W
pls.model$loadings$Y = C
pls.model$variates$X = each column proportional to T
pls.model$variates$Y = U
Nyholm commented 3 years ago

Thank you. This is much helpful. Do you mind share the excel doc too?

Beakerboy commented 3 years ago

I deleted the excel spreadsheet. PHP-ish psudocode follows. I have no idea how one actually uses these matrices in practice, so i would not be comfortable coding this class up myself with meaningful tests without a lot more research on my part. And I have not tested any of this in PHP.


// Prepare Data, can use similar process as in PCA class
$E = standardize columns of X
$F = Standardized columns of Y
// Ensure same number of rows
if ($E->getM() !== $F->getM()) {
  throw exception;
}
while (!$E->isEqual(MatrixFactory::zero($E->getM(), $E->getN())) {
  //Random Vector
  $u_new = MatrixFactory::random($F->getM(), 1)->scalarDivide(21);

  // Iterate this process until vectors converge
  do {
    // Sum of squares of each matrix to keep track of residuals.
    $SSX[] = ($E->frobeniusNorm()) ** 2;
    $SSY[] = ($Y->frobeniusNorm()) ** 2;
    $u = $u_new;
    // $w = Eᵀ * u / ‖Eᵀ * u‖; (unit vector in direction of Eᵀ * u);
    $temp = $E->transpose()->multiply($u);
    $w = $temp->scalarDivide($temp->frobeniusNorm());
    // $t = E * w / ‖E * w‖;
    $temp = $E->multiply($w);
    $t = $temp->scalarDivide($temp->frobeniusNorm());
    // $c = Fᵀ * t / ‖Fᵀ * t‖;
    $temp = $F->transpose()->multiply($t);
    $c = $temp->scalarDivide($temp->frobeniusNorm());
    $u_new = $F->multiply($c);
  } while (!$u_new->isEqual($u));

  $p = $E->transpose()->multiply($t);
  $b = $t->transpose()->multiply($u)->get(0,0); 
  $B[] = $b; // $B is a vector of each of the $b values. Eventually we should(?) convert $B to a diagonal matrix
  $T[] = $t;  // move $t to a new column of $T
              // and do the same with w, u, p, and c
  $E = $E->subtract($t->multiply($p->transpose()));
  $F = $F->subtract($t->multiply($c->transpose())->scalarMultiply($b));
}
Beakerboy commented 3 years ago

From the R documentation

“Many ... formulations can be found in literature. It has been shown, for instance, that only one of X and Y needs to be deflated; alternatively, one can directly deflate the crossproduct matrix S (as is done in SIMPLS, for example). Moreover, there are many equivalent ways of scaling. In the example above, the scores t have been normalised, but one can also choose to introduce normalisation at another point in the algorithm. Unfortu- nately, this can make it difficult to directly compare the scores and loadings of different PLSR implementations.“

Here’s another R PLS snippet that gives different scores but I might be doing something wrong with the setup.

library(pls)
sample_data = as.data.frame(scale(matrix(c(7, 4, 10, 16, 13, 7, 3, 5, 7, 3, 13, 14, 12, 11, 10, 7, 7, 5, 3, 3, 14, 10, 8, 2, 6, 7, 7, 5, 4, 2, 8, 6, 5, 7, 4), nrow = 5, ncol = 7), center = TRUE, scale = TRUE))
colnames(sample_data) <- c("x1","x2","x3","x4", "y1", "y2", "y3")
pls.model <- plsr(cbind(y1, y2, y3) ~ x1 + x2 + x3 + x4, data=sample_data, ncomp=3, method="oscorespls")

pls.model$scores
Beakerboy commented 3 years ago

Some code sources below: PLS2 looks like it is the most straightforward, and is what is described in the wikipedia article, however, most examples online appear to use the mvr function. It looks like several PLSR algorithms use SVD, which has not been integrated into mathPHP yet. PLS1/2 do not rely on SVD,

https://rdrr.io/cran/chemometrics/src/R/pls2_nipals.R https://rdrr.io/cran/pls/src/R/mvr.R

Beakerboy commented 3 years ago

As an update #385 is a pull request for PLS in MathPHP. I would appreciate any feedback if you are able to review it to see if it gives you your expected results, and that it has the features you would expect.

Beakerboy commented 3 years ago

This can be closed

markrogoyski commented 3 years ago

PLS is available in the latest release v2.2.0