markrogoyski / math-php

Powerful modern math library for PHP: Features descriptive statistics and regressions; Continuous and discrete probability distributions; Linear algebra with matrices and vectors, Numerical analysis; special mathematical functions; Algebra
MIT License
2.32k stars 238 forks source link

Multivariate Linear Regression support #260

Open sbrbot opened 6 years ago

sbrbot commented 6 years ago

There is a support for Regression in MathPHP but that's only the Simple Linear Regression. Is there an intention to support Multivariate (Linear) Regression in future or other types of regressions (like simple exp or 2. polynomial regression)?

markrogoyski commented 6 years ago

Hi, thanks for your interest in MathPHP. Yes, the intention is to support all the useful statistical regression methods. Since there is now a feature request for these we can prioritize these features in the backlog.

I'll update this issue thread when there is progress on implementing these features. Thanks again for your input.

Beakerboy commented 6 years ago

I’m pretty sure the Linear regression trait is written in a generic enough way that it will work for multiple linear regressions. All it would take is a polynomial model and a class that joins the trait with the model.

Edit: Looking at the code, I think it’s good for polynomial, but not multiple linear. However, only some small tweaks would need to be done.

sbrbot commented 6 years ago

I agree, only small tweeks would be needed to implement multivariate linear regression since it is just matter of matrix calculation (ß=(X'X).inv)X'Y) and MathPHP does have support for matrices . I do have one PHP library for such a regression but it uses external library for matices wich does not implement Gauss-Jordan optimization for inverse matrix calculation and calculation is too slow for bigger matirces. Instead I could use MathPHP's matix calculator. But since MathPHP has intention to cover broad spectra of mathematical areas, the support for multivariate linear regression could be implemented.

In MathPHP syntax calculating multivariate linear regression coefficients would be:

$ß=$X->transpose()->multiply($X)->inverse()->mutiply($X->transpose())->multiply($Y);

sbrbot commented 6 years ago

I intend will switch from old regression library to MathPHP but just let me check if MathPHP's matrix->inverse() method implements Gauss-Jordan optimization/elimination (or any other better optimization if there is one). :-)

I checked inside code, but NO, unfortunately MathPHP does not implement Gauss-Jordan optimization for calculating inverse matrix but uses standard calculation with augmented submatrices. If I have matrix for more vaiables, that goes into exponential growth of recursive calculation of augmented submatices, and it becomes slow.

Hm, maybe I could do something about it inside MathPHP...

Beakerboy commented 6 years ago

If you have the knowledge, willingness, and time to refactor the inverse method to an algorithm that is faster, we would love to have it. Thanks for any help!

markrogoyski commented 6 years ago

@sbrbot Can you explain what is the optimization you think the MathPHP inverse is not using?

The basic strategy is to reduce to RREF and then find the inverse. We use Gaussian elimination to reduce to REF, and then continue on all the way to RREF. My understanding of the terms is like this:

Based on my understanding that is what we are doing. Is there something else in the meaning of Gauss-Jordan elimination that we are not doing?

sbrbot commented 6 years ago

Thanks Mark, you're right that's what I need. I see now, MathPHP does use Gauss-Jordan elimination! I was wrong, did not see it on first sight. However, reducing matrix on RREF means that matrix should have only one's as diagonal elements (https://en.wikipedia.org/wiki/Row_echelon_form). MathPHP does not normalize rows (dividing them with first non-zero element) with ->rref() fn.

Beakerboy commented 6 years ago

@sbrbot Do you have an example of some input and expected output that we can compare with what the function currently produces?

markrogoyski commented 6 years ago

@sbrbot Can you explain what you mean by not normalizing rows vs what it is currently doing?

Please take a look at the unit test cases for the rref method. There are 55 test cases just on rref directly which have an input matrix and the output RREF form. https://github.com/markrogoyski/math-php/blob/develop/tests/LinearAlgebra/MatrixDecompositionsTest.php#L300

Look at the data provider immediately below the test to see the inputs and expected outputs. All test cases were generated with authoritative third-party sources. Are there any results there that you think should be computed differently?

For reference, the be REF, a matrix must satisfy:

Then, to be in RREF, a matrix must satisfy:

If you can provide some sample data and output where MathPHP's rref method is not doing that I can take a look at it. If you think there are some other criteria that the output matrix must fulfill in order to be considered in RREF, please let me know what you think those are and what the documented source of the definition is.

I'm happy that people look thoroughly at MathPHP's code and hold it up to scrutiny. I want the code to be correct. That's why I've written thousands and thousands of tests. However, in this case, I'm not convinced there is any issue. But if you can provide some data that shows otherwise I will definitely take a look!

Thanks.

sbrbot commented 6 years ago

Dear Mr Rogoyski

First of all, thank you very much for your time. I really appreciate it and your effort.

I was dealing with some cases where Multivariate Linear Regression prediction was needed (actually some prediction but MLR was the easiest one) and I decided to implementit in one small PHP project. Actually, it is a small tracking program which parses data about second hand cars and motorcycles on market (from some web site), grabs its attributes like mileage, engine volume, engine power, etc. and calculates what is predicted price according to these input values in its group (in order to show if the vehicle is ovepriced or underpriced comparing to others). At the begining I found one library for multivariate linear regression calculation which uses it's own Matrix library which is total disaster ( https://github.com/mnshankar/linear-regression). The MLR calculation in my case lasted from several minutes for almost half an hour. Ok, that was on my local limited machine, not on big server of my hosting service but nevertheless, it is not efficient processing.

The whole wizdom here (in calculating Multivariate Linear Regression coefficients) is calculating inverse of input Matrix. In my case I have few hundreds of vehicles and dozen of its variables so it's huge matrix. Therefore I needed efficient code for this calculation. My first try with mentioned library worked but lousy. So I decided to find another or develop my own library which will exploit Gauss-Jordan elimination in orer to improve processing. In the same time I was googling in order to find other implementations for this purpose and (among others) found MathPHP.

Now I studied the MathPHP code for matrix calculations and it works great. No errors noticed from my side!

But I see that MathPHP in calculating inverse Matrix firstly - makes REF like (using Gauss-Jordan elimination)

A x x x 0 B x x 0 0 C x 0 0 0 D

During this process there are rows swappings (if row has 0 in diagonal element)

And afterwards goes in reverse order to make RREF like:

1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1

My code does not have forward and backward loops but one loop calculating this way:

--pass#1

A x x x /A x x x x x x x x x x x x

1 x x x b x x x - row1b c x x x - row1c d x x x - row1*d

--pass#2

1 a x x 0 B x x /B 0 b x x 0 c x x

1 a x x - row2a 0 1 x x 0 c x x - row2c 0 d x x - row2*d

--pass#3

1 0 a x 0 1 b x 0 0 C x /C 0 0 d x

1 0 a x - row3 a 0 1 b x - row3 b 0 0 1 x 0 0 d x - row3*d

--pass#4

1 0 0 a 0 1 0 b 0 0 1 c 0 0 0 D /D

1 0 0 a - row4a 0 1 0 b - row4b 0 0 1 c - row4*c 0 0 0 1

--

1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1

So it goes through diagonal and for that row normalizes it (divides with column coeff to reach 1 in diagonal) and immediately reduces both (rows above and bellow) to have zeros in column.

With this approach I made some smaller processing improvements. I tested matrix inversions with my rudimentary (naked) library and it reaches up to 4 times beter speed than MathPHP's inverse() for smaller matrices with 1-10 rows, 2-3 times better speed for matrices wih 11-20 rows, and less for bigger matrices. So that's nothing special, taking into account that MathPHP's implementation class is not naked and has a lot of checks, that's almst the same thing.

Maybe I will make same changes in MathPHP code in order to implement some of my code chunks from here and you'll see if it is acceptable for you (in github).

Nevertheless of Matrix inversion, I will try to do something about Multivariate Linear Regression in MathPHP.

Also, for some other project I need Discrete Fourier Transform (DFT) because I need to find frequency spectra (periodicity of some parameters) and for that I need Matrix which will be able to manipulate with complex elements. Currenty MathPHP's Matrix class works just with ordinary (float) elements (not Complex ones). So what would be better approach to extend current Matrix class to check for every element is instanceOf Complex and then apply multipy(), add() and other methods of Complex class (instead of simple + - * / operators) or to create complete new ComplexMatrix class which will be able to cope with Complex elements?

P.S. What a paculiar but interesting math notation/naming for variables used inside code :-) I develop in NetBeans and it does not like it too much (reports a lot of syntax errors).

Sincerely

Stjepan Brbot

On Thu, Dec 7, 2017 at 8:27 AM, Mark Rogoyski notifications@github.com wrote:

@sbrbot https://github.com/sbrbot Can you explain what you mean by not normalizing rows vs what it is currently doing?

Please take a look at the unit test cases for the rref method. There are 55 test cases just on rref directly which have an input matrix and the output rref form. https://github.com/markrogoyski/math-php/blob/develop/tests/LinearAlgebra/ MatrixDecompositionsTest.php#L300

Look at the data provider immediately below the test to see the inputs and expected outputs. All test cases were generated with authoritative third-party sources. Are there any results there that you think should be computed differently?

For reference, the be REF, a matrix must satisfy:

  • all non zero rows are above any zero rows
  • the leading coefficient of a non zero row is always to the right of the leading coefficient of the row above it.

Then, to be in RREF, a matrix must satisfy:

  • Be in REF
  • Every leading coefficient is 1
  • The leading coefficient is the only non zero value in its column

If you can provide some sample data and output where MathPHP's rref method is not doing that I can take a look at it. If you think there are some other criteria that the output matrix must fulfill in order to be considered in RREF, please let me know what you think those are and what the documented source of the definition is.

I'm happy that people look thoroughly at MathPHP's code and hold it up to scrutiny. I want the code to be correct. That's why I've written thousands and thousands of tests. However, in this case, I'm not convinced there is any issue. But if you can provide some data that shows otherwise I will definitely take a look!

Thanks.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/markrogoyski/math-php/issues/260#issuecomment-349885103, or mute the thread https://github.com/notifications/unsubscribe-auth/AB06M1JCB1Z48i6eLDkkdGkOUSxLNtgtks5s95NFgaJpZM4QqtYG .

Beakerboy commented 6 years ago

If you need a matrix of complex numbers, you can use an instance of the ObjectMatrix, and populate the elements with Complex objects. The object matrix is able to be populated with any PHP object that implements the objectArithmetic interface. This means that the object supports addition subtraction and multiplication. With the object matrix, one can do matrix algebra on the Polynomial, Complex, or Rational number object in MathPHP. I am working 128 bit integer and floating point objects as well.

Edit: It looks like the ObjectMatrix is only for square matrices at this time. It’s used internally to ease calculation of Eigenvalues.

markrogoyski commented 6 years ago

The ObjectMatrix is a work in progress, but eventually could be used to find things like the inverse of a matrix composed of ComplexNumbers once all the matrix methods are reimplemented within ObjectMatrix to use ObjectArithmetic.

Speaking of SquareMatrix, I remember we talked about removing that, and you convinced me we should do it. I have it in the backlog of items to do, but I don't remember what the reason for removing it was.

sbrbot commented 6 years ago

There is SquaredObjectMatrix which should suffice ( https://en.wikipedia.org/wiki/DFT_matrix). This is squared matrix with Complex objects and there's no need for inverse() method.

On Fri, Dec 8, 2017 at 8:24 AM, Mark Rogoyski notifications@github.com wrote:

The ObjectMatrix is a work in progress, but eventually could be used to find things like the inverse of a matrix composed of ComplexNumbers once all the matrix methods are reimplemented within ObjectMatrix to use ObjectArithmetic.

Speaking of SquareMatrix, I remember we talked about removing that, and you convinced me we should do it. I have it in the backlog of items to do, but I don't remember what the reason for removing it was.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/markrogoyski/math-php/issues/260#issuecomment-350192605, or mute the thread https://github.com/notifications/unsubscribe-auth/AB06M4XHlKIt9SNhHn4QKot8lzTPpv7qks5s-OQTgaJpZM4QqtYG .

Beakerboy commented 6 years ago

I think my opinion was to move all the square methods to a trait. Then, if the Factory determines the matrix it square, it can add the trait to the class. This would prevent us from having square and rectangular versions of every matrix type.

keikland commented 6 years ago

New to this lib, but getting into it now. Multiple linear regression is a feature I certainly need. The discussion on implementation thus far has been based on the textbook approach, which for serious use is slow and can be low accuracy. Since full-rank X'X is a positive definite symmetric matrix, a QR (or Cholesky, preferably scale-stabilized) factorization is the normal inversion/equation solving mechanism.

There are other methods too for extended functionality, like SVD and rank-deficient Moore-Penrose inverse (with solution vector norm minimization) that could be added as a selectable solution class (perhaps including Gauss-Jordan elimination/explicit inverse, and complex versions). sbrbot's advice to use the DFT_matrix approach is highly unstandard, probably numerically unstable (e.g. for calculation of normal equations) and not efficient for general use.

In most cases, a sparse implementation in column-major format speeds up calculations dramatically and also opens the way into bias-reducing extensions (2 and 3-stage least squares), confidence intervals and advanced hypothesis testing.

Anyway, for the first step, definitely go with a sparse model and Cholesky factorization (QR up front if an efficient PHP code class can be found). Generally, for the select cases where access to the explicit inverse is useful, e.g. normal equations, it can always be computed in an additional step.

Kjell

markrogoyski commented 6 years ago

Hi @keikland Thanks for your interest in MathPHP.

Some of the next things I am going to investigate implementing are matrix QR decomposition and SVD. Basically, build up the basic tools that can be used and applied to implement other things. If you have thoughts or opinions on prioritizing features please voice your opinion. And if you can offer any expertise in some of the complex matrix operations, don't hesitate to contribute if you'd like to.

Mark

keikland commented 6 years ago

Hi Mark, QR would be great, and I agree that it is essential to establish a strong/fast basic toolset with key matrix-matrix and matrix-vector operations. I don't know MathPHP well enough yet, but the structure of the lib suggests that much of the logic is already present. It would be great to interface with a fast BLAS library implementation, but then the issue of packing-analysis-repacking loop arises, much like CPU vs GPU. Probably a wrong choice, but should be considered.

Array operations already seem to be a lot faster with PHP 7 than earlier version, and with clang-like features and performance coming in v7.3 or v8 it may be that it a numerical index native PHP array may actually become quite good. Compatibility and maintainability would also benefit. What I am basically saying is that a very optimized pure PHP implementation may be the best way to go, but already now creating a modular structure to be able to easily implement parallel processing (e.g. matrix-matrix multiplication and Shur-complement like factorization.)

On the feature set, I would suggest to include (provision for) automatic branching to rank-deficient operations, if detected by the "naive" QR factorization. That would mean recasting as Moore-Penrose inverse and additional matrix-matrix multiplications. However, the matrix/vector operations are simple enough. My preferred reference here is the old Rao book on Statistical Inference, which is also great as a guide for implementing parametric (composite/multi-factor) hypothesis testing.

Almost forgot to mention that fast matrix transposition and/or explicitly maintaining a separate row-major version of key sparse matrices can be key to performance. Memory is cheap these days, but PHP's new arrays also require a lot less memory than before.

As a side-node, it is interesting that, with the focus these days in with machine learning, multiple linear regression is not more in focus. It is basically the earliest form of big-data inference. Many neural network implementations are incredibly rank-deficient, but the resolution of rank-deficiency is not transparent. Usually it is ignored, and it translates into very weak forecasts if underlying factors change only slightly. With a rank-deficient multiple regression approach, whether alone or integrated into a ML framework, it is both easier to diagnose and has a transparent principle "cost of error" resolution mechanism.

In summary: PHP arrays+QR+fast matrix/vector operations(+rank-deficient MLR+parallelism). These features give the foundations for two/tree-stage least squares extensions and maximum likelihood estimation too. Note also that there is a lot of focus in the academic and advanced user communities on the numerical accuracy of the algorithms. It would be very good if MathPHP could include some of the standard test sets both for development purposes and to document reliability.

Kjell

markrogoyski commented 6 years ago

Hi Kjell,

Thanks again for your interest and comments about improving MathPHP.

The basic premise of the library is this would all you would need to include to get all the common math functionality you would use, so the goal is to do everything in PHP without other external dependencies.

You mentioned some standard tests. Can you provide some links or resources for those tests? MathPHP has 100% unit test coverage and strives for correctness by going beyond 100% test coverage on important features. I'd be happy to add any additional tests if you can provide a source or pull request.

Thanks again! Mark

keikland commented 6 years ago

Hi Mark, For starters, NIST datasets. https://www.estima.com/resources_nistresults.shtml gives a link to these resources and also shows how one vendor addresses the issue.

R also has a decent description of the range of relevant models/applications and several datasets at https://cran.r-project.org/web/views/Econometrics.html.

Since inversion/factorization/decomposition is at the root of the solution to most of these problems, including applications in physical sciences, the factorization of tough test matrices is an issue in itself. Importantly, that includes the proper identification of singularity. NIST again, the the MatrixMarket collection at https://math.nist.gov/MatrixMarket. Browse by collection; or the top ten list. I do not have a php version of an mm read routine, but the example c code is easy enough to understand.

Kjell

From: Mark Rogoyski [mailto:notifications@github.com] Sent: Sunday, 25 February, 2018 19:44 To: markrogoyski/math-php math-php@noreply.github.com Cc: keikland kjell.eikland@online.no; Mention mention@noreply.github.com Subject: Re: [markrogoyski/math-php] Multivariate Linear Regression support (#260)

Hi Kjell,

Thanks again for your interest and comments about improving MathPHP.

The basic premise of the library is this would all you would need to include to get all the common math functionality you would use, so the goal is to do everything in PHP without other external dependencies.

You mentioned some standard tests. Can you provide some links or resources for those tests? MathPHP has 100% unit test coverage and strives for correctness by going beyond 100% test coverage on important features. I'd be happy to add any additional tests if you can provide a source or pull request.

Thanks again! Mark

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmarkrogoyski%2Fmath-php%2Fissues%2F260%23issuecomment-368333424&data=02%7C01%7C%7C54c5d65797f64b94c98d08d57c800164%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636551811578481059&sdata=J5b1gVK0eAi9E5eFEtu61c%2FvxodFAUT5jwUmooK3iig%3D&reserved=0, or mute the threadhttps://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAUTZWCkf0kzIbaYoDJj_4NVYC1PXJBvBks5tYanvgaJpZM4QqtYG&data=02%7C01%7C%7C54c5d65797f64b94c98d08d57c800164%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636551811578481059&sdata=n%2F9JdCWmBAyRlW0%2BFyHu2S%2FX%2BRPLJtVMTXHjDyvoBdo%3D&reserved=0.