Sparse matrix in LinearRegression

neurospin / pylearn-parsimony_history

Sparse and Structured Machine Learning in Python

BSD 3-Clause "New" or "Revised" License

Sparse matrix in LinearRegression #25

Open duboism opened 10 years ago

duboism commented 10 years ago

Fouad and I recently wanted to use the LinearRegression loss function with a sparse X matrix (the identity). This is clearly a corner case but it really helps in our case.

Unfortunately, the method L uses np.linalg.svd, which doesn't work with sparse matrices.

After searching a bit, it appears that scipy 0.9.0 (the version shipped with Ubuntu 12.04) doesn't handle this case well. Apparently it is better supported in newer versions. Édouard has a small code snippet to perform SVD with old numpy versions.

Maybe we could introduce a function in utils.math that would intelligently wrap the right call to numpy/scipy (depending on the version) or to Édouard's trick or maybe to your FastSVD function.
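A minimal sketch of what such a wrapper might look like, assuming the only quantity needed is the largest singular value and that a scipy version providing scipy.sparse.linalg.svds is installed. The function name largest_singular_value is hypothetical and not part of the package; neither Édouard's snippet nor FastSVD is reproduced here.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds


def largest_singular_value(X):
    """Return the largest singular value of a dense or sparse 2D matrix.

    Hypothetical helper: dispatches on the input type instead of
    always calling np.linalg.svd.
    """
    if sp.issparse(X):
        if min(X.shape) > 1:
            # svds only computes k < min(X.shape) singular values,
            # so k=1 needs both dimensions to be at least 2.
            s = svds(X.astype(np.float64), k=1,
                     return_singular_vectors=False)
            return float(s[0])
        # Tiny sparse matrix: densifying is cheap, use the dense path.
        X = X.toarray()

    # Dense path: singular values come back sorted in descending order.
    s = np.linalg.svd(np.asarray(X, dtype=np.float64), compute_uv=False)
    return float(s[0])
```

Dispatching on the input type rather than on the scipy version keeps the call site unchanged; a version check or a fallback for old installations could be added inside the sparse branch.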

tomlof commented 10 years ago

Yes, the package currently requires all matrices to be 2D 64-bit arrays. We should think about how to seamlessly allow other types of input data, and especially sparse matrices!

SVD for sparse matrices is included in newer versions of scipy, as you say, but we do not have access to those. ;-/

What you could do, for instance, is implement a new loss function, something like SquaredSumError, that would represent the function f(x, y) = ||x - y||²_2.
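A rough sketch of such a loss, assuming y is fixed at construction and x is the variable. The class name follows the suggestion above; the method names f and grad are assumptions and may not match the package's actual interface. Because f(x) = ||x - y||²_2 has gradient 2(x - y), its Lipschitz constant is simply 2, so no SVD is needed.

```python
import numpy as np


class SquaredSumError:
    """Hypothetical loss f(x) = ||x - y||_2^2 for a fixed target y."""

    def __init__(self, y):
        self.y = np.asarray(y, dtype=np.float64)

    def f(self, x):
        # Sum of squared residuals between x and the target y.
        r = np.asarray(x, dtype=np.float64) - self.y
        return float(np.sum(r ** 2))

    def grad(self, x):
        # Gradient of ||x - y||^2 with respect to x.
        return 2.0 * (np.asarray(x, dtype=np.float64) - self.y)

    def L(self):
        # Lipschitz constant of the gradient: the Hessian is 2 * I.
        return 2.0
```

This corresponds to a least-squares loss with X equal to the identity, which is why it sidesteps the sparse SVD problem for this particular case.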

duboism commented 10 years ago

@tomlof we created the SquaredSumError loss function, so right now it's no longer an issue for us.

However, we should probably think about using sparse matrices where possible. As we mentioned in other issues, we could simply target a newer scipy version (with sparse SVD).

tomlof commented 10 years ago

Ok, great! Feel free to add that to the library. You can put it in functions.losses or functions.penalties, whatever makes the most sense.

Actually, some penalties could be used as loss functions, and vice versa, so the distinction between losses and penalties is not clear-cut. Perhaps we should think about a better way to split them up in different modules?