piskvorky / sparsesvd

Python wrapper around SVDLIBC, a fast library for sparse Singular Value Decomposition
55 stars 18 forks source link

Invalid shape in axis #3

Open moustaki opened 11 years ago

moustaki commented 11 years ago

Hello!

I was wondering about the following behaviour, which made one of my tests break in a recent sparsesvd update.

>>> C = numpy.array([[ 0. ,  0.7,  0. ,  0. ],
                                 [ 1. ,  0. ,  0.7,  0. ],
                                 [ 0. ,  0.7,  0. ,  1. ],
                                 [ 0. ,  0. ,  0.7,  0. ]])
>>> K = scipy.sparse.csc_matrix(C)
>>> sparsesvd(K, 2)
ValueError: Invalid shape in axis 0: 0.

Not sure what is going on here - changing any value in this matrix seems to make it work.

Best, y

piskvorky commented 11 years ago

I can reproduce this in 0.1.9, thanks for the report y.

@alepulver this error doesn't happen in 0.1.8 -- any idea where the regression's coming from?

alepulver commented 11 years ago

This seems to happen with any matrix equivalent to numpy.diag(a,b,b,a) when a and b are lower than 20. I believe svdLAS2SA could be returning something different in these cases. What did it return in 0.1.8 where it worked?

Edit: found it. In these cases, srec.d is 0 (the length of "s". as returned by SVDLIBC). What should we do with the result?

piskvorky commented 11 years ago

0.1.8 returns

(array([[ 0.32372233,  0.4780341 ,  0.79283656,  0.19518564],
       [-0.19518564,  0.79283656, -0.4780341 ,  0.32372233]]),
 array([ 1.30002747,  1.30002747]),
 array([[ 0.36771077,  0.60121131,  0.36249528,  0.60986139],
       [ 0.60986139, -0.36249528,  0.60121131, -0.36771077]]))

(same machine)

alepulver commented 11 years ago

Could you please confirm that svdResult->d is 2, as it should be?

Edit: don't worry, I don't think that's the case. I'm getting d=0 and S=[0,0] from SVDLIBC. Either I'm calling it incorrectly, or something very wrong is going on.

Were there any special compilation flags I removed when switching to Cython?

alepulver commented 11 years ago

When enabling SVDVerbose in svdlib.c, I get the following output when it works:

In [2]: sparsesvd.sparsesvd(scipy.sparse.csc_matrix(np.diag([1,15,15.1,1])), 2)
SOLVING THE [A^TA] EIGENPROBLEM
NO. OF ROWS               =      4
NO. OF COLUMNS            =      4
NO. OF NON-ZERO VALUES    =      4
MATRIX DENSITY            =  25.00%
MAX. NO. OF LANCZOS STEPS =      4
MAX. NO. OF EIGENPAIRS    =      2
LEFT  END OF THE INTERVAL = -1.00E-30
RIGHT END OF THE INTERVAL =  1.00E-30
KAPPA                     =  1.00E-06

NUMBER OF LANCZOS STEPS   =      4
RITZ VALUES STABILIZED    =      4
SINGULAR VALUES FOUND     =      2

And this when it doesn't.

In [3]: sparsesvd.sparsesvd(scipy.sparse.csc_matrix(np.diag([1,15,15,1])), 2)                                                         
SOLVING THE [A^TA] EIGENPROBLEM
NO. OF ROWS               =      4
NO. OF COLUMNS            =      4
NO. OF NON-ZERO VALUES    =      4
MATRIX DENSITY            =  25.00%
MAX. NO. OF LANCZOS STEPS =      4
MAX. NO. OF EIGENPAIRS    =      2
LEFT  END OF THE INTERVAL = -1.00E-30
RIGHT END OF THE INTERVAL =  1.00E-30
KAPPA                     =  1.00E-06

NUMBER OF LANCZOS STEPS   =      4
RITZ VALUES STABILIZED    =      0
SINGULAR VALUES FOUND     =      0

Do you have any ideas, @piskvorky ?

piskvorky commented 11 years ago

Heh, might well be connected to the "fewer factors than requested" discussed elsewhere...

@alepulver Is the output deterministic for you (0.1.8 works, 0.1.9 doesn't)? Or does it fail intermittently even on the same release?

alepulver commented 11 years ago

Sorry for the delay. I don't know if it's related, but at least 0.1.8 behaves the same in my machine. The only difference is that it doesn't throw an exception, but returns an array with invalid shape (so it's SVDLIBC returning no singular values).

This seems a SVDLIBC problem.

In [1]: import numpy as np, sparsesvd, scipy.sparse

In [2]: sparsesvd.sparsesvd(scipy.sparse.csc_matrix(np.diag([1,2,2,1])), 2)
Out[2]: 
(array([], shape=(0, 4), dtype=float64),
 array([], dtype=float64),
 array([], shape=(0, 4), dtype=float64))

In [3]: sparsesvd.sparsesvd(scipy.sparse.csc_matrix(np.diag([1,2,2.1,1])), 2)
Out[3]: 
(array([[ -2.58143374e-20,   8.98751972e-16,  -1.00000000e+00,
         -1.32169408e-17],
       [ -6.77626358e-21,  -1.00000000e+00,  -8.16013923e-16,
         -2.77555756e-17]]),
 array([ 2.1,  2. ]),
 array([[ -5.42101086e-20,   9.43689571e-16,  -1.00000000e+00,
         -2.77555756e-17],
       [ -1.35525272e-20,  -1.00000000e+00,  -7.77156117e-16,
         -5.55111512e-17]]))
piskvorky commented 11 years ago

Aha, ok, thanks. I'll have to investigate more thoroughly then :)

I'll check to see if vanilla SVDLIBC behaves the same way too (no Python), and then it's debugging time...

piskvorky commented 11 years ago

I debugged, and the problem seems to be somewhere deep inside LAS2. The error bounds (bnd array coming out of lanso) are rubbish in face of algebraic multiplicity, although the ritz values are fine. The large bounds lead to ritz values being classified as "unstabilized" => errors later on.

This appears to be the root of all the other reported issues with sparsesvd as well: #1, #3 .

I tried comparing SVDLIBC (=Doug Rohde's version) against SVDPACKC (the original Michael Berry's version), but didn't notice any relevant difference in the code. So this possibly happens with SVDPACKC as well, which is really strange.

piskvorky commented 11 years ago

I sent an email to Doug Rohde, hopefully he'll shed some light.

piskvorky commented 11 years ago

Doug's reply:

"I haven't worked on SVDLIBC in years. And it was a port of someone else's code, so I never fully understood the majority of it. Mostly I was fixing bugs and improving memory management. So I probably can't offer that much about the algorithm, other than trying to read and understand the code."

alepulver commented 11 years ago

Thank you for spending so much time on this issue. I'm not really into numerical methods, so I may not be able to help. But maybe there is a paper about the algorithm, or newer ones. As for related Python modules I've found:

Divisi2 (which doesn't support Python 3.x) http://csc.media.mit.edu/docs/divisi2/

sklearn (which also includes recent references to papers) http://scikit-learn.org/stable/modules/decomposition.html

mlpy (it's not clear if it supports sparse matrices efficiently) http://mlpy.sourceforge.net/docs/3.5/dim_red.html#principal-component-analysis-pca

They mention PCA, but AFAIK it's related to the SVD of the normalized and rescaled matrix (or the covariance matrix), so I thought it might be useful.

piskvorky commented 11 years ago

Divisi2 uses SVDLIBC too, and therefore suffers from the same problems.

I already created and maintain a package for large-scale sparse SVD (gensim), so this is not a big issue for me. But to tell the truth, I am not thrilled about maintaining sparsesvd if it contains bugs that nobody understands and nobody can fix...

Would you like to become the principal maintainer of sparsesvd, @alepulver ?

piskvorky commented 11 years ago

Or let's put it differently -- the person who can successfully resolve this issue will be crowned the king of sparsesvd :)

alepulver commented 11 years ago

I don't think I can debug it. Maybe an exception could be raised in those cases, or adding a tiny random noise to make it work (at least close to the real result). What do you think of this "hacky" solution?

piskvorky commented 11 years ago

I don't think that's a good idea @alepulver . Partly because we don't even know what "those cases" are.

I'll try to delve into the details and fix this, but I can't promise a date. For now (and the past 10 years?), it looks like the library will remain broken.

pminervini commented 11 years ago

Is this behaviour related ? I'm getting the very same (weird) result (no eigenvalues, empty U) with vanilla svdlibc (now I'm looking around for patches):

import numpy, scipy.sparse from sparsesvd import sparsesvd smat = scipy.sparse.csc_matrix((2, 2)) smat[1,1]=1.0 u, s, vt = sparsesvd(smat, 2) Traceback (most recent call last): File "", line 1, in File "sparsesvd.pyx", line 39, in sparsesvd.sparsesvd (sparsesvd.c:2239) File "stringsource", line 247, in View.MemoryView.array_cwrapper (sparsesvd.c:5951) File "stringsource", line 147, in View.MemoryView.array.cinit (sparsesvd.c:4887) ValueError: Invalid shape in axis 0: 0. u, s, vt = sparsesvd(smat, 1) Traceback (most recent call last): File "", line 1, in File "sparsesvd.pyx", line 39, in sparsesvd.sparsesvd (sparsesvd.c:2239) File "stringsource", line 247, in View.MemoryView.array_cwrapper (sparsesvd.c:5951) File "stringsource", line 147, in View.MemoryView.array.cinit (sparsesvd.c:4887) ValueError: Invalid shape in axis 0: 0. u, s, vt = sparsesvd(smat, 0) Traceback (most recent call last): File "", line 1, in File "sparsesvd.pyx", line 39, in sparsesvd.sparsesvd (sparsesvd.c:2239) File "stringsource", line 247, in View.MemoryView.array_cwrapper (sparsesvd.c:5951) File "stringsource", line 147, in View.MemoryView.array.cinit (sparsesvd.c:4887) ValueError: Invalid shape in axis 0: 0. smat[0,0]=1.0 u, s, vt = sparsesvd(smat, 2) u array([[ 3.12553483e-04, 9.99999951e-01]])

piskvorky commented 11 years ago

@pminervini almost certainly; let us know if you find something 8)

pminervini commented 11 years ago

@piskvorky Maybe you can be intersted in GraphLab (for distributed environments) and GraphChi (for multicore machines): http://graphlab.org/graphchi/ they both include an highly efficient implementation of SVD (and other algorithms); AFAIK, GraphChi allows you to do the same things as SVDLIBC (possibly more efficiently), without the need to dive deep into ancient C code

piskvorky commented 11 years ago

@pminervini GraphiChi sounds great, thanks for the link! Also check out gensim for fast SVD (gensim targets topic modelling though, not collab filtering).

panamantis commented 9 years ago

I had the "ValueError: Invalid shape in axis 0: 0." . It happens when I blend two matrixes and the one has vector weights as 0. I added a slight offset 0.0000001 and it works now.