quinngroup / dr1dl-pyspark

Dictionary Learning in PySpark
Apache License 2.0

Eliminate all possible explicit loops #16

Closed magsol closed 8 years ago

magsol commented 8 years ago

In favor of vectorized operations (via numpy and scipy).

magsol commented 8 years ago

At first glance, the explicit loop in lines 44-45 can definitely be eliminated and replaced with a call to sla.norm(), but on further inspection it looks like the entire op_VCTl2diff function can be replaced by this method call.

For some vector x, sla.norm(x) computes the L2 norm. However, you can also pass a full algebraic expression, such as the difference x - y, as the argument. For example:

sla.norm(x - y)

computes the L2 norm of the difference between the two vectors, which is exactly what op_VCTl2diff does. Thus, you can replace the entire function with this single line: please delete op_VCTl2diff and replace its use on line 115 with the equivalent call to sla.norm. This will hugely improve computation time for larger datasets.
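To illustrate (a minimal, self-contained sketch; x and y are placeholder vectors, and sla is assumed to be scipy.linalg, matching the sla.norm usage in the code):

import numpy as np
import scipy.linalg as sla

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, 5.0])

# Explicit-loop version of the L2 difference (the kind of loop to eliminate).
total = 0.0
for xi, yi in zip(x, y):
    total += (xi - yi) ** 2
loop_result = total ** 0.5

# Vectorized equivalent: the L2 norm of the difference vector.
vectorized_result = sla.norm(x - y)

assert abs(loop_result - vectorized_result) < 1e-12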

MOJTABAFA commented 8 years ago

Thanks, it has already been tested and applied.

MOJTABAFA commented 8 years ago

I think the only remaining loop that may cause difficulties on large datasets is "op_getResidual"; the problem there is with the specific "j". However, since we run this loop for a very limited number of iterations (equal to the number of elements in idxs_n), I don't think it will be a big predicament for us. Am I right?

magsol commented 8 years ago

Good question. I see a couple of possible routes, but one in particular stands out if my understanding of the pseudocode is correct.

@LindberghLi, this is a question for you: in your pseudocode.txt file on line 35, you have a deflation step that consists of S = S - (u_new * v). Is u_new * v an outer product of the two vectors? Based on the corresponding code in R1DL.cpp (line 91), that appears to be the case; I just want to confirm.

Obviously, the outer product is only "partial", as some of the elements of v are withheld according to which R elements are selected; the others will be 0. But if we can, in essence, create a v_temp where only the top R indices are retained and all others are set to 0, we could create a matrix UV that is the outer product of u_new and v_temp, and then perform the update as S = S - UV.

XiangLi-Shaun commented 8 years ago

@magsol Right. For

S = S - (u_new * v)

S is of dimension T*P, and so is the output of

u_new * v

thus u_new * v is the outer (tensor) product of the two vectors.
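As a quick shape check (a minimal sketch; T and P here are arbitrary placeholder sizes, not values from the actual data):

import numpy as np

T, P = 5, 8                  # placeholder sizes; S is T x P
S = np.random.rand(T, P)
u_new = np.random.rand(T)    # length-T vector
v = np.random.rand(P)        # length-P vector

UV = np.outer(u_new, v)      # the outer (tensor) product
assert UV.shape == S.shape   # both are (T, P), so S - UV is well-defined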

magsol commented 8 years ago

@LindberghLi, thanks for the confirmation.

@MOJTABAFA:

As a consequence of Xiang's comments, we can refactor out the two loops in op_getResidual. There are a couple of things to take into account.

In particular, we have to be careful when we compute the outer product of u and v so that only the top R elements of v are used; all others should be 0.

Here's what you can do to eliminate the loops:

  1. Define a new vector, v_sparse, that is the same length as v and contains all 0s.
  2. Transfer the top R elements of v over to their correct positions in v_sparse. You can do this using the first R elements of the idxs_n array: v_sparse[idxs_n[:R]] = v[idxs_n[:R]]. If this code is unclear, let me know.
  3. Subtract off the outer product of u and v_sparse from S: S = S - np.outer(u, v_sparse).

That's it: both loops are gone, replaced by 3 lines of code (plus a return statement). Commit this change as soon as you can.
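Putting those three steps together, the loop-free version of op_getResidual might look roughly like this (a minimal sketch; the function signature and argument names are inferred from this discussion, not copied from the repo):

import numpy as np

def op_getResidual(S, u, v, idxs_n, R):
    # S: (T, P) matrix, u: length-T vector, v: length-P vector,
    # idxs_n: indices of v ordered so its first R entries are the top-R positions.
    v_sparse = np.zeros_like(v)              # step 1: all zeros, same length as v
    v_sparse[idxs_n[:R]] = v[idxs_n[:R]]     # step 2: keep only the top R elements of v
    S = S - np.outer(u, v_sparse)            # step 3: subtract the (partial) outer product
    return S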

MOJTABAFA commented 8 years ago

@magsol In the initialization phase, v is a vector of dimension (1, P). However, immediately after the while loop begins, v is dotted with S, so its dimensions change to (T, P). Now two questions:

  1. Is our dot code correct?
  2. Should I set the v_sparse dimensions to T and P? If so, v_sparse is no longer a vector but a matrix.

MOJTABAFA commented 8 years ago

Even in this case, an error still occurs:

File "R1DL.py", line 56, in op_getResidual
    v_sparce[idxs_n[:R]] = v[idxs_n[:R]]
ValueError: shape mismatch: value array of shape (7902,) could not be broadcast to indexing result of shape (7902,39510)

magsol commented 8 years ago

For some reason the vector v on the right is a matrix; it should not be a matrix. Please do some debugging.

That's what "shape mismatch" means: you're trying to put a matrix (on the right) into a vector (on the left).

MOJTABAFA commented 8 years ago

@magsol Sorry, forget about this post; it was my fault. It's correct, and the v dimensions after the dot operation are still as before. I'm checking the error right now. Thanks.

magsol commented 8 years ago

Check your capslock please.

On Dec 7, 2015, at 14:07, MOJTABAFA notifications@github.com wrote:

@magsol SORRY FORGET ABOUT THIS POST , IT WAS MY FAULT. THAT'S CORRECT AND STILL THE V DIMENSION ARE AS BEFORE. i'M CHECKING THE ERROR NOW , THANKS

MOJTABAFA commented 8 years ago

Thanks, I've edited the post.

MOJTABAFA commented 8 years ago

@magsol
That's already been checked and tested; I'll push the code now. However, the z output is still not an accurate answer.

magsol commented 8 years ago

I'm not sure what this comment means?

MOJTABAFA commented 8 years ago

I mean that our z matrix output is still as follows:

z = [[ -6.91418494e-01 -7.37880061e-01 -7.03670065e-01 ..., 7.55630072e-01 4.80989673e-01 4.10549214e-01]
 [ -2.25265078e-01 -2.64283766e-01 -2.56007025e-01 ..., 9.40155527e-02 2.15154778e-01 -2.03920218e-04]
 [ 6.51218469e-01 6.92401988e-01 6.58961860e-01 ..., -8.74613251e-02 -4.46588839e-01 -2.99345990e-01]
 ...,
 [ 1.19219153e-01 9.09684072e-02 6.03731311e-02 ..., 5.66452406e-02 -3.16946257e-02 -1.68322646e-01]
 [ -5.14161746e-02 -2.65044110e-02 -8.81680555e-02 ..., 1.19447575e-01 4.21902083e-03 1.20995168e-01]
 [ 7.56348162e-02 4.24608032e-02 1.81032211e-02 ..., 3.16473653e-02 -3.99591561e-02 -2.20251116e-04]]

totoalResidual = 6823.54743273

However, as far as I know, the z matrix should be sparse (this is also clear from the expected test results in the test1 and test2 folders).

XiangLi-Shaun commented 8 years ago

@MOJTABAFA The values in v (and correspondingly in z) are not explicitly set to be sparse, but are the result of:

idxs_n = op_selectTopR(v, R)
u_new = np.dot(S[:, idxs_n], v[idxs_n])
u_new = u_new / sla.norm(u_new, axis = 0)

which gives a basis that can potentially support a sparse coefficient; then, via:

v = np.dot(u_old, S)

v "should" become a sparse vector. Thus, values like -2.20251116e-04 are entirely acceptable in the result. As for incorrect results like -6.91418494e-01, we'll investigate them further once the code optimization is finished.

MOJTABAFA commented 8 years ago

@LindberghLi OK, thanks for your explanation.

magsol commented 8 years ago

Shouldn't that operation be 'v = np.dot(u_new, S)', not u_old?

XiangLi-Shaun commented 8 years ago

@magsol u_old is assigned the values of u_new before v = np.dot(u_old, S).

magsol commented 8 years ago

Ok, thanks--I'm mobile right now and don't have access to the code :)
