Consensus of row vs column -wise operations

quinngroup / dr1dl-pyspark

Dictionary Learning in PySpark

Apache License 2.0

1 stars 1 forks source link

Consensus of row vs column -wise operations #51

Closed magsol closed 8 years ago

magsol commented 8 years ago

Please refer to the R1DL pseudocode.

We need to make sure operations are consistent. To that end, we define

P is the number of observations
T is the number of features (or the dimensionality of the observations)
S, the input matrix, should be observations-by-features, or P by T.

Orthogonal to this are the primary steps the program takes.

Whiten and normalize the observations (rows) of S at the start.
v is of length P, which means this operation should be v = S * u_old.
u is of length T, which means this operation should be u_new = v * S
The outer product of u and v should be performed in the order outer(v, u_new) to generate a P by T matrix of residuals to update S.

MOJTABAFA commented 8 years ago

@magsol What about Z file ? because based on what we defined for Z matrix it could be a MxP matrix. considering our big data sample which has 170x39,850 dimensions, we can say that our P is 170 here, Thus our Z matrix would be 170xM. suppose that M = 100 then our Z matrix would be 170x100 and our D matrix would be 39,850x100 which are not correct. Should I change the Z dimensions to TxM instead of PxM ?

MOJTABAFA commented 8 years ago

@magsol
The second point is in v[indices] : Here also should we set the V vector like what we discussed in our meeting ? if yes in pyspark code after invoking the op_select should we do as follows :

indices = op_selectTopR(v, R)
            temp_v = np.zeros(v.shape)
            temp_v[indices] = v[indices]
            v = temp_v

but my problem is mostly in next part where we should broad cast the indices and vector. Here the question is _" should we broad cast the modified v and all it's indices or just some indices and vector elements which have been selected by _opselectTopR function ?

magsol commented 8 years ago

If we set the non-topR values to 0, we can just broadcast v and not the indices. We do the full vector-matrix multiplication with the 0s in the correct elements.

magsol commented 8 years ago

As per our earlier discussions, we'll be making row-vs-column-wise operations a command-line flag (see ticket #52).