quinngroup / dr1dl-pyspark

Dictionary Learning in PySpark
Apache License 2.0

Nature of input matrices #50

Closed magsol closed 8 years ago

magsol commented 8 years ago

Xiang,

I'm having a difficult time understanding what the nature of the input data to the Spark implementation should be. Until now we've been testing with smaller datasets that are "tall and skinny", i.e. a large number of rows but a small number of columns. When I wrote the pseudocode for the method on this repo's wiki, it was my assumption that T >>> P, where T is the number of rows in S and P is the number of columns.

However, in the larger datasets (including the MOTOR dataset), the number of columns is significantly larger, hence my thought that the data needed to be transposed. But it seems all the data in MOTOR is that way: "short-and-wide", i.e. the number of rows is very small relative to the number of columns.

This presents some problems with the current implementation, since we distribute the data by rows: a short-and-wide matrix means many fewer rows to spread across the nodes, and because the data are dense, each of those rows is a very large, very dense vector. We'll need to rethink the implementation to take advantage of the short-and-wide input structure IF this is the case.

So we need some clarification on the nature of the input data.
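A minimal sketch of why the shape matters under the current row-distributed approach (the file name and variable names below are illustrative, not from the repo):

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="r1dl-input-shape-check")

    # Current approach: each line of the input file becomes one dense row vector.
    S = sc.textFile("S.txt") \
          .map(lambda line: np.array([float(x) for x in line.split()]))

    # Tall-and-skinny (T >> P): many small rows, so the RDD spreads well
    # across the cluster. Short-and-wide (e.g. 170 rows x ~40k columns):
    # only ~170 elements in the RDD, each a huge dense vector, so most of
    # the cluster sits idle and each partition is very heavy.
    T = S.count()
    P = len(S.first())
    print("T = %d rows, P = %d columns" % (T, P))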

MOJTABAFA commented 8 years ago

@magsol @LindberghLi In my opinion, transposing the data will not help us; we have to change something inside our program. The reason: by transposing the data, T becomes extremely large, and so our D matrix, which has dimensions (M, T), becomes far larger than our Z matrix, which has dimensions (M, P) (in our example Z is (M, 170) and D is (M, 39850)). Based on my understanding of dictionary learning, one of the most important features of this method is having a small D (the dictionary, which contains the most important information needed to reconstruct the final output) and a large but sparse Z. So in this case, even if we could get a good, normalized answer for Z and D, it would not be compatible with the concepts of dictionary learning.

If we insist on using transposed data, there is one possible solution in my view, though I don't know whether it is theoretically valid (I need Xiang's help on this issue): swap the dimensions of Z and D, or change the length of u_old and u_new (redefine their length as P instead of T).

We still need your opinion and Xiang's.

magsol commented 8 years ago

Swapping dimensions (i.e. transpose) won't work if that's not how the data are organized. Right now, my understanding is that the data points are in the rows--so the columns are the features of each data point. If that's true, we can't transpose the data.

MOJTABAFA commented 8 years ago

@magsol @LindberghLi : Here is the result of our Python code on the 700 MB sample:

Training complete!
Writing output (D and z) files...

('z =', array([[ 0.30499039, -0.17296131, -0.42164603, ..., -0.43776458,
        -0.6151423 , -0.68651983],
       [-0.21261347,  0.48282181,  0.78018683, ...,  0.4594597 ,
         0.55608332,  0.46093213],
       [ 0.25016467, -0.26920662,  0.02505791, ..., -0.45282052,
        -0.57178337, -0.47108046],
       ..., 
       [-0.01320022, -0.01883174, -0.09951538, ...,  0.05035792,
        -0.07004709,  0.00764137],
       [ 0.0259702 ,  0.00944824, -0.02201768, ...,  0.02618121,
         0.03302137,  0.00399654],
       [-0.0133856 , -0.04231469, -0.00622333, ...,  0.04410436,
         0.00166308, -0.01278564]]), 

S dimensions : (39510, 170)
Z dimensions : (100, 170), 
D dimensions : (100, 39510)
('totoalResidual =', 9.534561218043013)

@magsol @LindberghLi As you can see, D is much larger than Z (not only in our Spark code, but also in our Python code). I don't know whether that is theoretically correct for dictionary learning.

I have attached the Z file below, but the D file is too big to post here. Z31.txt I think this shows that our program needs to be revised, and we should work on it immediately.

magsol commented 8 years ago

It does seem odd that D is roughly the same size as S, the input data. Is D supposed to be sparse?

MOJTABAFA commented 8 years ago

@magsol

  1. I already checked the program on my side and found that the printed size of S here is the original shape, not the transposed one; the transposed shape would be (170, 39510).

Based on my understanding of dictionary learning, D must not be sparse, since the dictionary is the most meaningful part of the data (it should be as small as possible, but not sparse). For example, in noise reduction the dictionary, together with a sparse matrix, reconstructs the retrieved image. D should be chosen such that it sparsifies the representation, but D itself should not be sparse (as far as I know). Xiang may be able to help us more in this area, because he knows it better than I do.

magsol commented 8 years ago

@MOJTABAFA and @LindberghLi, please see my email.

Essentially it sounds like I've confused the outputs D and Z. In reality, D should be very small and Z should be very big.

Is either of these outputs sparse? D should not be; as Mojtaba stated, it's the basis we're interested in, so sparsity usually isn't nearly as important as low dimensionality. But what about Z?

MOJTABAFA commented 8 years ago

@magsol @LindberghLi In my opinion, the Z matrix must be sparse, and D should be selected in a way that sparsifies the Z matrix. As is clear from the answers on our small test sets, Z is already sparse in our PySpark code, and even the desired answers Xiang provided for our test set confirm this. However, as I emphasized before, Xiang could help us more in this area.
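One quick way to check this empirically is to measure the fraction of (near-)zero entries in the saved outputs. A minimal sketch, assuming the attached Z31.txt is a whitespace-delimited dump of Z and using an arbitrary near-zero threshold:

    import numpy as np

    # Load the saved Z output and measure how close it is to sparse.
    Z = np.loadtxt("Z31.txt")      # assumed whitespace-delimited text dump
    near_zero = np.abs(Z) < 1e-3   # threshold chosen arbitrarily for illustration
    print("Z shape:", Z.shape)
    print("fraction of near-zero entries:", near_zero.mean())

The same check on the D output (too large to attach here) would tell us whether D is dense, as the discussion above expects.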

MOJTABAFA commented 8 years ago

@magsol Since the nature of our input matrix is short and wide, should we use a CoordinateMatrix instead of a RowMatrix?

magsol commented 8 years ago

What CoordinateMatrix are you referring to?

MOJTABAFA commented 8 years ago

Please correct me if I'm wrong: a "RowMatrix" is a row-oriented distributed matrix for the tall-and-skinny case, while a "CoordinateMatrix" should be used when both dimensions of the matrix are huge and the matrix is very sparse. However, our raw numbers are not so dense. Thus, for distributing the matrix in our program we have 2 choices: 1. distribute it with a "CoordinateMatrix" (which I don't know whether it is expensive or not), or 2. transpose the input matrix (which was not useful in our case unless we can do some manipulation inside our program). See the sketch below.
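For reference, both of these are distributed matrix types in Spark MLlib's pyspark.mllib.linalg.distributed module. A minimal sketch of constructing each from an RDD (the data and the SparkContext setup are purely illustrative):

    import numpy as np
    from pyspark import SparkContext
    from pyspark.mllib.linalg.distributed import (
        RowMatrix, CoordinateMatrix, MatrixEntry)

    sc = SparkContext(appName="distributed-matrix-types")

    # RowMatrix: an RDD of dense row vectors. Parallelism is bounded by the
    # number of rows, which is why it suits tall-and-skinny data.
    rows = sc.parallelize([np.array([1.0, 2.0, 3.0]),
                           np.array([4.0, 5.0, 6.0])])
    row_mat = RowMatrix(rows)

    # CoordinateMatrix: an RDD of (i, j, value) entries, intended for matrices
    # that are huge in both dimensions and very sparse.
    entries = sc.parallelize([MatrixEntry(0, 0, 1.0),
                              MatrixEntry(1, 2, 6.0)])
    coord_mat = CoordinateMatrix(entries)

    print(row_mat.numRows(), row_mat.numCols())
    print(coord_mat.numRows(), coord_mat.numCols())

One design consideration: for a dense matrix, a CoordinateMatrix stores every entry as an explicit (i, j, value) triple, which adds memory overhead, though it does decouple the partitioning from the row structure.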

magsol commented 8 years ago

I'm still not sure where these matrices are coming from. What APIs are they from?

milad181 commented 8 years ago

Please find attached the Spark log file from running the current code over the MOTOR/6_MOTOR_whole_b_signals.txt file from the HCP dataset.

MOTOR/6_MOTOR_whole_b_signals.txt HCP_spark_6MOTOR.txt

magsol commented 8 years ago

This line seems to be the key:

ValueError: invalid literal for float(): 4354.61 3632.15 5220.98 4932.81 4111.09 3178.43 2549.94 5731.21 5060.16 3922.88 2745.32 4170.19 7045.37 6105.76 5199.74 4043.28 2655.23 4719.01 4308.34 6638.49 5953.33 5218.46 4464.02 4480.57 5859.11

The error is originating from line 26, reproduced below:

.map(lambda x: np.array(map(float, x.strip().split("\t")))) \

My first guess is that the columns in the text file are not delimited by tabs (\t) but instead by spaces. In that case, the call to split("\t") would have no effect, and float would attempt to cast a lengthy string of multiple floating-point values into a single number, resulting in the error.

It's a simple fix--change split("\t") to split(). In fact, it should have been the latter from the beginning--the default behavior is to split on any whitespace. By explicitly setting the split value to the tab character, we effectively ignore any other forms of whitespace (i.e. spaces).
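A minimal sketch of the corrected parsing step (the file path and variable name are illustrative):

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="parse-fix")

    # split() with no argument splits on any run of whitespace (tabs or
    # spaces), so a space-delimited file such as 6_MOTOR_whole_b_signals.txt
    # parses cleanly into one float per column.
    S = sc.textFile("6_MOTOR_whole_b_signals.txt") \
          .map(lambda x: np.array([float(v) for v in x.strip().split()]))

    print(S.first()[:5])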