quinngroup / dr1dl-pyspark

Dictionary Learning in PySpark
Apache License 2.0

P2: Vector-matrix multiplication #44

Closed magsol closed 8 years ago

magsol commented 8 years ago

This Spark primitive is a little trickier than #20, because the matrix is row-distributed but vector-matrix multiplication operates on the columns of the matrix.

Still, this can be done in a fairly straightforward manner.

  1. As in P1, broadcast the array u to be multiplied, e.g. sc.broadcast(u).
  2. Run a flatMap over the RDD.
  3. Each flatMap worker multiplies its row of the matrix by the corresponding element of the broadcasted vector u.
  4. Each value of the resulting vector is emitted, keyed by its element index (hence the need for flatMap instead of map).
  5. A reduceByKey then sums up the values for each key, which correspond to the elements of the resulting vector (see the sketch after this list).
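
A minimal sketch of steps 1-5 above, assuming sc is an active SparkContext, S is an RDD of (row_index, row) pairs with each row a 1-D NumPy array, u has one element per row of the matrix, and N is the number of columns; all of these names are placeholders for illustration.

    import numpy as np

    u_broadcast = sc.broadcast(u)  # step 1: ship u to every worker

    def row_contribution(indexed_row):
        # steps 2-4: scale this row by its element of u and emit one
        # (column_index, value) pair per column.
        i, row = indexed_row
        scaled = u_broadcast.value[i] * row
        return [(j, scaled[j]) for j in range(scaled.shape[0])]

    # step 5: sum the contributions per column; v[j] = sum_i u[i] * S[i, j]
    v_pairs = S.flatMap(row_contribution).reduceByKey(lambda a, b: a + b)

    v = np.zeros(N)
    for j, value in v_pairs.collect():
        v[j] = value
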
MOJTABAFA commented 8 years ago

@magsol Now I'm going to test the pyspark file in thunder, but during execution the following error appeared:

File "/home/targol/anaconda2/lib/python2.7/R1DL_Pyspark.py", line 216, in <module>
    S = S.apply(deflate, keepDType = True, keepIndex = True)
TypeError: apply() got an unexpected keyword argument 'keepDType'

The log file is as follows: testreport.txt

magsol commented 8 years ago

Can you figure out what it means?

iPhone'd


MOJTABAFA commented 8 years ago

@magsol Is it in the deflate function? When we're calling it in S.apply(), shouldn't we pass the "raw" to this function?

magsol commented 8 years ago

No. Read the error message carefully: it's complaining about unrecognized parameter names. Check the thunder documentation and see if you can figure out how to fix it.

iPhone'd


MOJTABAFA commented 8 years ago

Ok, let me check it.

MOJTABAFA commented 8 years ago

Actually the problem was just a spelling issue! In 'keepDType', the 'T' must be changed to a lowercase 't'. I'll correct it in the main file.
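
For reference, the corrected call would then read as follows (assuming the keyword thunder expects is spelled keepDtype, per the fix above):

    S = S.apply(deflate, keepDtype=True, keepIndex=True)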

MOJTABAFA commented 8 years ago

Now I've tested the code again on the small test1 pattern, and z there is much better than our previous z file! Now z.txt is a sparse matrix. I'm going to test the bigger data set; the results for the first small data set are as follows: Z.txt

magsol commented 8 years ago

Excellent work!

MOJTABAFA commented 8 years ago

@magsol After a long execution on my laptop, the z matrix for the big data is now around 40MB, so it's not possible to put it here. However, the results are not similar to our small data sets: where the small data set answers were totally satisfactory, the big data result is suspicious. The parameters were m=100, n=0.07, e=0.01, and I didn't specify Row and Col; some part of the result is as follows:

-49.625731  -90.950085  -107.148851 -22.263390  -27.960949  -74.573206  -35.491131  -1.820312   106.864215  14.072931   171.595561  -93.018838  -3.851441   281.873055  212.104157  375.934794  -69.247916  -79.771974  -27.565432  335.760112  330.057942  255.645971  129.561707  23.689732   39.457266   338.431347  358.045253  -16.198390  211.919775  120.124855  66.542751   282.075863  378.395402  -94.307979  -2.779630   -11.584412  185.832728  279.141163  101.102970  -99.788754  -82.138987  99.249246   175.284746  101.319492  -94.943044  -29.128951  26.582609   -22.439812  16.184655   -30.774730  -42.659585  -28.481978  -76.469311  -137.889147 -69.109695  -74.959590  -93.705282  -121.603436 -149.070855 -55.650968  4.239743    -17.991413  -64.647887  -55.436329  -55.543341  -233.434969 -226.427454 -73.695304  -141.986671 -140.047461 -242.440411 -280.187721 -196.235706 89.043456   -22.907281  -11.296129  -80.976172  -138.241792 -352.324480 -125.427455 43.500121   -186.793748 -112.535951 -205.595161 -278.406738 -371.797682 -80.563537  48.026023   287.180729  178.378065  121.456420  87.679904   -109.481793 -114.439424 11.187516   282.435522  -78.271834  -78.662650  -222.487548 -393.253565 
magsol commented 8 years ago

> After a long execution on my laptop, the z matrix for the big data is now around 40MB, so it's not possible to put it here.

I'm not sure what that means.

> However, the results are not similar to our small data sets: where the small data set answers were totally satisfactory, the big data result is suspicious. The parameters were m=100, n=0.07, e=0.01, and I didn't specify Row and Col; some part of the result is as follows:

How does it compare to what we see in the milestone 2 output?

MOJTABAFA commented 8 years ago

@MAGSOL

The answers to those 2 questions:

magsol commented 8 years ago

If the quality changes with the size (i.e., the results are better with small data than with large data), it may be a resource issue, although that still seems odd: Spark is a deterministic framework, so the quality of the results shouldn't degrade with data volume.

Still, we need more testing. I'm on the road again tomorrow, but almost have a spark cluster ready at UGA. Hopefully in the next day or two. In the meantime, I'll run this on my office desktop; it has 32GB memory and 8 cores, so it should scale reasonably well.

If you and Xiang could start working on unit tests, that would be great. Small ones are fine for now: take a fraction of the input we have, get an expected output, and then have the program run it and test whether the two outputs are equal within a certain tolerance (e.g. 6 decimal places).

On Sat, Dec 26, 2015 at 14:09 MOJTABAFA notifications@github.com wrote:

@MAGSOL https://github.com/MAGSOL 1.The Z file size is around 40MB , So I cannot drag and drop it here ( the maximum acceptable size for a repository ticket is around 10 MB ).

> 2. In milestone 2, the z output values for both the small and big data samples were similar in the value of each element; however, the dimensions were different. Now, in milestone 3, the answers for the small test sets are much better than the milestone 2 answers and extremely close to what Xiang provided as the ground-truth Z answers. For the big data, however, although the Z dimensions are the same as in milestone 2, the element values are different. It may be because of resource problems on my laptop or other reasons, but as I told you before, my laptop is not suitable for testing now and it takes a lot of time. Anyway, I'll try to test it again and will let you know.


iPhone'd
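
A minimal sketch of the tolerance-based comparison described above; the file names and the loader are hypothetical placeholders, not the project's actual test layout.

    import numpy as np

    def test_small_input_matches_expected():
        # Hypothetical paths: a small slice of the input with a precomputed
        # expected output, alongside the output the program actually produced.
        expected = np.loadtxt("tests/expected_Z_small.txt")
        actual = np.loadtxt("tests/actual_Z_small.txt")
        # Equal within roughly 6 decimal places, as suggested above.
        np.testing.assert_allclose(actual, expected, atol=1e-6)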

MOJTABAFA commented 8 years ago

@magsol I already talked with Milad; tomorrow I'll go to the university and try to check the code on the lab server. Moreover, I checked the big data file, and there is one point I wanted to ask your opinion about: the small file was tall and skinny with dimensions of (100, 5), whereas the big data file is short and fat with dimensions of (170, 39850) (I mean the number of rows is much smaller than the number of columns). Could that be a reason for the uncertain results? I read in a paper that Spark's answers in matrix multiplications are always better for tall and skinny matrices.

magsol commented 8 years ago

Hmm, that's a good question. However, the fact that the very nature of the data has changed has me a little worried. I thought, in general, the number of rows (data points) would far exceed the number of columns (features)? It seems like, in these two datasets, they have roughly the same number of data points (100 vs 170) but hugely differing dimensions. Is that truly the case, or have the data been accidentally transposed?
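
A quick way to check, as a sketch only; the file path and the tab delimiter are assumptions, mirroring the tab-separated loader that appears later in this thread.

    import numpy as np

    # Load the big input file and inspect its layout: rows should be data
    # points and columns should be features.
    X = np.loadtxt("big_input.txt", delimiter="\t")
    print(X.shape)
    # If the file turns out to be transposed, X.T restores the intended layout.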


iPhone'd

MOJTABAFA commented 8 years ago

@magsol Actually I don't know why, but when using the transposed data the following error appears:

  File "/home/targol/spark-1.5.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py", line 2089, in 
<genexpr>
  File "/home/targol/anaconda2/lib/python2.7/R1DL_Pyspark.py", line 26, in <lambda>
    .map(lambda x: np.array(map(float, x.strip().split("\t")))) \
ValueError: could not convert string to float: 

Moreover, it's really difficult and time-consuming for me to test on my laptop because of the lack of resources.

magsol commented 8 years ago

It looks like there's a non-float character that we're trying to cast to a float, e.g. float("?") or something like that.
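
One possible guard, as a sketch only: the blank value at the end of the error message suggests an empty token (e.g. a trailing tab or a blank line), so the loader could skip empty tokens before casting. The input_path name and this filtering are assumptions, not the project's actual fix.

    import numpy as np

    # Sketch: the same tab-separated loader as above, but skipping empty tokens
    # and blank lines so a stray trailing tab or empty row never reaches float().
    raw = sc.textFile(input_path)
    S = raw.map(lambda line: np.array([float(tok) for tok in line.strip().split("\t") if tok])) \
           .filter(lambda row: row.size > 0)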

Nonetheless, I hear you loud and clear. I'm sorry I haven't had time to finish setting up my cluster, but that's still in progress. I should have some news for you today or tomorrow.