@magsol Now I'm going to test the PySpark file in thunder, but during execution the following error appears:
File "/home/targol/anaconda2/lib/python2.7/R1DL_Pyspark.py", line 216, in <module>
S = S.apply(deflate, keepDType = True, keepIndex = True)
TypeError: apply() got an unexpected keyword argument 'keepDType'
The log file is as follows: testreport.txt
Can you figure out what it means?
@magsol Is it in the deflate function? When we're calling it in S.apply(), shouldn't we pass "raw" to this function?
No. Read the error message carefully. It's complaining about unrecognized parameter names. Check the thunder documentation and see if you can figure out how to fix it.
Ok, let me check it.
Actually, the problem was just a spelling error! In 'keepDType' the 'T' must be changed to a lowercase 't' (keepDtype). I'll correct it in the main file.
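For reference, the corrected call; the only change is the lowercase 't' in the keyword:

```python
# Corrected keyword spelling: keepDtype, not keepDType.
S = S.apply(deflate, keepDtype=True, keepIndex=True)
```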
Now I've tested the code again on the small test1 pattern, and the Z there is much better than our previous Z file! The Z.txt is now a sparse matrix. I'm going to test the bigger data set; the results for the first small data set are as follows: Z.txt
Excellent work!
@magsol After a long execution on my laptop, the Z matrix for the big data set is now around 40MB, so it's not possible to post it here. However, the results are not similar to our small data sets: where the small data set answers were totally satisfactory, the big data result is suspicious. The parameters were m=100, n=0.07, e=0.01, and I didn't specify the Row and Col. Part of the result is as follows:
-49.625731 -90.950085 -107.148851 -22.263390 -27.960949 -74.573206 -35.491131 -1.820312 106.864215 14.072931 171.595561 -93.018838 -3.851441 281.873055 212.104157 375.934794 -69.247916 -79.771974 -27.565432 335.760112 330.057942 255.645971 129.561707 23.689732 39.457266 338.431347 358.045253 -16.198390 211.919775 120.124855 66.542751 282.075863 378.395402 -94.307979 -2.779630 -11.584412 185.832728 279.141163 101.102970 -99.788754 -82.138987 99.249246 175.284746 101.319492 -94.943044 -29.128951 26.582609 -22.439812 16.184655 -30.774730 -42.659585 -28.481978 -76.469311 -137.889147 -69.109695 -74.959590 -93.705282 -121.603436 -149.070855 -55.650968 4.239743 -17.991413 -64.647887 -55.436329 -55.543341 -233.434969 -226.427454 -73.695304 -141.986671 -140.047461 -242.440411 -280.187721 -196.235706 89.043456 -22.907281 -11.296129 -80.976172 -138.241792 -352.324480 -125.427455 43.500121 -186.793748 -112.535951 -205.595161 -278.406738 -371.797682 -80.563537 48.026023 287.180729 178.378065 121.456420 87.679904 -109.481793 -114.439424 11.187516 282.435522 -78.271834 -78.662650 -222.487548 -393.253565
> After a long execution on my laptop, the Z matrix for the big data set is now around 40MB, so it's not possible to post it here.

I'm not sure what that means.

> However, the results are not similar to our small data sets: where the small data set answers were totally satisfactory, the big data result is suspicious.

How does it compare to what we see in the milestone 2 output?
@magsol The answers to those two questions:
1. The Z file size is around 40MB, so I cannot drag and drop it here (the maximum acceptable attachment size for an issue is around 10MB).
2. In milestone 2, the Z output values for both the small and big data samples were similar element by element, although the dimensions were different. Now, in milestone 3, the answers for the small test sets are much better than the milestone 2 answers and extremely close to what Xiang provided as the ground-truth Z. For the big data, the Z dimensions are the same as in milestone 2, but the element values are different. That may be because of resource problems on my laptop or other reasons; as I told you before, my laptop is not suitable for testing now and runs take a lot of time. Anyway, I'll try to test it again and will let you know.
If the quality changes with the size (i.e. the results are better with small data than with large data), it may be a resource issue, although that still seems odd, as Spark is a deterministic framework; the quality of the results shouldn't degrade with data volume.
Still, we need more testing. I'm on the road again tomorrow, but I almost have a Spark cluster ready at UGA, hopefully in the next day or two. In the meantime, I'll run this on my office desktop; it has 32GB of memory and 8 cores, so it should scale reasonably well.
If you and Xiang could start working on unit tests, that would be great. Small ones are fine for now: take a fraction of the input we have, get an expected output, and then have the program run it and test whether the two outputs are equal within a certain tolerance (e.g. 6 decimal places).
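For example, a minimal unit-test sketch along those lines (the file paths are hypothetical; it assumes the expected and actual Z matrices are saved as plain-text arrays):

```python
import unittest
import numpy as np

class TestR1DLSmallInput(unittest.TestCase):
    def test_z_matches_expected(self):
        # Hypothetical paths: ground-truth output for a small input fraction,
        # and the output the program actually produced for that input.
        expected = np.loadtxt("tests/expected_Z_small.txt")
        actual = np.loadtxt("tests/actual_Z_small.txt")
        # Equal within a tolerance of 6 decimal places.
        np.testing.assert_allclose(actual, expected, atol=1e-6)

if __name__ == "__main__":
    unittest.main()
```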
@magsol I already talked with Milad; tomorrow I'll go to the university and try to check the code on the lab server. Moreover, I checked the big data file, and there is one point I wanted to ask your opinion about: the small file was tall and skinny, with dimensions of (100, 5), whereas the big data file is short and wide, with dimensions of (170, 39850) (I mean the number of rows is much smaller than the number of columns). Could that be a reason for the uncertain results? I read in a paper that Spark's answers in matrix multiplications are always better with tall-and-skinny matrices.
Hmm, that's a good question. However, the fact that the very nature of the data has changed has me a little worried. I thought, in general, the number of rows (data points) would far exceed the number of columns (features)? It seems like these two datasets have roughly the same number of data points (100 vs 170) but hugely differing dimensions. Is that truly the case, or have the data been accidentally transposed?
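One quick way to rule out an accidental transpose is to load the file and inspect its shape directly (a sketch with a hypothetical file name):

```python
import numpy as np

# Hypothetical file name; tab-separated rows, as R1DL_Pyspark.py's parser expects.
X = np.loadtxt("big_data.txt", delimiter="\t")
print(X.shape)  # (rows, columns): is it (170, 39850) or (39850, 170)?
# If the axes are swapped, X.T gives the transposed view.
```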
@magsol Actually, I don't know why, but using the transposed data the following error appears:
File "/home/targol/spark-1.5.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py", line 2089, in
<genexpr>
File "/home/targol/anaconda2/lib/python2.7/R1DL_Pyspark.py", line 26, in <lambda>
.map(lambda x: np.array(map(float, x.strip().split("\t")))) \
ValueError: could not convert string to float:
Moreover, it's really difficult and time-consuming for me to test on my laptop because of the lack of resources.
It looks like there's a non-float character that we're trying to cast to a float, e.g. float("?") or something like that.
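For instance, an empty token from a blank line or a trailing tab reproduces this exact error, since float("") fails. A defensive parse (a sketch, not the project's actual code) that drops empty tokens:

```python
import numpy as np

line = "1.0\t2.0\t\t3.0\n"  # a row with a stray empty field

# float("") raises: ValueError: could not convert string to float:
# Filtering out empty tokens before casting avoids the crash.
row = np.array([float(tok) for tok in line.strip().split("\t") if tok])
print(row)  # [1. 2. 3.]
```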
Nonetheless, I hear you loud and clear. I'm sorry I haven't had time to finish setting up my cluster, but that's still in progress. I should have some news for you today or tomorrow.
This Spark primitive is a little trickier than #20. This is due to the fact that the matrix will be row-distributed, but in vector-matrix multiplication, the columns of the matrix are multiplied.
Still, this can be done in a fairly straightforward manner.
1. Broadcast the vector `u` to be multiplied, e.g. `sc.broadcast(u)`.
2. Perform a `.flatMap` over the RDD: each row's elements are multiplied by the corresponding element of `u` and emitted keyed by their column indices (hence `.flatMap` instead of `map`).
3. `reduceByKey` will then sum up the values for each key, which correspond to the elements of the resulting vector `u`.
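A minimal runnable sketch of these three steps on toy data; I'm assuming the matrix RDD holds (row index, row vector) pairs, which may differ from the project's actual layout:

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local", "vector-matrix-multiply")

# Toy data: compute u.T @ S with S distributed by rows.
u = np.array([1.0, 2.0, 3.0])
rows = [np.array([1.0, 0.0]),
        np.array([0.0, 1.0]),
        np.array([1.0, 1.0])]

u_bc = sc.broadcast(u)  # step 1: ship u to every worker

result = (sc.parallelize(list(enumerate(rows)))  # (row index, row vector) pairs
          # step 2: scale each row by its u element and key every product
          # by its column index -- one pair per element, hence flatMap.
          .flatMap(lambda kv: [(j, u_bc.value[kv[0]] * x)
                               for j, x in enumerate(kv[1])])
          # step 3: sum the per-column contributions.
          .reduceByKey(lambda a, b: a + b)
          .collect())

print(sorted(result))  # [(0, 4.0), (1, 5.0)], i.e. u.T @ S
```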