quinngroup / dr1dl-pyspark

Dictionary Learning in PySpark
Apache License 2.0

Comparison scripts #47

Open magsol opened 8 years ago

magsol commented 8 years ago

For a given input:

This will help identify discrepancies between the two implementations.

MOJTABAFA commented 8 years ago

@magsol Regarding our Python code: it already normalizes with axis = 0. However, in our PySpark code the axis is 1 (axis = 1). Could that cause a problem? Should I change the axis in the Python code to 1? (I assumed your answer would be yes, so I changed it to 1, but the following ValueError appeared):

  File "/home/targol/anaconda2/lib/python2.7/site-packages/statsmodels/R1DL.py", line 159, in <module>
    main()
  File "/home/targol/anaconda2/lib/python2.7/site-packages/statsmodels/R1DL.py", line 111, in main
    S = S - S.mean(axis = 1)
ValueError: operands could not be broadcast together with shapes (39510,170) (39510,) 
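Separately from the question of which axis is correct, the traceback itself is a NumPy broadcasting issue: reducing with `S.mean(axis=1)` drops the reduced axis, so a `(39510, 170)` matrix minus a `(39510,)` vector cannot broadcast. A minimal sketch of the failure and the standard fix (`keepdims=True`, or an explicit reshape); the small matrix here is just a stand-in for the real data:

```python
import numpy as np

# Small stand-in for the (39510, 170) matrix in the traceback.
S = np.random.random((5, 4))

# Reproduces the ValueError: (5, 4) minus (5,) does not broadcast,
# because NumPy aligns trailing axes (4 vs 5 mismatch).
try:
    _ = S - S.mean(axis=1)
except ValueError as e:
    print("broadcast error:", e)

# keepdims=True preserves the reduced axis as size 1, shape (5, 1),
# so the row means broadcast across the columns of each row.
centered = S - S.mean(axis=1, keepdims=True)

# Equivalent: explicitly reshape the means into a column vector.
centered2 = S - S.mean(axis=1)[:, np.newaxis]

# Each row now has zero mean.
print(np.allclose(centered.mean(axis=1), 0))
```

Whether row-centering (axis = 1) is actually what the algorithm wants here is the real question, discussed below.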
magsol commented 8 years ago

I'm having trouble following your question; are you asking why the Python implementation used S = S - S.mean(axis = 0) while the PySpark code used S.zscore(axis = 1)?

If that is your question, you can very easily test out the Python implementation to make sure it's correct (this is why we need unit tests!!!):

import numpy as np
import numpy.linalg as nla

rand = np.random.random((3, 10))  # creates 3x10 random matrix

# Goal is to normalize the columns
rand_n = rand - rand.mean(axis = 0)

# Do the rows or the columns have 0-mean?
rand_n.mean(axis = 1)  # This only prints 3 numbers, clearly coinciding with rows, NOT columns
rand_n.mean(axis = 0)  # This prints 10 numbers, and they're all close to 0. So this is correct.

rand_n = rand_n / nla.norm(rand_n, axis = 0)

# Do the rows or the columns have unit norm?
nla.norm(rand_n, axis = 1)  # Again, this only prints 3 numbers, coinciding with rows, NOT cols
nla.norm(rand_n, axis = 0)  # This prints 10 numbers, one for each col, all 1s. This is correct.

So we know the Python implementation is correct.

For the PySpark implementation, you have to figure out which axis value corresponds to rows and which to columns. The documentation says 0 runs along rows, 1 along columns, but we can test that the same way to be sure (again, this is why we need unit tests!!!).

However, it looks like zscore, while working along the axis it should, is not correctly setting the column norms to 1. It is definitely renormalizing the columns, but the norms aren't 1; they're all slightly larger than 1, so ultimately the values are off (they do appear to have 0-mean, however). This is something we need to look into more.
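One possible explanation (an assumption on my part, worth checking against the actual zscore implementation): z-scoring divides by the standard deviation, not by the L2 norm, so z-scored columns are not unit-norm in general. A zero-mean column of length n with unit population standard deviation satisfies sum(x**2) = n, so its norm is sqrt(n), which does not fully match the "slightly larger than 1" observation and so may only be part of the story. A NumPy sketch of the difference between the two normalizations:

```python
import numpy as np
import numpy.linalg as nla

rand = np.random.random((3, 10))

# Z-score each column: zero mean, unit (population) standard deviation.
z = (rand - rand.mean(axis=0)) / rand.std(axis=0)

# The columns are NOT unit-norm: with mean 0 and population std 1,
# sum(x**2) = n, so each column norm is sqrt(3) here.
print(nla.norm(z, axis=0))

# The Python implementation divides by the column norm instead:
normed = rand - rand.mean(axis=0)
normed = normed / nla.norm(normed, axis=0)
print(nla.norm(normed, axis=0))  # all 1
```

If the PySpark code relies on zscore alone, an explicit division by the column norms would still be needed to match the Python implementation.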

MOJTABAFA commented 8 years ago

@magsol @LindberghLi The Python code has already been tested on the new big data sample (MOTOR data), and we put the output file at http://1drv.ms/1P4X3KR. Given Xiang's explanation about the Z matrix in his last email, the Python answer seems satisfactory. However, we still need his comments on the Spark version.

magsol commented 8 years ago

The OneDrive link is not loading for me; it spins indefinitely.

MOJTABAFA commented 8 years ago

@LindberghLi @magsol I revised the Python code based on my discussion with Xiang, and its results seem acceptable (around 4,000 of the indices are 0 and more than 40,000 are less than 0.00009). Xiang, please check the results and let me know if they look correct: http://1drv.ms/1JYXuUO

XiangLi-Shaun commented 8 years ago

@MOJTABAFA Is the Python code on GitHub up to date? I'll need to take a look at it, because the output is supposed to be hard-coded to have many 0 elements in each row vector. Also, what parameters did you use?

magsol commented 8 years ago

@MOJTABAFA I don't see any revisions posted; the most recent commit is #83 f20ea07220bf93dbd835e8db7d8066a0c36424b7 and was mine, fixing some errors the unit tests revealed.

MOJTABAFA commented 8 years ago

@LindberghLi Sorry, I have already left the lab. I'll put the Python code on GitHub tomorrow afternoon, but the parameters I used are: e = .01, m = 100, r = .07

MOJTABAFA commented 8 years ago

@magsol Yes, you're right; because of a problem I had to leave the lab at 8:45, so I couldn't commit and push the code. I'll do that tomorrow. But the results are uploaded at the link above. Thanks.

MOJTABAFA commented 8 years ago

@LindberghLi @magsol Xiang, to preserve the original "R1DL.py" file, I created a new file named "R1DL-TR.py"; you can check it there more easily now. Thanks.

MOJTABAFA commented 8 years ago

@magsol @LindberghLi After debugging the Python code based on what we discussed in our meeting yesterday, I ran it on Xiang's big data file (~70 MB) and the results look good. They are stored at http://1drv.ms/1PzWuCG. Xiang, please confirm whether the result is acceptable now. Next, I'm going to apply the other changes to the Spark code.