quinngroup / dr1dl-pyspark

Dictionary Learning in PySpark
Apache License 2.0

Normalization functions #9

Closed MOJTABAFA closed 8 years ago

MOJTABAFA commented 8 years ago

Dear Dr. Quinn: Regarding the normalization functions, it seems they are mostly heuristic, designed from experience to fit this particular problem. Thus it isn't possible to find exact equivalents for these functions in NumPy or SciPy. Therefore I think I should convert Xiang's normalization functions line by line. For example, I wrote the following for "stat_normalize2l2NormVCT":

import numpy as np

vct_input = np.array([0, 1, 2, 5, 0], dtype=float)
T = 5
double_l2norm = 0
for t in range(T):
    double_l2norm = vct_input[t] * vct_input[t] + double_l2norm
    print(vct_input[t])
double_l2norm = np.sqrt(double_l2norm)

for t in range(T):
    vct_input[t] = vct_input[t] / double_l2norm
print(vct_input)

===================={ output }==========
0.0
1.0
2.0
5.0
0.0
[ 0.          0.18257419  0.36514837  0.91287093  0.        ]
[Finished in 0.3s]
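For comparison, the same l2 normalization can likely be collapsed into a single vectorized step; this is a sketch, not part of Xiang's original code:

```python
import numpy as np

vct_input = np.array([0, 1, 2, 5, 0], dtype=float)

# Vectorized equivalent of the explicit loops: divide the whole
# vector by its l2 norm in one shot.
normalized = vct_input / np.linalg.norm(vct_input)
print(normalized)  # same values as the loop output above
```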

magsol commented 8 years ago

Thanks Mojtaba, I'll check it in a little while.

iPhone'd

On Nov 21, 2015, at 20:12, MOJTABAFA notifications@github.com wrote:

Assigned #9 to @magsol.


magsol commented 8 years ago

Normalizing a vector by its l2 norm is the same thing as making the vector unit length; you're dividing each element of the vector v[i] by the magnitude of the vector ||v||. To do this, use SciPy's linear algebra library:

import numpy as np
import scipy.linalg as sla

a = np.array([1, 2, 3, 4, 5], dtype = np.float)
print sla.norm(a)  # "2.397587827269453"
b = a / sla.norm(a)
print sla.norm(b)  # "1.0"

So the vector b is the normalized, unit-length version of the vector a.
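One edge case the snippet above doesn't cover: an all-zero vector has norm 0, so the division would produce NaNs. A guarded sketch (the helper name is mine, not from any library):

```python
import numpy as np

def l2_normalize(v):
    """Return v scaled to unit length; leave an all-zero vector unchanged."""
    n = np.linalg.norm(v)
    return v if n == 0 else v / n

print(l2_normalize(np.array([3.0, 4.0])))  # [0.6 0.8]
print(l2_normalize(np.zeros(3)))           # [0. 0. 0.] -- no NaNs
```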

MOJTABAFA commented 8 years ago

@LindberghLi Xiang: would you please check Dr. Quinn's comment above on the normalization function? Please let me know whether this kind of normalization satisfies your problem; then I can work on optimizing the normalization functions and adding them to the main program.

magsol commented 8 years ago

Just FYI: Kind of like other social media, you can tag people in notes with the '@' symbol, @MOJTABAFA.

MOJTABAFA commented 8 years ago

OK, thanks. I will update the comment now.

MOJTABAFA commented 8 years ago

@magsol I copied your code into my Sublime, but there is an error. I'm not sure, but I think I didn't install SciPy. Am I right? Error:

Traceback (most recent call last):
  File "C:\Users\Mojtaba Fazli\Desktop\normalization.py", line 2, in <module>
    import scipy.linalg as sla
  File "C:\Anaconda3\lib\site-packages\scipy\linalg\__init__.py", line 172, in <module>
    from .misc import *
  File "C:\Anaconda3\lib\site-packages\scipy\linalg\misc.py", line 5, in <module>
    from .blas import get_blas_funcs
  File "C:\Anaconda3\lib\site-packages\scipy\linalg\blas.py", line 155, in <module>
    from scipy.linalg import _fblas
ImportError: DLL load failed: The specified module could not be found.

MOJTABAFA commented 8 years ago

... but the "conda list" command shows that SciPy is installed on my system.

XiangLi-Shaun commented 8 years ago

@MOJTABAFA

The code by Shannon is good for the l-2 norm normalization, but you'll need a similar route for the zero-mean normalization as well.
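A minimal sketch of that zero-mean step, assuming it means subtracting each column's mean (the exact convention isn't specified here):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 10.0]])

# Subtract each column's mean so every column averages to zero.
X_centered = X - X.mean(axis=0)
print(X_centered.mean(axis=0))  # [0. 0.]
```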

MOJTABAFA commented 8 years ago

@magsol Now it works: I had to change the SciPy import to:

from numpy import linalg as sla

MOJTABAFA commented 8 years ago

@magsol But I don't know why its answers are different from those you mentioned:

import numpy as np
from numpy import linalg as sla

a = np.array([1, 2, 3, 4, 5], dtype = np.float)
print(sla.norm(a))  # "2.397587827269453"
c = sla.norm(a)
b = a / c
print(b)  # "1.0"

=========={ output }==========
7.4161984871
[ 0.13483997  0.26967994  0.40451992  0.53935989  0.67419986]
[Finished in 0.2s]
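For what it's worth, the 7.4161984871 printed here is the correct l2 norm of [1, 2, 3, 4, 5], since sqrt(1 + 4 + 9 + 16 + 25) = sqrt(55) ≈ 7.416; the "2.397587827269453" in the inline comment appears to come from a different array. A quick check:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5], dtype=float)

# Compute the l2 norm by hand and via the library call; they agree.
print(np.sqrt((a * a).sum()))  # 7.416198487095663
print(np.linalg.norm(a))       # 7.416198487095663
```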

MOJTABAFA commented 8 years ago

@LindberghLi What about this for mean-zero normalization?

import numpy as np

y = np.random.randn(10, 10)
print(y)
normed = (y - y.mean(axis=0)) / y.std(axis=0)
print('normed mean =', normed.mean(axis=0))
print('normed std =', normed.std(axis=0))

======================{ output }=============
[[ 0.67912547 -0.51589505 -0.4424499  -0.16515243 -1.34762102  0.22626589  0.34721551  1.45637866 -0.24009679  0.21131739]
 [ 0.5058812  -0.77646398  0.47343891 -0.3821469   1.60240853 -0.54143379  0.15397853 -0.34675699 -0.47872134 -0.20981917]
 [ 1.55658057  0.14938842  0.1395679   0.03975734  0.10721487 -0.16563412  1.21940819 -0.47438863 -1.13381981 -0.20517275]
 [ 0.85730614  0.01776607  1.22002908  0.9858714   0.43821209 -0.23075819  1.20476702 -2.01791451  1.39054771 -1.49560731]
 [-1.49584899 -1.70729191 -0.36759594  0.44967996 -1.16665163 -0.47875628 -0.77648296  0.32686771  0.48212816  1.61136346]
 [ 0.96703504  0.35095139  0.38318928  0.94518336 -1.72319926 -0.15169197 -1.9715908  -0.62311711 -0.52933993 -0.05238334]
 [ 0.24992697 -1.4416581  -0.56934585  1.81037335  0.67048827  2.04979197  0.8786347  -1.27356192 -0.30720224 -1.54699837]
 [-0.54240094 -0.19582847  1.39024218 -1.7890984   0.39088153 -0.05736905  0.64651929  0.53540127  1.02180067  0.50595341]
 [ 0.00660171  0.56305246  1.84318845  0.30656014 -0.58597558 -0.83547812 -0.42220257 -0.60727885 -0.39588576  0.03611984]
 [-0.6444746  -0.31379341 -0.57833217  1.27285969 -0.68151022  1.52165882 -0.21176998  0.17241932 -1.27050344  0.90365203]]
normed mean = [ 8.88178420e-17 -5.68989300e-17  8.88178420e-17  5.55111512e-17 -1.66533454e-17 -2.22044605e-17 -2.77555756e-17 -7.21644966e-17  0.00000000e+00 -2.22044605e-17]
normed std = [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
[Finished in 0.4s]

XiangLi-Shaun commented 8 years ago

It seems good

MOJTABAFA commented 8 years ago

@magsol Actually I got confused, because when I run Xiang's normalization algorithm my answer is as follows:

import numpy as np

mtx_input = np.arange(100).reshape(10, 10)
print('original mat= \n', mtx_input)
for p in range(10):
    double_mean = 0
    for t in range(10):
        double_mean = mtx_input[[t], [p]] + double_mean
    double_mean = double_mean / 10
    for t in range(10):
        mtx_input[[t], [p]] = mtx_input[[t], [p]] - double_mean
print('normalized mat= \n', mtx_input)

================================{ output }================
original mat=
 [[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 23 24 25 26 27 28 29]
 [30 31 32 33 34 35 36 37 38 39]
 [40 41 42 43 44 45 46 47 48 49]
 [50 51 52 53 54 55 56 57 58 59]
 [60 61 62 63 64 65 66 67 68 69]
 [70 71 72 73 74 75 76 77 78 79]
 [80 81 82 83 84 85 86 87 88 89]
 [90 91 92 93 94 95 96 97 98 99]]
normalized mat=
 [[-45 -45 -45 -45 -45 -45 -45 -45 -45 -45]
 [-35 -35 -35 -35 -35 -35 -35 -35 -35 -35]
 [-25 -25 -25 -25 -25 -25 -25 -25 -25 -25]
 [-15 -15 -15 -15 -15 -15 -15 -15 -15 -15]
 [ -5  -5  -5  -5  -5  -5  -5  -5  -5  -5]
 [  5   5   5   5   5   5   5   5   5   5]
 [ 15  15  15  15  15  15  15  15  15  15]
 [ 25  25  25  25  25  25  25  25  25  25]
 [ 35  35  35  35  35  35  35  35  35  35]
 [ 45  45  45  45  45  45  45  45  45  45]]
[Finished in 0.4s]

but when I try to do the zero-mean normalization with NumPy, the results are different:

import numpy as np

y = np.arange(100).reshape(10, 10)
print('original mat= \n', y)
normed = (y - y.mean(axis=0)) / y.std(axis=0)
print('normed mean=', normed.mean(axis=0))
print('normed std= ', normed.std(axis=0))

==================={ output }====================
original mat=
 [[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 23 24 25 26 27 28 29]
 [30 31 32 33 34 35 36 37 38 39]
 [40 41 42 43 44 45 46 47 48 49]
 [50 51 52 53 54 55 56 57 58 59]
 [60 61 62 63 64 65 66 67 68 69]
 [70 71 72 73 74 75 76 77 78 79]
 [80 81 82 83 84 85 86 87 88 89]
 [90 91 92 93 94 95 96 97 98 99]]
normed mean= [ -1.11022302e-16 -1.11022302e-16 -1.11022302e-16 -1.11022302e-16 -1.11022302e-16 -1.11022302e-16 -1.11022302e-16 -1.11022302e-16 -1.11022302e-16 -1.11022302e-16]
[Finished in 0.4s]
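If I read the two snippets correctly, the discrepancy is expected: the loop version only subtracts each column's mean, while the NumPy one-liner also divides by the column standard deviation. A side-by-side sketch (my reading, not confirmed in the thread):

```python
import numpy as np

Y = np.arange(100, dtype=float).reshape(10, 10)

centered = Y - Y.mean(axis=0)            # what the loop version computes
standardized = centered / Y.std(axis=0)  # what the NumPy snippet computes

print(centered[:, 0])      # first column: -45, -35, ..., 45 (matches the loop output)
print(standardized[:, 0])  # the same column, rescaled to unit variance
```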

MOJTABAFA commented 8 years ago

@LindberghLi

Xiang, do you have any idea about the above comment?

magsol commented 8 years ago

It looks like numpy.linalg.norm and scipy.linalg.norm have identical operations, which is good. However, we need to figure out what the bug is on your end in importing scipy because that is a critical library for other operations not present in numpy.

I haven't looked closely but I think your "version" of Xiang's algorithm has a bug in the array indexing that's resulting in different output. The numpy version you posted looks good.

Just FYI: in Python, the convention for vectors is lowercase variable names (like y), but for matrices it's uppercase (like Y). Also, when testing for zero-mean, unit-variance, you can do this:

A = np.random.random((10, 10))

print A.mean(axis = 0)
# [ 0.59888529,  0.40256814,  0.52723793,  0.5827174 ,  0.35847958,
#        0.47607431,  0.58255637,  0.51890551,  0.56916436,  0.44384175]
print A.std(axis = 0)
# [ 0.23854139  0.27000021  0.28236851  0.30656586  0.2882628   0.33507456
#  0.24044369  0.20864492  0.24725187  0.34637215]

B = (A - A.mean(axis = 0)) / A.std(axis = 0)

print B.mean(axis = 0)
# [  3.55271368e-16  -1.11022302e-16  -1.33226763e-16  -4.44089210e-17
#   6.10622664e-17  -1.11022302e-16   2.44249065e-16   5.27355937e-17
#  -2.24820162e-16   9.99200722e-17]
print B.std(axis = 0)
# [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
MOJTABAFA commented 8 years ago

@magsol

Thank you very much. Actually, I already discussed this with Xiang and the problem is solved. However, the SciPy problem still remains, and I need to meet you after the holidays to check it on my laptop. I'll push the changed code to GitHub; please check it and close the ticket if everything looks good.