ozcan / scipy-cluster

Automatically exported from code.google.com/p/scipy-cluster

Correlation distance errors, confusion and possible fix #3

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?

import numpy
import hcluster

x1 = numpy.random.randn(10,)
x2 = numpy.random.randn(10,)
xx = numpy.vstack((x1, x2))

# first error
hcluster.correlation(x1, x2)

# second error
hcluster.pdist(xx, 'correlation')

What is the expected output? What do you see instead?
I expected 1 minus the Pearson correlation coefficient.

Error #1
usr/lib/python2.5/site-packages/hcluster/cluster.py in correlation(u, v)
    830     vm = v - vmu
    831     return 1.0 - (scipy.dot(um, vm.T) / (math.sqrt(scipy.dot(um, vm).T)) \
--> 832             * math.sqrt(scipy.dot(vm, vm.T)))

Error #2
usr/lib/python2.5/site-packages/hcluster/cluster.py in pdist(X, metric, p, V, VI)
   1372             dm = squareform(dm)
   1373         elif mstr in set(['correlation', 'co']):
-> 1374             X2 = X - numpy.repmat(numpy.mean(X, axis=1).reshape(m, 1), 1, n)
   1375             norms = numpy.sqrt(numpy.sum(X2 * X2, axis=1))
   1376             _cluster_wrap.pdist_cosine_wrap(X2, dm, norms)

<type 'exceptions.AttributeError'>: 'module' object has no attribute 'repmat'

What version of the product are you using? On what operating system?
Python2.5, numpy1.0.5.dev, hcluster (current svn), linux, 32

Please provide any additional information below.

I don't really follow the documentation's description in terms of the
Manhattan norm :-), so I'm just assuming the intended result is 1 minus the
Pearson correlation coefficient. If that's right, here is my fix (diff):

--- cluster.py  (revision 90)
+++ cluster.py  (working copy)
@@ -828,8 +828,8 @@
     umu = u.mean()
     um = u - umu
     vm = v - vmu
-    return 1.0 - (scipy.dot(um, vm.T) / (math.sqrt(scipy.dot(um, vm).T)) \
-            * math.sqrt(scipy.dot(vm, vm.T)))
+    return 1.0 - (scipy.dot(um, vm.T) / ((math.sqrt(scipy.dot(um, um.T))) \
+            * math.sqrt(scipy.dot(vm, vm.T))))

 def hamming(u, v):
     """
@@ -1371,7 +1371,7 @@
             dm[xrange(0,m),xrange(0,m)] = 0
             dm = squareform(dm)
         elif mstr in set(['correlation', 'co']):
-            X2 = X - numpy.repmat(numpy.mean(X, axis=1).reshape(m, 1), 1, n)
+            X2 = X - X.mean(1)[:,numpy.newaxis]
             norms = numpy.sqrt(numpy.sum(X2 * X2, axis=1))
             _cluster_wrap.pdist_cosine_wrap(X2, dm, norms)
         elif mstr in set(['mahalanobis', 'mahal', 'mah']):

Arnar
arnar.flatberg@gmail.com

Original issue reported on code.google.com by arnar.fl...@gmail.com on 20 Feb 2008 at 3:14

GoogleCodeExporter commented 8 years ago
Arnar,

Thanks for your bug report.

1. <type 'exceptions.AttributeError'>: 'module' object has no attribute 
'repmat':

Response:

I learned that repmat has been moved out of the main numpy module into
numpy.matlib. This problem has been fixed in the latest SVN. I will do a
release later tonight that will include this fix.

-------------------------------
2. import numpy
import hcluster

x1 = numpy.random.randn(10,)
x2 = numpy.random.randn(10,)
xx = numpy.vstack((x1, x2))

# first error
hcluster.correlation(x1, x2)

Response:

First, you know you can do xx = numpy.random.randn(2,10)?

Second, you're right: the output of hcluster.correlation does not look right.
In fact, once I fix problem (1), I get a different output.

---------------------------
3. I don't really follow the documentation's description in terms of the Manhattan norm :-)

Response:

The Manhattan norm of a vector x is ||x||_1=\sum_{i=1}^{n}{|x_i|}, that is,
the sum of the absolute values of the elements of the vector. It is the city
block walking distance between the origin and the point x.
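
For illustration, here is a minimal sketch of the Manhattan (L1) norm as it is conventionally defined: the sum of the absolute values of the elements. The helper name manhattan_norm is hypothetical and not part of hcluster, and the snippet assumes a current NumPy on Python 3 rather than the Python 2.5 setup in the report. It also computes the plain mean of the same vector to show that the two quantities are not interchangeable once the elements have mixed signs.

```python
import numpy


def manhattan_norm(x):
    """Return the L1 (Manhattan) norm: the sum of absolute values.

    Hypothetical helper for illustration; not an hcluster function.
    """
    x = numpy.asarray(x, dtype=float)
    return numpy.abs(x).sum()


x = numpy.array([3.0, -4.0, 1.0])
l1 = manhattan_norm(x)               # |3| + |-4| + |1|
l1_numpy = numpy.linalg.norm(x, 1)   # NumPy's built-in L1 norm for vectors
mean = x.mean()                      # (3 - 4 + 1) / 3, a different quantity
```

Here l1 and l1_numpy agree, while mean is zero because the signed elements cancel.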

---------------------------
4. What is the expected output? What do you see instead? I expected
1 minus the Pearson correlation coefficient.

Response:

I think it was a parenthesis error. It's supposed to be the correlation
coefficient. The problem has been fixed, and now its output corresponds
with MATLAB's.
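
As a rough sketch of what the corrected computation looks like, the snippet below computes 1 minus the Pearson correlation coefficient for two 1-D vectors and cross-checks it against numpy.corrcoef. The function name correlation_distance is an assumption made for this example and is not the hcluster source. It uses current NumPy on Python 3, and the seed is fixed only to make the run repeatable.

```python
import numpy


def correlation_distance(u, v):
    """1 - Pearson correlation coefficient between 1-D vectors u and v.

    Illustrative sketch only; not the hcluster implementation.
    """
    u = numpy.asarray(u, dtype=float)
    v = numpy.asarray(v, dtype=float)
    um = u - u.mean()   # center each vector on its own mean
    vm = v - v.mean()
    # Both norms belong in the denominator. The expression in the traceback
    # took sqrt(dot(um, vm)) instead of sqrt(dot(um, um)) and, because of the
    # parenthesis placement, multiplied by the second norm.
    denom = numpy.sqrt(numpy.dot(um, um)) * numpy.sqrt(numpy.dot(vm, vm))
    return 1.0 - numpy.dot(um, vm) / denom


rng = numpy.random.RandomState(0)
x1 = rng.randn(10)
x2 = rng.randn(10)

d = correlation_distance(x1, x2)
# Cross-check against NumPy's own Pearson coefficient.
expected = 1.0 - numpy.corrcoef(x1, x2)[0, 1]
```

A vector compared with itself gives a distance of about 0, and a vector compared with its negation gives about 2, which is a quick sanity check that the sign and the denominator are right.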

---------------------------
5.  + X2 = X - X.mean(1)[:,numpy.newaxis]
    - X2 = X - numpy.matlib.repmat(numpy.mean(X, axis=1).reshape(m, 1), 1, n)

Response:

Your diff fix is more memory efficient. Thanks.
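
To make the memory point concrete, here is a small sketch comparing the two ways of subtracting row means. It uses numpy.tile as a stand-in for repmat (an assumption for this example, so that it runs without numpy.matlib) and a toy 4-by-6 array. Both approaches give the same centered matrix; the broadcasting form simply avoids building the intermediate m-by-n array of repeated means.

```python
import numpy

m, n = 4, 6
X = numpy.arange(m * n, dtype=float).reshape(m, n)
row_means = X.mean(axis=1)          # shape (m,)

# repmat-style: materialize a full m-by-n array holding the repeated means.
tiled = numpy.tile(row_means.reshape(m, 1), (1, n))
X2_tiled = X - tiled

# Broadcasting: the (m, 1) column is stretched virtually during the
# subtraction, so no m-by-n temporary of means is allocated.
X2_broadcast = X - row_means[:, numpy.newaxis]
```

After centering, every row of X2_broadcast sums to zero, which is what the norms computation in pdist relies on.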

---------------------------

You should see a release later in the evening.

Damian Eads

Original comment by damian.e...@gmail.com on 25 Feb 2008 at 2:04