Distance and Similarity metrics

GoogleCodeExporter commented 9 years ago

Pairwise distance and similarity metrics. The following algorithms are included:
Euclidean Distance
Manhattan Distance
Cosine Distance
Jaccard Distance
Hamming Distance

Cosine Similarity Score
Euclidean Similarity Score
Pearson Correlation

All but Pearson Correlation have been double checked for accuracy against 
Matlab pdist function.

Original issue reported on code.google.com by kyle.par...@gmail.com on 3 Nov 2010 at 12:22

Attachments:

Distance_And_Similarity_Classes.zip

GoogleCodeExporter commented 9 years ago

Disclaimer: I am not an AForge developer. I do not speak for the AForge 
developers. They are perfectly capable of speaking for themselves.

I took a look at that code, and pretty quickly noticed two things:
1) Looking at «throw new ArgumentException("Input vectors must be the same 
length.");», I misinterpreted "length" as "magnitude".
2) The compiler would not complain if CustomDistance was instantiated as «new 
CustomDistance( () => {} )».

I was going to list these and go on my way, but I kept looking, and found more 
things I didn't like. (I have been accused of being a perfectionist.) 
Eventually, I gave up on just listing them and came up with the attached code. 
The functionality is the same; I saw no issues there, but the implementation is 
rather different.

changes.txt contains a list of everything I changed, and the thought processes 
behind most of those changes. It also contains a list of the ways that these 
are breaking changes.

Obviously, I like my version better -- that's why I made it -- but I could have 
missed a good reason for not making some changes. I, at least, would like to 
know what you think of this version.

Original comment by dales...@gmail.com on 4 Nov 2010 at 4:24

Attachments:

DistanceAndSimilarity.zip

GoogleCodeExporter commented 9 years ago

I like it. I think the XML comment and exception changes/additions all add 
clarity, and I think your re-structuring is more clean as well. 

The Interface vs Class approach is something I'm still debating on, and you 
probably have some thoughts on as well. I went back and forth on which approach 
to use, and ultimately I settled on classes as a "just in case" for some 
algorithm someone would want to implement down the road. I thought there might 
eventually be cases where the algorithm was complex enough to benefit from some 
internal state being added, or the user would benefit from exposing more data 
related to the algorithm than just the distance/similarity score. I don't know 
of any cases like this off the top of my head, so if this isn't a likely 
scenario, then I favor your static method approach. 

Thanks for the feedback, and the effort you put into your changes!

Original comment by kyle.par...@gmail.com on 4 Nov 2010 at 5:03

GoogleCodeExporter commented 9 years ago

@dalestan - about comments

1) Completely disagree about comments #1 and #5 (internal changes). There is 
nothing about N-dimensional arrays in the code. They are all 1-dimensional 
arrays of certain length. See WIKI:
http://en.wikipedia.org/wiki/Array_data_structure#One-dimensional_arrays
2 dimensional array means matrix, not 2 elements. Number of dimensions equals 
to number of indices you need to provide to access an element.

2) Agree about delegate for custom distance – makes it safer. However, as a 
user, I would use this class at all. It is a known fact DynamicInvoke() are 
much slower than calling interface method. Maybe someone may want to have such 
class for a quick prototyping. But if I want to use this for every pixel doing 
image processing (or maybe even video processing), then no way I am going to 
use dynamic invoke.

3) I will disagree about static methods. It is the same like someone told me 5 
years ago: “Don’t need to create IFilter interface for image processing 
routine. Just create a static function.” Let’s imagine we have a class, 
which expects distance function to be set by user? Interfaces solve it and it 
fits into Algorithm design pattern. But what about static functions?

OK, I see you solved this with delegate. What about performance in this case?

4) Agree about comments and coding style.

@kyle.parrigan

1) CosineDistance still does not implement interface.

2) I would better use x*x rather than Math.Pow(x, 2) if you care about 
performance.

3) Performance optimization:
Instead of writing this:
for (int x = 0; x < p.Length; x++)
I would better write this to be sure we don’t call property on every 
iteration:
for (int x = 0, len = p.Length; x < len; x++)
It is not important for small array, but nice to keep in mind for huge arrays.

4) Dead variables?
int n = p.Length;

5) Documentation
/// <returns>double - Cosine Similarity</returns>

Everybody will know that output is double, since it is documented anyway. 
Better write it in more human readable manner: “Returns cosine similarity 
between two specified vectors.”

6) Documentation
    /// <remarks>This class represents the Cosine Similarity metric
    /// http://en.wikipedia.org/wiki/Cosine_similarity
    /// </remarks>
    /// <para>Sample usage:</para>
    /// <code>
    /// //instantiate new similarity class
    /// CosineSimilarity sim = new CosineSimilarity(); 
    /// //build two vectors for inputs
    /// double[] p = new double[] { 2.5, 3.5, 3.0, 3.5, 2.5, 3.0 };
    /// double[] q = new double[] { 3.0, 3.5, 1.5, 5.0, 3.5, 3.0 };
    /// //call GetSimilarityScore() to return a double similarity score
    /// double similarityScore = sim.GetSimilarityScore(p, q);
    /// </code>

<code> tags are always inside of <remarks>. Check other files in the framework.

7) It is nice to document possible exceptions (which are thrown by your code) 
using <exception> tag (see framework sources).

In general about docs – see how it is done in the framework. Also nice to 
have SandCastle and run it to see what is the output.

Original comment by andrew.k...@gmail.com on 9 Nov 2010 at 4:22

Changed state: Pending

GoogleCodeExporter commented 9 years ago

> 1) Completely disagree about comments #1 and #5 (internal changes).
> There is nothing about N-dimensional arrays in the code.
> <...>
> See WIKI:
> 2 dimensional array means matrix, not 2 elements.

Please note that I said "N-dimensional *vector*", not "N-dimensional *array*". 
A vector in N dimensions is not the same thing as an array that has N 
dimensions. (-2, 3, 6) might be interpreted as a three dimensional vector of 
length 7, or a point in 3-space a distance of 7 from the origin, but it would 
never be called a "one-dimensional vector". You and I might choose to represent 
it as a one-dimensional array of length 3, but this representation neither 
causes the vector to become one-dimensional, nor causes the vector's length to 
become 3.

> 3) It is the same like someone told me 5 years ago: “Don’t need to create
> IFilter interface for image processing routine. Just create a static 
function.”

This, I believe, is a different case. IFilters store internal data, and 
BaseFilter and friends provide significant benefits to filter implementers 
(e.g. coalescing all .Apply overloads into one ProcessFilter).
ISimilarity and IDistance implementations (except CustomDistance) require no 
internal data and there are no abstract classes to hide (nonexistent) mundane 
portions of the implementation.

> ... OK, I see you solved this with delegate.
> What about performance in this case?

I know very little about performance in .NET. What I know about the x86 
processor architecture, however, leads me to believe that direct invocation of 
a delegate should take about the same amount of time as invocation of an 
interface function (call-through-pointer in both cases). Likewise, it does not 
surprise me that DynamicInvoke is slower, because it has to do lots of work 
before it can perform the call: argument count verification, casting (and its 
verification), unboxing (if necessary), and so forth.

Original comment by dales...@gmail.com on 9 Nov 2010 at 10:21

GoogleCodeExporter commented 9 years ago

> Please note that I said "N-dimensional *vector*", not "N-dimensional *array*".
> A vector in N dimensions is not the same thing as an array that has N 
dimensions.
> (-2, 3, 6) might be interpreted as a three dimensional vector of length 7, or 
a
> point in 3-space a distance of 7 from the origin, but it would never be called
> a "one-dimensional vector".
Aah, ok agree. Just messed between vector and array things. I am just afraid 
that this may cause confusion to someone else without extra clarification.

> ISimilarity and IDistance implementations (except CustomDistance) require
> no internal data
Yes, no internal data. However interfaces and abstract classes are not only 
about "hiding (nonexistent) mundane portions of the implementation". As I 
already mentioned, what if I have a class which may work with any distance 
function given? In this case passing interface seems more elegant than a 
delegate (at least for me).

Anyway, here are some facts:

1) When to Use Delegates Instead of Interfaces (C# Programming Guide)
http://msdn.microsoft.com/en-us/library/ms173173(VS.80).aspx
From this page it is still arguable what to use - delegate seems to be fine 
also.

2) Delegates vs. Interfaces in .NET
http://www.informit.com/articles/article.aspx?p=332881
Here we already can find this: "Interface calls are faster than delegate calls. 
An interface reference is a reference to an instance of an object which 
implements the interface. An interface call is not that different from an 
ordinary virtual call to an method. A delegate reference, on the other hand, is 
a reference to a list of method pointers. While invoking a delegate looks like 
you're making an indirect call through a method pointer, it's actually a 
subroutine call that walks the list of method pointers. The overhead involved 
in making the call and walking the list means that delegate invocation can be 
two or three times slower than calling a method through an interface reference."

Another thing about interfaces … I can create a class something like 
CosineMetric, which implement both IDistance and ISimilarity.

One more. Although it may not be the case here, but with interfaces it is 
really easy to find all classes at runtime implementing IDistance using 
reflection. Kind of plug-ins system.

Original comment by andrew.k...@gmail.com on 10 Nov 2010 at 10:18

GoogleCodeExporter commented 9 years ago

@kyle.parrigan

Check all your WIKI links. Some links are broken. For example this page does 
not exist:
http://en.wikipedia.org/wiki/Euclidean_similarity

Also it seems better to write:
    /// <remarks>This class represents the <a href="...">Euclidean Similarity metric</a>
    ///

instead of:
    /// <remarks>This class represents the Euclidean Similarity metric
    /// http://en.wikipedia.org/wiki/Euclidean_similarity

Original comment by andrew.k...@gmail.com on 10 Nov 2010 at 10:21

GoogleCodeExporter commented 9 years ago

The Pending status means that update of code is expected. We may discus it here 
a lot, but someone needs to take care of the code if we want to merge it into 
the framework ;)

Original comment by andrew.k...@gmail.com on 10 Nov 2010 at 10:23

GoogleCodeExporter commented 9 years ago

I'll make the changes for the links/comments tonight. Are we settled on using 
the Interface-based approach for now?

Original comment by kyle.par...@gmail.com on 10 Nov 2010 at 2:22

GoogleCodeExporter commented 9 years ago

> Are we settled on using the Interface-based approach for now? 
You are part of the discussion, so your opinion matters as well :) I've put my 
arguments and I am for interfaces.

Original comment by andrew.k...@gmail.com on 10 Nov 2010 at 2:28

GoogleCodeExporter commented 9 years ago

My opinion hasn't changed a whole lot since my first response. I still think 
that, unless we are pretty sure all future distance and similarity methods will 
be this simple, an interface-based approach is the way to go. I wasn't aware 
that there was a significant performance advantage over using delegates, so 
that further reinforces my opinion. 

One question about the CustomDistance class. Do you think we really need/want 
that functionality? The more I think about it, the more I think it would be 
easy enough for a user to just add their own distance class to the project. As 
it is now, I think just writing the distance algorithm would be the hardest 
part, and it wouldn't really be a big deal for a user to throw it into their 
own class that implements IDistance.

Original comment by kyle.par...@gmail.com on 10 Nov 2010 at 2:44

GoogleCodeExporter commented 9 years ago

As I already mentioned, as for me I would create a class implementing certain 
interface if I want to have new distance/similarity metric. Maybe a delegate 
will be 2 seconds faster to write, but I still would not keep it long term 
(only for test maybe). If I need better performance for algorithm which calls 
distance metric thousands or millions times, then I would go to interface.

Original comment by andrew.k...@gmail.com on 10 Nov 2010 at 2:53

GoogleCodeExporter commented 9 years ago

Btw, if there any unit tests available, they will be really welcome.

Original comment by andrew.k...@gmail.com on 10 Nov 2010 at 4:16

GoogleCodeExporter commented 9 years ago

Sure thing. I originally wrote unit tests for each method using NUnit, but I 
see AForge uses a different unit testing framework. I'll check that out tonight 
as well and see what I need to change in my unit tests to integrate them with 
AForge's testing library.

Original comment by kyle.par...@gmail.com on 10 Nov 2010 at 4:20

GoogleCodeExporter commented 9 years ago

Sounds like interface implementations are the decision.

With that decision in mind: I don't see a reason for someone to inherit from an 
IDifference or ISimilarity implementation. If you don't see one either, I'd 
suggest making all the implementing classes sealed. Sealing gives the compiler 
more opportunities to make the calls non-virtual, and therefore faster.

Original comment by dales...@gmail.com on 11 Nov 2010 at 7:28

GoogleCodeExporter commented 9 years ago

Sounds good. So we have interfaces and any class implementing them is marked as 
sealed.

Original comment by andrew.k...@gmail.com on 11 Nov 2010 at 8:00

GoogleCodeExporter commented 9 years ago

Okay I believe I have all the changes we discussed made. I ran Sandcastle and 
verified that the documentation does look right for all the classes. I still 
haven't gotten back to the unit tests and setting up Gallio/MbUnit, but I 
should have unit tests added in a post over the weekend. 

Here is a summary of the changes made. Let me know if I missed anything. 
-Cosine Distance now implements IDistance
-Replaced all instances of Math.Pow(x,2) with x*x
-Removed dead variables
-Cleaned up for-loops to avoid accessing length property on each iteration
-Removed Custom Distance class
-Made recommended comment changes for all classes (e.g. fixed/modified URLs, 
moved <code> sections, corrected summaries, added exception tags, etc.) 
-Changed dimensionality exception message. 
-Sealed all distance and similarity classes

Original comment by kyle.par...@gmail.com on 12 Nov 2010 at 1:09

Attachments:

DistanceAndSimilarityClasses.zip

GoogleCodeExporter commented 9 years ago

Committed in revision 1340. However, I found few issues which need to be 
resolved. See my comments below ....

Changes:

1)  Changed  Hamming distance to return number of difference instead of number 
of difference divided by length. As we can seen from algorithm description 
there is no mapping to [0, 1] range, where 1 is max difference. 
2)  Put all distances and similarities into single namespace – Metrics. Just 
to avoid having too many namespaces with few classes.
3)  Both Cosine and Euclidean similarities were not implementing interface, so 
put it there.
4)  Removing names from source codes. As it was mentioned in one of the 
Google’s  video casts related to open source projects, let’s avoid claiming 
ownership for particular file. I started to remove my name also from those 
files, which are edited recently. All the names are listed in Contributors 
page, so let’s keep code free of them. Hope you are fine with this.
5)  Some minor changes here and there.

Issues to resolve:
1)  Current implementation of Cosine Distance seems to implement Cosine 
Similarity. There is no difference between its implementation and the formula 
given on WIKI for SIMILARITY.
2)  In Jaccard Distance, why do we ignore the case, when both p[x] and q[x] 
values are equal to 0? What is wrong with 0? As a result of this, suppose we 
have two vectors, which have zeros. In this case both intersection and union 
will be 0 and their division will give NaN ... By the way, if we start 
processing zero, which should be done in my mind, then this distance metric 
becomes the same Hamming (only mapped [0, 1], since I removed mapping from 
Happing). Any point in it then?

Comments:
1)  Both Cosine Distance and Cosine Similarity refer to the same page on WiKi, 
which tells how to calculate Similarity, there is nothing said about 
Difference. So something should be said about the way you calculate difference.
2)  Both Euclidean Distance and Euclidean Similarity refer to the same page on 
Wiki, which tells only about distance. So need clarification about your way of 
calculating similarity.

Original comment by andrew.k...@gmail.com on 12 Nov 2010 at 1:40

Added labels: Milestone-Coming, Project-Math

GoogleCodeExporter commented 9 years ago

Since the code is in SVN now, the next update would be better to have as a 
patch. Tests may come as new files.

Original comment by andrew.k...@gmail.com on 12 Nov 2010 at 1:41

GoogleCodeExporter commented 9 years ago

For both items in the "Comments" section, and item #1 in the "Issues to 
Resolve": In both Cosine and Euclidean distance the similarity is just 1 minus 
the distance since we want a higher similarity score for closer distances. I'd 
be happy to add comments stating that if it is needed. 

For #2 in "Issues to resolve": I ignored the zeros because Matlab's pdist 
function calculates the distance that way (so it was easier for me to test). I 
believe Wikipedia also shows that method for the distance when calculating for 
binary vectors. However, we're not neccessarily dealing with binary vectors, 
and there's no need to conform to Matlab's implementation, so I could go either 
way. If we keep it as is, I think a check to make sure divide by zero isn't 
possible should be added. 

http://www.mathworks.com/help/toolbox/stats/pdist.html
http://en.wikipedia.org/wiki/Jaccard_index

Original comment by kyle.par...@gmail.com on 3 Dec 2010 at 2:21

GoogleCodeExporter commented 9 years ago

> In both Cosine and Euclidean distance the similarity is just 1 minus the 
distance
> since we want a higher similarity score for closer distances. I'd be happy to 
add
> comments stating that if it is needed. 
Yes, but I still think the code is wrong for Cosine distance and similarity. 
Here is a simple test:

AForge.Math.Metrics.CosineDistance distance = new 
AForge.Math.Metrics.CosineDistance( );
AForge.Math.Metrics.CosineSimilarity similarity = new 
AForge.Math.Metrics.CosineSimilarity( );

double[] p = { 1, 1, 1 };
double[] q = { 1, 1, 1 };

double d = distance.GetDistance( p, q );
double s = similarity.GetSimilarityScore( p, q );

This code will give you distance equal to 1, but similarity equal to 0. This is 
what I told before - you've messed them. It should be exactly the opposite.

Original comment by andrew.k...@gmail.com on 3 Dec 2010 at 2:36

GoogleCodeExporter commented 9 years ago

Oh, yep. You're absolutely right. I'll add a patch for that and the jaccard 
distance over the weekend.

Original comment by kyle.par...@gmail.com on 3 Dec 2010 at 2:48

GoogleCodeExporter commented 9 years ago

Here is the patch for Cosine Similarity/Distance.

Original comment by kyle.par...@gmail.com on 4 Dec 2010 at 3:14

Attachments:

Cosine_1340.patch

GoogleCodeExporter commented 9 years ago

Here is the patch for Jaccard distance. I maintained the 0 comparison, and 
added a check to make sure we can never divide by zero. Regarding your earlier 
comment: unless we go back to the original version of Hamming and change 
Jaccard to include 0, then yes, I think we should keep Jaccard.

Original comment by kyle.par...@gmail.com on 5 Dec 2010 at 11:02

Attachments:

Jaccard_1340.patch

GoogleCodeExporter commented 9 years ago

Committed in revision 1356.

Will any tests come?

Original comment by andrew.k...@gmail.com on 6 Dec 2010 at 2:37

GoogleCodeExporter commented 9 years ago

You must have read my mind because I was just talking about those with a 
co-worker a few minutes ago. Yes, I should have unit tests for you tonight or 
tomorrow night at the latest.

Original comment by kyle.par...@gmail.com on 6 Dec 2010 at 2:48

GoogleCodeExporter commented 9 years ago

Attached are unit tests for each distance and similarity metric. Also, I've 
attached a patch file addressing the same "divide by zero" scenario in Pearson 
and Cosine similarity that we addressed in Jaccard distance.

Original comment by kyle.par...@gmail.com on 8 Dec 2010 at 2:31

Attachments:

GoogleCodeExporter commented 9 years ago

Committed in revision 1357. Seems like this is it about this ticket ...

Original comment by andrew.k...@gmail.com on 13 Dec 2010 at 3:19

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Original comment by andrew.k...@gmail.com on 12 Jan 2011 at 11:44

Changed state: Released
Removed labels: Milestone-Coming

suicao / aforge

Distance and Similarity metrics #169