Closed GoogleCodeExporter closed 8 years ago
Disclaimer: I am not an AForge developer. I do not speak for the AForge
developers. They are perfectly capable of speaking for themselves.
I took a look at that code, and pretty quickly noticed two things:
1) Looking at «throw new ArgumentException("Input vectors must be the same
length.");», I misinterpreted "length" as "magnitude".
2) The compiler would not complain if CustomDistance was instantiated as «new
CustomDistance( () => {} )».
I was going to list these and go on my way, but I kept looking, and found more
things I didn't like. (I have been accused of being a perfectionist.)
Eventually, I gave up on just listing them and came up with the attached code.
The functionality is the same; I saw no issues there, but the implementation is
rather different.
changes.txt contains a list of everything I changed, and the thought processes
behind most of those changes. It also contains a list of the ways that these
are breaking changes.
Obviously, I like my version better -- that's why I made it -- but I could have
missed a good reason for not making some changes. I, at least, would like to
know what you think of this version.
Original comment by dales...@gmail.com
on 4 Nov 2010 at 4:24
Attachments:
I like it. I think the XML comment and exception changes/additions all add
clarity, and I think your re-structuring is more clean as well.
The Interface vs Class approach is something I'm still debating on, and you
probably have some thoughts on as well. I went back and forth on which approach
to use, and ultimately I settled on classes as a "just in case" for some
algorithm someone would want to implement down the road. I thought there might
eventually be cases where the algorithm was complex enough to benefit from some
internal state being added, or the user would benefit from exposing more data
related to the algorithm than just the distance/similarity score. I don't know
of any cases like this off the top of my head, so if this isn't a likely
scenario, then I favor your static method approach.
Thanks for the feedback, and the effort you put into your changes!
Original comment by kyle.par...@gmail.com
on 4 Nov 2010 at 5:03
@dalestan - about comments
1) Completely disagree about comments #1 and #5 (internal changes). There is
nothing about N-dimensional arrays in the code. They are all 1-dimensional
arrays of certain length. See WIKI:
http://en.wikipedia.org/wiki/Array_data_structure#One-dimensional_arrays
2 dimensional array means matrix, not 2 elements. Number of dimensions equals
to number of indices you need to provide to access an element.
2) Agree about delegate for custom distance – makes it safer. However, as a
user, I would use this class at all. It is a known fact DynamicInvoke() are
much slower than calling interface method. Maybe someone may want to have such
class for a quick prototyping. But if I want to use this for every pixel doing
image processing (or maybe even video processing), then no way I am going to
use dynamic invoke.
3) I will disagree about static methods. It is the same like someone told me 5
years ago: “Don’t need to create IFilter interface for image processing
routine. Just create a static function.” Let’s imagine we have a class,
which expects distance function to be set by user? Interfaces solve it and it
fits into Algorithm design pattern. But what about static functions?
OK, I see you solved this with delegate. What about performance in this case?
4) Agree about comments and coding style.
@kyle.parrigan
1) CosineDistance still does not implement interface.
2) I would better use x*x rather than Math.Pow(x, 2) if you care about
performance.
3) Performance optimization:
Instead of writing this:
for (int x = 0; x < p.Length; x++)
I would better write this to be sure we don’t call property on every
iteration:
for (int x = 0, len = p.Length; x < len; x++)
It is not important for small array, but nice to keep in mind for huge arrays.
4) Dead variables?
int n = p.Length;
5) Documentation
/// <returns>double - Cosine Similarity</returns>
Everybody will know that output is double, since it is documented anyway.
Better write it in more human readable manner: “Returns cosine similarity
between two specified vectors.”
6) Documentation
/// <remarks>This class represents the Cosine Similarity metric
/// http://en.wikipedia.org/wiki/Cosine_similarity
/// </remarks>
/// <para>Sample usage:</para>
/// <code>
/// //instantiate new similarity class
/// CosineSimilarity sim = new CosineSimilarity();
/// //build two vectors for inputs
/// double[] p = new double[] { 2.5, 3.5, 3.0, 3.5, 2.5, 3.0 };
/// double[] q = new double[] { 3.0, 3.5, 1.5, 5.0, 3.5, 3.0 };
/// //call GetSimilarityScore() to return a double similarity score
/// double similarityScore = sim.GetSimilarityScore(p, q);
/// </code>
<code> tags are always inside of <remarks>. Check other files in the framework.
7) It is nice to document possible exceptions (which are thrown by your code)
using <exception> tag (see framework sources).
In general about docs – see how it is done in the framework. Also nice to
have SandCastle and run it to see what is the output.
Original comment by andrew.k...@gmail.com
on 9 Nov 2010 at 4:22
> 1) Completely disagree about comments #1 and #5 (internal changes).
> There is nothing about N-dimensional arrays in the code.
> <...>
> See WIKI:
> 2 dimensional array means matrix, not 2 elements.
Please note that I said "N-dimensional *vector*", not "N-dimensional *array*".
A vector in N dimensions is not the same thing as an array that has N
dimensions. (-2, 3, 6) might be interpreted as a three dimensional vector of
length 7, or a point in 3-space a distance of 7 from the origin, but it would
never be called a "one-dimensional vector". You and I might choose to represent
it as a one-dimensional array of length 3, but this representation neither
causes the vector to become one-dimensional, nor causes the vector's length to
become 3.
> 3) It is the same like someone told me 5 years ago: “Don’t need to create
> IFilter interface for image processing routine. Just create a static
function.”
This, I believe, is a different case. IFilters store internal data, and
BaseFilter and friends provide significant benefits to filter implementers
(e.g. coalescing all .Apply overloads into one ProcessFilter).
ISimilarity and IDistance implementations (except CustomDistance) require no
internal data and there are no abstract classes to hide (nonexistent) mundane
portions of the implementation.
> ... OK, I see you solved this with delegate.
> What about performance in this case?
I know very little about performance in .NET. What I know about the x86
processor architecture, however, leads me to believe that direct invocation of
a delegate should take about the same amount of time as invocation of an
interface function (call-through-pointer in both cases). Likewise, it does not
surprise me that DynamicInvoke is slower, because it has to do lots of work
before it can perform the call: argument count verification, casting (and its
verification), unboxing (if necessary), and so forth.
Original comment by dales...@gmail.com
on 9 Nov 2010 at 10:21
> Please note that I said "N-dimensional *vector*", not "N-dimensional *array*".
> A vector in N dimensions is not the same thing as an array that has N
dimensions.
> (-2, 3, 6) might be interpreted as a three dimensional vector of length 7, or
a
> point in 3-space a distance of 7 from the origin, but it would never be called
> a "one-dimensional vector".
Aah, ok agree. Just messed between vector and array things. I am just afraid
that this may cause confusion to someone else without extra clarification.
> ISimilarity and IDistance implementations (except CustomDistance) require
> no internal data
Yes, no internal data. However interfaces and abstract classes are not only
about "hiding (nonexistent) mundane portions of the implementation". As I
already mentioned, what if I have a class which may work with any distance
function given? In this case passing interface seems more elegant than a
delegate (at least for me).
Anyway, here are some facts:
1) When to Use Delegates Instead of Interfaces (C# Programming Guide)
http://msdn.microsoft.com/en-us/library/ms173173(VS.80).aspx
From this page it is still arguable what to use - delegate seems to be fine
also.
2) Delegates vs. Interfaces in .NET
http://www.informit.com/articles/article.aspx?p=332881
Here we already can find this: "Interface calls are faster than delegate calls.
An interface reference is a reference to an instance of an object which
implements the interface. An interface call is not that different from an
ordinary virtual call to an method. A delegate reference, on the other hand, is
a reference to a list of method pointers. While invoking a delegate looks like
you're making an indirect call through a method pointer, it's actually a
subroutine call that walks the list of method pointers. The overhead involved
in making the call and walking the list means that delegate invocation can be
two or three times slower than calling a method through an interface reference."
Another thing about interfaces … I can create a class something like
CosineMetric, which implement both IDistance and ISimilarity.
One more. Although it may not be the case here, but with interfaces it is
really easy to find all classes at runtime implementing IDistance using
reflection. Kind of plug-ins system.
Original comment by andrew.k...@gmail.com
on 10 Nov 2010 at 10:18
@kyle.parrigan
Check all your WIKI links. Some links are broken. For example this page does
not exist:
http://en.wikipedia.org/wiki/Euclidean_similarity
Also it seems better to write:
/// <remarks>This class represents the <a href="...">Euclidean Similarity metric</a>
///
instead of:
/// <remarks>This class represents the Euclidean Similarity metric
/// http://en.wikipedia.org/wiki/Euclidean_similarity
Original comment by andrew.k...@gmail.com
on 10 Nov 2010 at 10:21
The Pending status means that update of code is expected. We may discus it here
a lot, but someone needs to take care of the code if we want to merge it into
the framework ;)
Original comment by andrew.k...@gmail.com
on 10 Nov 2010 at 10:23
I'll make the changes for the links/comments tonight. Are we settled on using
the Interface-based approach for now?
Original comment by kyle.par...@gmail.com
on 10 Nov 2010 at 2:22
> Are we settled on using the Interface-based approach for now?
You are part of the discussion, so your opinion matters as well :) I've put my
arguments and I am for interfaces.
Original comment by andrew.k...@gmail.com
on 10 Nov 2010 at 2:28
My opinion hasn't changed a whole lot since my first response. I still think
that, unless we are pretty sure all future distance and similarity methods will
be this simple, an interface-based approach is the way to go. I wasn't aware
that there was a significant performance advantage over using delegates, so
that further reinforces my opinion.
One question about the CustomDistance class. Do you think we really need/want
that functionality? The more I think about it, the more I think it would be
easy enough for a user to just add their own distance class to the project. As
it is now, I think just writing the distance algorithm would be the hardest
part, and it wouldn't really be a big deal for a user to throw it into their
own class that implements IDistance.
Original comment by kyle.par...@gmail.com
on 10 Nov 2010 at 2:44
As I already mentioned, as for me I would create a class implementing certain
interface if I want to have new distance/similarity metric. Maybe a delegate
will be 2 seconds faster to write, but I still would not keep it long term
(only for test maybe). If I need better performance for algorithm which calls
distance metric thousands or millions times, then I would go to interface.
Original comment by andrew.k...@gmail.com
on 10 Nov 2010 at 2:53
Btw, if there any unit tests available, they will be really welcome.
Original comment by andrew.k...@gmail.com
on 10 Nov 2010 at 4:16
Sure thing. I originally wrote unit tests for each method using NUnit, but I
see AForge uses a different unit testing framework. I'll check that out tonight
as well and see what I need to change in my unit tests to integrate them with
AForge's testing library.
Original comment by kyle.par...@gmail.com
on 10 Nov 2010 at 4:20
Sounds like interface implementations are the decision.
With that decision in mind: I don't see a reason for someone to inherit from an
IDifference or ISimilarity implementation. If you don't see one either, I'd
suggest making all the implementing classes sealed. Sealing gives the compiler
more opportunities to make the calls non-virtual, and therefore faster.
Original comment by dales...@gmail.com
on 11 Nov 2010 at 7:28
Sounds good. So we have interfaces and any class implementing them is marked as
sealed.
Original comment by andrew.k...@gmail.com
on 11 Nov 2010 at 8:00
Okay I believe I have all the changes we discussed made. I ran Sandcastle and
verified that the documentation does look right for all the classes. I still
haven't gotten back to the unit tests and setting up Gallio/MbUnit, but I
should have unit tests added in a post over the weekend.
Here is a summary of the changes made. Let me know if I missed anything.
-Cosine Distance now implements IDistance
-Replaced all instances of Math.Pow(x,2) with x*x
-Removed dead variables
-Cleaned up for-loops to avoid accessing length property on each iteration
-Removed Custom Distance class
-Made recommended comment changes for all classes (e.g. fixed/modified URLs,
moved <code> sections, corrected summaries, added exception tags, etc.)
-Changed dimensionality exception message.
-Sealed all distance and similarity classes
Original comment by kyle.par...@gmail.com
on 12 Nov 2010 at 1:09
Attachments:
Committed in revision 1340. However, I found few issues which need to be
resolved. See my comments below ....
Changes:
1) Changed Hamming distance to return number of difference instead of number
of difference divided by length. As we can seen from algorithm description
there is no mapping to [0, 1] range, where 1 is max difference.
2) Put all distances and similarities into single namespace – Metrics. Just
to avoid having too many namespaces with few classes.
3) Both Cosine and Euclidean similarities were not implementing interface, so
put it there.
4) Removing names from source codes. As it was mentioned in one of the
Google’s video casts related to open source projects, let’s avoid claiming
ownership for particular file. I started to remove my name also from those
files, which are edited recently. All the names are listed in Contributors
page, so let’s keep code free of them. Hope you are fine with this.
5) Some minor changes here and there.
Issues to resolve:
1) Current implementation of Cosine Distance seems to implement Cosine
Similarity. There is no difference between its implementation and the formula
given on WIKI for SIMILARITY.
2) In Jaccard Distance, why do we ignore the case, when both p[x] and q[x]
values are equal to 0? What is wrong with 0? As a result of this, suppose we
have two vectors, which have zeros. In this case both intersection and union
will be 0 and their division will give NaN ... By the way, if we start
processing zero, which should be done in my mind, then this distance metric
becomes the same Hamming (only mapped [0, 1], since I removed mapping from
Happing). Any point in it then?
Comments:
1) Both Cosine Distance and Cosine Similarity refer to the same page on WiKi,
which tells how to calculate Similarity, there is nothing said about
Difference. So something should be said about the way you calculate difference.
2) Both Euclidean Distance and Euclidean Similarity refer to the same page on
Wiki, which tells only about distance. So need clarification about your way of
calculating similarity.
Original comment by andrew.k...@gmail.com
on 12 Nov 2010 at 1:40
Since the code is in SVN now, the next update would be better to have as a
patch. Tests may come as new files.
Original comment by andrew.k...@gmail.com
on 12 Nov 2010 at 1:41
For both items in the "Comments" section, and item #1 in the "Issues to
Resolve": In both Cosine and Euclidean distance the similarity is just 1 minus
the distance since we want a higher similarity score for closer distances. I'd
be happy to add comments stating that if it is needed.
For #2 in "Issues to resolve": I ignored the zeros because Matlab's pdist
function calculates the distance that way (so it was easier for me to test). I
believe Wikipedia also shows that method for the distance when calculating for
binary vectors. However, we're not neccessarily dealing with binary vectors,
and there's no need to conform to Matlab's implementation, so I could go either
way. If we keep it as is, I think a check to make sure divide by zero isn't
possible should be added.
http://www.mathworks.com/help/toolbox/stats/pdist.html
http://en.wikipedia.org/wiki/Jaccard_index
Original comment by kyle.par...@gmail.com
on 3 Dec 2010 at 2:21
> In both Cosine and Euclidean distance the similarity is just 1 minus the
distance
> since we want a higher similarity score for closer distances. I'd be happy to
add
> comments stating that if it is needed.
Yes, but I still think the code is wrong for Cosine distance and similarity.
Here is a simple test:
AForge.Math.Metrics.CosineDistance distance = new
AForge.Math.Metrics.CosineDistance( );
AForge.Math.Metrics.CosineSimilarity similarity = new
AForge.Math.Metrics.CosineSimilarity( );
double[] p = { 1, 1, 1 };
double[] q = { 1, 1, 1 };
double d = distance.GetDistance( p, q );
double s = similarity.GetSimilarityScore( p, q );
This code will give you distance equal to 1, but similarity equal to 0. This is
what I told before - you've messed them. It should be exactly the opposite.
Original comment by andrew.k...@gmail.com
on 3 Dec 2010 at 2:36
Oh, yep. You're absolutely right. I'll add a patch for that and the jaccard
distance over the weekend.
Original comment by kyle.par...@gmail.com
on 3 Dec 2010 at 2:48
Here is the patch for Cosine Similarity/Distance.
Original comment by kyle.par...@gmail.com
on 4 Dec 2010 at 3:14
Attachments:
Here is the patch for Jaccard distance. I maintained the 0 comparison, and
added a check to make sure we can never divide by zero. Regarding your earlier
comment: unless we go back to the original version of Hamming and change
Jaccard to include 0, then yes, I think we should keep Jaccard.
Original comment by kyle.par...@gmail.com
on 5 Dec 2010 at 11:02
Attachments:
Committed in revision 1356.
Will any tests come?
Original comment by andrew.k...@gmail.com
on 6 Dec 2010 at 2:37
You must have read my mind because I was just talking about those with a
co-worker a few minutes ago. Yes, I should have unit tests for you tonight or
tomorrow night at the latest.
Original comment by kyle.par...@gmail.com
on 6 Dec 2010 at 2:48
Attached are unit tests for each distance and similarity metric. Also, I've
attached a patch file addressing the same "divide by zero" scenario in Pearson
and Cosine similarity that we addressed in Jaccard distance.
Original comment by kyle.par...@gmail.com
on 8 Dec 2010 at 2:31
Attachments:
Committed in revision 1357. Seems like this is it about this ticket ...
Original comment by andrew.k...@gmail.com
on 13 Dec 2010 at 3:19
Original comment by andrew.k...@gmail.com
on 12 Jan 2011 at 11:44
Original issue reported on code.google.com by
kyle.par...@gmail.com
on 3 Nov 2010 at 12:22Attachments: