vickumar1981 / stringdistance

A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more..
https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/index.html
Other
78 stars 15 forks source link

Benchmarking: Add benchmarks for java utils package #67

Closed vickumar1981 closed 3 years ago

vickumar1981 commented 3 years ago

Current guidance on benchmarks are in the CONTRIBUTING.MD: https://github.com/vickumar1981/stringdistance/blob/master/CONTRIBUTING.md#6-running-benchmarks

These benchmark the Scala implementations only. An open question/concern that I have had is if there is a hidden boxing cost when using the java wrapper types Double and Int, instead of the primitive alternatives.

https://dzone.com/articles/java-performance-notes-autoboxing-unboxing

The 1st step, I think, would be to setup the tantamount benchmark tests for the java classes and compare if using the primitive types int and double would be faster.

tg44 commented 3 years ago

I played with this a bit... First of all, we are not benchmarking our functions right now. We benchmarking the Array.fill + our function. I started to dig deeper and deeper, and my conclusion is that we can't measure out the autoboxing easily, generalization is not in our side... (We need to wrap/generalize the return values (the whole point is not doing this), we need to wrap/generalize the input values (which leads to a false measure).) I'm still digging deeper and trying to come up with a "good" solution...

BTW autoboxing is usually not a performance issue. I didn't dig deep into the codebase yet, but I would be surprised if the possibly two boxing (we box it, but the user needs an unboxed value) had any measurable performance diff.

For example watch the .toSeq method on Arrays;

 @`inline` final def toSeq: immutable.Seq[A] = toIndexedSeq

 def toIndexedSeq: immutable.IndexedSeq[A] =
    immutable.ArraySeq.unsafeWrapArray(Array.copyOf(xs, xs.length))

For convenience we copy the whole array. A full array copy is at least as much time as boxing, but probably much more.

The post runs a 2x10^9 cycle, and measures that a boxing+add is about 6 times pricier than a single add, which is still in an 5ns range.

I would not consider the last boxing as a threat.

(And started to fix the benchmarks to make them more banchmarky.)

vickumar1981 commented 3 years ago

@tg44 do you think we should replace this issue w/ other issues that better outline an approach towards benchmarking? feel free to open up any issues as you see fit, btw.

tg44 commented 3 years ago

I opend up a Pr which fixes the issues I found, although I like to hate benchmarks 😛

The problem with "just in case" benchmarking that it is easy to mess up (like we measured the array.fill here), and it is not really compareable (bcs we have nothing to compare to and we has no good purpose to measure. We just generate data and not information).

The problem with comparable/comperative benchmarking is that we sometimes measure different things that we really want to or think to measure (a good example; https://gist.github.com/djspiewak/f4cfc08e0827088f17032e0e9099d292).

I think the only reason we should benchmark if we found a potential bottleneck, and want to prove it.

I think the current benchmarking in my PR is good enough to make some assumptions about the algo speeds (if there is a 5 times slower than an other we could show with this), but imho in general it is not worth to add much more effort to it.

vickumar1981 commented 3 years ago

:+1: closing this issue out.

addressed by: