mhahsler / dbscan

Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms - R package
GNU General Public License v3.0

Document parameters better when benchmarking #16

Closed kno10 closed 2 years ago

kno10 commented 7 years ago

You do not give details on the indexes chosen in the comparison methods. From the runtime numbers, it appears that you did not enable an index in ELKI? Do the runtimes include JVM startup cost, R startup cost, or did you use internal runtime measurements (e.g. -time)? https://rdrr.io/cran/dbscan/f/inst/doc/dbscan.pdf

I suggest making separate benchmarks with and without indexing, and experimenting with different indexes such as the cover tree, which performed quite well in other studies. Benchmarking an implementation without an index against one with an index is obviously unfair. For OPTICS, I suggest using a small epsilon whenever an index is used, for obvious reasons.
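
A rough sketch of such a with/without-index comparison can be made on the R side as well; the sketch below assumes that dbscan() forwards a search argument ("kdtree" vs. "linear") to its fixed-radius neighbor search, as documented for frNN(), and the data and parameter values are purely illustrative:

```r
## Hedged sketch: timing the package's kd-tree-backed DBSCAN against a
## linear-scan (no index) run on the same data. Parameters are illustrative.
library(dbscan)

set.seed(42)
x <- matrix(runif(2 * 10000), ncol = 2)   # 10,000 synthetic 2-d points

t_kdtree <- system.time(dbscan(x, eps = 0.02, minPts = 5, search = "kdtree"))
t_linear <- system.time(dbscan(x, eps = 0.02, minPts = 5, search = "linear"))

rbind(kdtree = t_kdtree, linear = t_linear)   # compare elapsed times
```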

Furthermore, benchmarks with such tiny data sets (<=8000 instances) are likely not very informative. Startup costs such as the JVM hotspot compilation will contribute substantially. Here is a slightly larger (but still small!) real data set with 13k instances: https://cs.joensuu.fi/sipu/datasets/MopsiLocationsUntil2012-Finland.txt and with appropriate parameters it should work with both DBSCAN and OPTICS. For more meaningful results, I suggest using at minimum 30,000-50,000 instances of real data, e.g. coordinates from tweets (supposedly not shareable) or Flickr images (could be shareable?). With indexes, the data size should probably go up to a million points.

Also, the R dbscan package documentation should be more upfront about performance depending strongly on the use of Euclidean distance because of the kd-tree. The ANN library is great, but this is a substantial limitation. For a tweet-locations example, Haversine distance is much more appropriate.
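
For reference, a minimal sketch of one way to handle this use case with the package today: precompute Haversine distances and pass a dist object to dbscan(), which bypasses the kd-tree. This assumes the geosphere package for the distance computation; the coordinates and the eps value (in meters) are illustrative only.

```r
## Hedged sketch: DBSCAN on geo-coordinates via a precomputed Haversine
## distance matrix, assuming the geosphere package.
library(dbscan)
library(geosphere)

geo <- data.frame(lon = c(24.7, 24.8, 25.0, 28.2),
                  lat = c(60.2, 60.2, 60.3, 61.1))

d <- as.dist(distm(geo[, c("lon", "lat")], fun = distHaversine))  # meters
cl <- dbscan(d, eps = 20000, minPts = 2)   # 20 km radius
cl$cluster                                 # the distant point should be noise
```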

Sorry that I have not yet prepared a technical report on the filtering step. It has some reasonable theory behind it, but it is too incremental for a major conference to publish, so only a draft report exists.

Both will be very valuable for users, and therefore belong in the manual.

peekxc commented 7 years ago

Erich,

Thanks for the input. I will attempt to address some of these piecewise; however, likely not all within this comment.

You do not give details on the indexes chosen in the comparison methods. From the runtime numbers, it appears that you did not enable an index in ELKI?

  • No index was specified; it was left at the default. I assumed a proper index, such as a regular kd-tree, was chosen using some heuristic computed quickly on the data set, or set preemptively per algorithm. This likely matches the assumption of someone who is new to ELKI. Does ELKI, by default with no other settings, precompute the O(n^2) distances for something like DBSCAN or OPTICS? Looking at actual timings of other software platforms that do precompute the distance matrix (FPC), this does not appear to be the case. I do agree that these details should perhaps be better explained. I appreciate the input.

Do the runtimes include JVM startup cost, R startup cost, or did you use internal runtime measurements (e.g. -time)?

  • The benchmarks used the "runtime" reported in ELKI's output with the -time flag. As I understand it, this is the time from the start of the algorithm to its end, not including any JVM preprocessing steps, which I believe is fair.

Furthermore, benchmarks with such tiny data sets (<=8000 instances)

  • What counts as a tiny data set is quite relative; there are many data sets that fall far below 8000 data points. Every point on the results graphs represents the average of 15 replicated timings (the setup is sketched below). Collectively, these took multiple days to prepare, execute, aggregate, and summarize into a graph. Also, recall that not all of the software uses spatial indexing structures. Approaching the ~30k mark, on the platforms without spatial indexing, which are O(n^2), each benchmark run starts requiring billions of distance computations. As stated in the paper, these benchmarks were performed on commodity-grade hardware, and that is simply impractical.
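
A minimal sketch of the kind of replicated-timing setup described above; the synthetic data, sizes, eps value, and replication count are illustrative and not those of the paper:

```r
## Hedged sketch: average elapsed time of repeated DBSCAN runs over
## several data set sizes, using only the dbscan package and base R.
library(dbscan)

sizes <- c(1000, 2000, 4000, 8000)
reps  <- 15

timings <- sapply(sizes, function(n) {
  x <- matrix(runif(2 * n), ncol = 2)   # n uniform 2-d points
  mean(replicate(reps,
       system.time(dbscan(x, eps = 0.05, minPts = 5))["elapsed"]))
})

data.frame(n = sizes, mean_elapsed_sec = timings)
```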

... are likely not very informative.

  • I think we would both agree that benchmarks deemed "not informative" are categorically so when there is evidence or reason to believe that random chance played a significant role in the results. As stated above, each point in the graph is the aggregate of 15 replicated runs. Furthermore, many of the benchmarks, across all software platforms tested, were fairly monotonic with respect to data set size and dimensionality. What I mean is: given a data set D1 of size n and a data set D2 of size m with m > n, generally speaking, most if not all of the benchmarks reported larger evaluation times for D2 than for D1. I would think that, by definition, this supports the notion that such benchmarks are informative.

Sorry that I have not yet prepared a technical report on the filtering step. It has some reasonable theory behind it, but it is too incremental for a major conference to publish, so only a draft report exists.

Let me start off by saying that the below reflects my own opinion, and not necessarily the opinion of the paper.

other distances than Euclidean can be used, but will be a lot slower (in this package). Demonstrate how to use Haversine distance for geo coordinates, because this is a very important use case.

  • I agree that other distances represent a potentially important use case and should be mentioned, but adding other metrics/distances to the current benchmark seems beyond the scope of the paper.

Overall, once again, I appreciate that you took the time to read over the manual and give feedback.

kno10 commented 7 years ago
peekxc commented 7 years ago

ELKI does indeed not automatically add an index, because depending on the data sets and parameters, indexes may also reduce performance (for example for very high dimensional data). Because we have many different choices, and there is little research on when to use which index, we do not take this responsibility away from the user; everyone just assumes there would be a good default but there probably isn't.

  • This is surprising to me.

I suggest making separate benchmarks with and without indexing. As you didn't use indexes yet, please also include ELKI with the cover tree index.

  • I will test against other indexing schemes in ELKI such as the cover tree.

We used 11 runs each, and the published results are about 160 days aggregated (on commodity hardware).

  • Yes, that is a very comprehensive result. However, your entire paper is about runtime evaluation. This paper is a brief look at how to use the dbscan package and why it is useful for the R community. Not only do I not have the time for such an endeavor, it is also far beyond the scope of this paper.

ELKI doesn't compute a distance matrix either, which would usually need too much memory. By default, it simply will do a linear scan for every query; that computes every distance twice, but needs only O(n) memory. If you want to use a distance matrix, you can add such an "index", too. Which is a good idea if you plan on e.g. evaluating with Silhouette afterwards, which then can reuse the same distance matrix automatically.

  • Sure. I shouldn't have used the word "matrix". dbscan, like dist, does not by default compute the full distance matrix for its naive implementations either, but only the linearized lower triangle, of course (illustrated below). Computing all n choose 2 pairwise distances still falls into the O(n^2) runtime complexity tier.
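
A small base-R illustration of that point: dist() stores only the n*(n-1)/2 lower-triangle entries, but the number of pairwise distances, and hence the work and memory, still grows quadratically. The size used here is illustrative.

```r
## Hedged illustration: the dist object holds choose(n, 2) entries,
## roughly 8 bytes each.
n <- 8000
x <- matrix(runif(2 * n), ncol = 2)

d <- dist(x)                          # lower triangle, stored as a flat vector
length(d)                             # 31,996,000 = choose(8000, 2)
choose(n, 2)
print(object.size(d), units = "MB")   # on the order of a few hundred MB
```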

Well, you don't need experiments that merely show that runtime is increasing with data set size, do you?

  • You suggested the benchmarks performed were essentially moot because they were "likely not very informative." Like you said, the variance of the benchmarks is small, and thus the benchmarks are informative. I admit that introducing details regarding the indexing used would be useful, but you were referring to the size of the data sets being small and thus the benchmarks being insignificant, implying that the results may be due to random chance. It is true that the benchmarks should not only show that runtime increases with data set size, but it is a good sign when they do, because it indicates that the results were not due to chance. Dismissing small data set sizes as insignificant is a slippery slope: there may be several cases where one would like to run an algorithm, say DBSCAN, very many times, e.g. for parameter-tuning purposes (sketched below). And as Grace Hopper would say, "Mind your nanoseconds!"
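
A minimal sketch of such a parameter-tuning scenario with the dbscan package: the same data is clustered repeatedly over a grid of eps values, in addition to the usual kNN-distance heuristic, so per-run overhead adds up. The data, grid, and parameters are illustrative.

```r
## Hedged sketch: repeated DBSCAN runs for eps tuning.
library(dbscan)

data(iris)
x <- as.matrix(iris[, 1:4])

kNNdistplot(x, k = 4)        # knee of the curve suggests a reasonable eps

eps_grid <- seq(0.2, 1.0, by = 0.1)
n_clusters <- sapply(eps_grid, function(e)
  max(dbscan(x, eps = e, minPts = 5)$cluster))   # 0 means all noise

data.frame(eps = eps_grid, clusters = n_clusters)
```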

Your benchmark results clearly won't transfer to other distances; that is worth noting, isn't it?

  • It is stated up front, in nearly every description or documentation snippet of the package, that dbscan relies on the ANN library, which is well known for supporting only Minkowski metrics. It is worth noting that the performance may not carry over to non-Euclidean cases, sure.

but I'd like people to understand what a difference an index vs. no index can make, and because of this use indexes more often (which is largely why the dbscan package exists: to be able to use ANN for data mining in R, isn't it?)

Actually, no, the purpose of the dbscan package is to provide the R platform with relatively efficient and correct algorithms (and related extensions) within the DBSCAN family of algorithms. There are other packages that aim to provide specialized indexing structures for neighbor acceleration.
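
For completeness, the package does also expose its fixed-radius and k-nearest-neighbor searches directly via kNN() and frNN(); a small sketch with illustrative data and parameters:

```r
## Hedged sketch: using the package's neighbor-search functions directly.
library(dbscan)

x <- as.matrix(iris[, 1:4])

nn <- kNN(x, k = 5)           # 5 nearest neighbors per point
head(nn$id)                   # neighbor indices
head(nn$dist)                 # corresponding distances

fr <- frNN(x, eps = 0.5)      # all neighbors within radius 0.5
fr$id[[1]]                    # neighbors of the first point
```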

kno10 commented 7 years ago

Yes, in some situations multiple runs may be necessary. But for explorative methods like clustering, this is much less important than for supervised learning, where you can automatically evaluate the results to perform parameter tuning. In unsupervised learning this will usually not be possible, so this shouldn't be your primary usage scenario; and if it is the (unexpected!) scenario, it should be made explicit.

Either way, in these cases the results for JVM implementations will still be misleading, because for subsequent runs the JVM would already be warmed up, and likely much faster when run multiple times in the same JVM. A similar effect may happen with other JIT-compiled languages. For example, on the t4.8k data set on my laptop, if I set ELKI to run DBSCAN 5 times, the first iteration takes 380 ms and subsequent runs only 260 ms. At runtimes this short, the hotspot optimization does make a measurable difference. With a large enough data set, where the runtime is several seconds, this usually is much less of an issue.

You can see a supposedly similar effect with Rclusterpp in our single-link benchmark. Until about 1000 instances it is one of the slowest because of about 1 second of not-accounted-for startup cost (we try to account for startup, library loading, and file IO cost, but this did not work in all cases); for larger data sets it is the fastest. I wouldn't be surprised if this is a one-time penalty. The results at >10^4 instances (resp. >10 seconds) appear to be much more reliable; for the smaller data it may be necessary to use microbenchmarks because of these effects.
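
On the R side, the microbenchmarking suggestion for very small inputs could look roughly like the sketch below; it assumes the microbenchmark package, and the data and parameters are illustrative.

```r
## Hedged sketch: repeated timing of DBSCAN on a small data set, where
## per-call overhead dominates and single timings are unreliable.
library(dbscan)
library(microbenchmark)

x <- matrix(runif(2 * 1000), ncol = 2)   # small synthetic data set

microbenchmark(
  dbscan(x, eps = 0.05, minPts = 5),
  times = 50
)
```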

kno10 commented 1 year ago

ELKI 0.8.0 will now auto-create indexes using some simple heuristics, so it would be appropriate to update the figures with new results. Thank you.

mhahsler commented 1 year ago

Hi, the comparison was performed with ELKI version 0.7 and dbscan 0.9-8 (versions are stated in the document) and was published in JSS in 2019. Both programs have been updated since then. Please perform benchmarks with the new versions and publish them if necessary.

kno10 commented 1 year ago

We already published fairer benchmark results in 2017 (in particular, Figure 3, which shows the change across versions): https://link.springer.com/article/10.1007/s10115-016-1004-2 (of course not yet including ELKI 0.8.0 back then), but the paper clearly emphasizes that performance can change with new versions and depends on indexing as well as parameterization. Nevertheless, I see people believe that the R version is "much faster" than sklearn or ELKI, likely because of this package's documentation: https://github.com/mhahsler/dbscan/blob/023add8ec2123a1205baff61182cb23653dbb42a/vignettes/dbscan.Rnw#L911-L920 which could use an update.

mhahsler commented 1 year ago

I see. I plan to change the vignette as follows:

In 2019, we performed a comparison between \pkg{dbscan} 0.9-8, \pkg{fpc} 2.1-10, ELKI version 0.7, PyClustering 0.6.6, SPMF v2.10, WEKA 3.8.0, and SciKit-Learn 0.17.1 on a MacBook Pro equipped with a 2.5 GHz Intel Core i7 processor, running OS X El Capitan 10.11.6. Note that newer versions of all mentioned software packages have been released since then. Changes in data structures and added optimizations may have resulted in significant runtime improvements for the different packages.

kno10 commented 1 year ago

Definitely an improvement, thank you. I would also suggest noting that dimensionality, distance function, data set size, and other data characteristics will have a substantial impact on runtime performance. That is often overlooked by beginners in data analysis, for example when they try to cluster high-dimensional word embeddings.
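
A rough sketch of how the dimensionality effect could be demonstrated: time the same number of points in low and high dimensions. The synthetic data and parameters are illustrative, and actual timings will vary by machine and version.

```r
## Hedged illustration: with a kd-tree, runtime typically grows quickly
## with dimensionality even for a fixed number of points.
library(dbscan)

n <- 10000
x2  <- matrix(runif(2  * n), ncol = 2)
x32 <- matrix(runif(32 * n), ncol = 32)

system.time(dbscan(x2,  eps = 0.02, minPts = 5))
system.time(dbscan(x32, eps = 1.0,  minPts = 5))   # typically much slower
```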

mhahsler commented 1 year ago

I have added the additional info. It is now in the development version on GitHub and will be part of the next release.