Erich,
Thanks for the input. I will try to address some of these points piece by piece, though likely not all within this comment.
You do not give details on the indexes chosen in the comparison methods. From the runtime numbers, it appears that you did not enable an index in ELKI?
- No index was specified; it was left to the default. I assumed that a suitable index, such as a regular k-d tree, was chosen via some heuristic computed quickly on the data set, or set preemptively per algorithm. This likely matches the expectations of someone who is new to ELKI. Does ELKI, by default with no other settings, precompute the O(n^2) distances for something like DBSCAN or OPTICS? Looking at actual timings of other software platforms that do precompute the distance matrix (FPC), this does not appear to be the case. I agree that these details should be explained more clearly, and I appreciate the input.
Do the runtimes include JVM startup cost, R startup cost, or did you use internal runtime measurements (e.g. -time).
- The benchmarks used the "runtime" value reported in ELKI's output when run with the -time flag. As I understand it, this measures the time from the start of the algorithm to its end, excluding any preprocessing or JVM startup steps, which I believe is fair.
Furthermore, benchmarks with such tiny data sets (<=8000 instances)
- The size of the data sets tested is relative. Many real data sets fall well below 8000 points. Every point on the results graphs represents the average of 15 replicated timings; collectively, these took multiple days to prepare, execute, aggregate, and summarize into a graph. Also, recall that not all of the software uses spatial indexing structures. Toward the ~30k mark, each benchmark run on the O(n^2) platforms without spatial indexing starts requiring billions of distance computations. As stated in the paper, these benchmarks were performed on commodity-grade hardware, and that is simply impractical.
... are likely not very informative.
- I think we would both agree that benchmarks are deemed "not informative" when there is evidence or reason to believe that random chance played a significant role in the results. As stated above, each point in the graph is the aggregate of 15 replicated runs. Furthermore, many of the benchmarks, across all software platforms tested, were fairly monotonic with respect to the size/dimensionality compared. What I mean by that is: given a data set D1 of size n and a data set D2 of size m, if m > n, generally speaking, most if not all of the benchmarks reported larger evaluation times for D2 relative to their performance for D1. I would think that, by definition, this supports the notion that such benchmarks are informative.
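For reference, the replicated timings on the R side can be reproduced along these lines. This is only a minimal sketch, not the paper's actual harness; the synthetic data set, eps, and minPts below are placeholders.

```r
## Minimal sketch (not the paper's actual harness): mean of 15 replicated
## dbscan timings on a synthetic data set; data and parameters are placeholders.
library(dbscan)
library(microbenchmark)

set.seed(42)
x <- matrix(rnorm(8000 * 10), ncol = 10)   # 8000 points, 10 dimensions

mb <- microbenchmark(
  dbscan(x, eps = 0.5, minPts = 5),
  times = 15
)
summary(mb, unit = "ms")                   # mean/median runtime in milliseconds
```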
Sorry that I have not yet prepared a technical report on the filtering step. It has some reasonable theory behind it, but it is too incremental for a major conference to publish, so there only exists a draft report.
Let me start off by saying that the below reflects my own opinion, and not necessarily the opinion of the paper.
other distances than Euclidean can be used, but will be a lot slower (in this package). Demonstrate how to use Haversine distance for geo coordinates, because this is a very important use case.
- I agree that other distances represent a potentially important use case and should be mentioned, but adding other metrics/distances to the current benchmark seems beyond the scope of the paper.
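For readers who need a non-Euclidean distance anyway, one workaround (not part of the paper's benchmark) is to precompute the distances and hand dbscan a dist object. The geosphere package and the example coordinates below are assumptions for illustration only.

```r
## Hedged sketch: DBSCAN with Haversine distance via a precomputed dist object.
## geosphere and the example lon/lat values are illustrative assumptions.
library(dbscan)
library(geosphere)

coords <- data.frame(lon = c(24.94, 25.47, 27.68), lat = c(60.17, 65.01, 62.89))

d_m <- distm(coords, fun = distHaversine)              # pairwise great-circle distances in meters
res <- dbscan(as.dist(d_m), eps = 250000, minPts = 2)  # eps in meters (placeholder)
res$cluster
```

Note that this route loses the ANN k-d tree acceleration and is O(n^2) in both time and memory.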
Overall, once again, I appreciate that you took the time to read over the manual and give feedback.
ELKI does indeed not automatically add an index, because depending on the data sets and parameters, indexes may also reduce performance (for example for very high dimensional data). Because we have many different choices, and there is little research on when to use which index, we do not take this responsibility away from the user; everyone just assumes there would be a good default but there probably isn't.
Furthermore, when running more than one algorithm, choosing the optimal index would require knowing the future. For example, when running an algorithm for the k=1..100 nearest neighbors, it is a good idea to precompute the k=100 nearest neighbors once as a manual indexing step, then run the algorithms against this "index". I have always wanted to add an optimizer to ELKI that, at least for command-line use, checks the query requirements of all algorithms to come and then heuristically adds an index (unless the optimizer is disabled) or e.g. precomputes with the maximum k.
ELKI doesn't compute a distance matrix either, which would usually need too much memory. By default, it simply does a linear scan for every query; that computes every distance twice, but needs only O(n) memory. If you want to use a distance matrix, you can add such an "index", too, which is a good idea if you plan on e.g. evaluating with Silhouette afterwards, which can then reuse the same distance matrix automatically.
Yes, -time is the best way of benchmarking; don't enable -verbose at the same time (because this causes additional I/O), and because the index is added beforehand, you probably should also include the index construction time, if it is included for others.
As you didn't use indexes yet, please also include ELKI with the cover tree index, which is what I currently suggest trying first based on my personal experience (for anything that satisfies the triangle inequality).
fpc is, as far as I can tell, slow because it is largely R code, not C++ code like dbscan. This means it could be more flexible wrt. e.g. other distances, but it will be a lot slower. (Well, fpc will likely be slower than dbscan for actual use, so I don't think it offers a real advantage.) Maybe I am also confusing it with the flexclust R package, which was also unexpectedly slow.
For non-indexed runs, I agree that stopping at roughly 10-20k objects is reasonable; which is just one more reason to distinguish between indexed and non-indexed experiments. Many implementations will run out of memory at ~32k anyway. But you have to admit that 8k is still a very small data set. Given the relatively low variance of the results, 15 iterations may simply be overdoing it for non-randomized algorithms. I have done similar large experiments on commodity hardware; see the "The (black) art of runtime evaluation" paper. It takes an incredible amount of CPU time, in particular with some buggy or slow implementations in your candidate set, and especially once you compare different versions like we did there. Initially, we also wanted to measure the impact of the JVM version, but that would have taken even longer. We used 11 runs each, and the published results are about 160 days aggregated (on commodity hardware).
"if m > n, generally speaking, most if not all of the benchmarks reported larger evaluation times for D2 relative to their performance for D1."
Well, you don't need experiments that merely show that runtime is increasing with data set size, do you? I doubt any user assumes runtime remains constant; and 8k may be barely enough to distinguish O(n^2) from O(n^2 log n) because of constant factors (see e.g. Figure 1 in the black art paper). Excluding "unbearably slow" implementations from the large-data experiments is reasonable; and as mentioned before, anything distance matrix based will run out of memory for large data anyway.
The Xi comment was just to let you know that I plan on writing this down to explain the formal benefits; and that I appreciate that you use it even though it is not published anywhere yet (because it makes OPTICS results much less "odd"); and that I hope that once I have published the tech report you can cite it then. The resulting theory is quite similar to the ideas underlying HDBSCAN, actually, and it suggests the result should be treated as a dendrogram tree rather than a linear "cluster order".
Your benchmark results clearly won't transfer to other distances; that is worth noting, isn't it? Even if you don't include additional experiments, you know the results no longer apply then, because of the k-d tree. If you include dbscan results with the k-d tree disabled (maybe even with both linear and dist separately), you can mention that other distances are expected to perform like this, so people are not surprised by the performance differences. I know that dbscan will outperform ELKI with Euclidean distance, because the ANN library is just very, very well optimized (I have used it myself); but I'd like people to understand what a difference an index vs. no index can make, and because of this use indexes more often (which is largely why the dbscan package exists, to be able to use ANN for data mining in R, isn't it?).
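To make the suggestion concrete, here is a minimal sketch of what such non-indexed dbscan runs could look like; the data set and parameters are illustrative placeholders, not the paper's setup.

```r
## Sketch: the same DBSCAN run with the ANN k-d tree, a linear scan, and a
## precomputed dist object; the clustering should be identical, only the runtime differs.
library(dbscan)

set.seed(1)
x <- matrix(runif(5000 * 2), ncol = 2)

res_kdtree <- dbscan(x, eps = 0.05, minPts = 5, search = "kdtree")  # default ANN k-d tree
res_linear <- dbscan(x, eps = 0.05, minPts = 5, search = "linear")  # O(n^2) linear scan
res_dist   <- dbscan(dist(x), eps = 0.05, minPts = 5)               # precomputed distances

all.equal(res_kdtree$cluster, res_linear$cluster)
```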
ELKI does indeed not automatically add an index, because depending on the data sets and parameters, indexes may also reduce performance (for example for very high dimensional data). Because we have many different choices, and there is little research on when to use which index, we do not take this responsibility away from the user; everyone just assumes there would be a good default but there probably isn't.
- This is surprising to me.
I suggest to make separate benchmarks with and without indexing. As you didn't use indexes yet, please also include ELKI with the cover tree index
- I will test against other indexing schemes in ELKI such as the cover tree.
We used 11 runs each, and the published results are about 160 days aggregated (on commodity hardware).
- Yes, this is a very comprehensive result. However, your entire paper is about runtime evaluation. This paper is a brief look at how to use the dbscan package and why the results are useful for the R community. Not only do I not have the time to approach such an endeavor, it is far beyond the scope of this paper.
ELKI doesn't compute a distance matrix either, which would usually need too much memory. By default, it simply does a linear scan for every query; that computes every distance twice, but needs only O(n) memory. If you want to use a distance matrix, you can add such an "index", too, which is a good idea if you plan on e.g. evaluating with Silhouette afterwards, which can then reuse the same distance matrix automatically.
- Sure. I shouldn't have used the word "matrix".
dbscan, like dist, does not by default compute the full distance matrix for its naive implementations either, but the linearized lower triangle, of course. Computing the n-choose-2 distances still falls into the O(n^2) runtime complexity tier.
Well, you don't need experiments that merely show that runtime is increasing with data set size, do you?
- You suggested the benchmarks performed were essentially moot because they were "likely not very informative." Like you said, the variance of the benchmarks is small, and thus the benchmarks are informative. I admit that providing details about the indexing used would be useful, but you were referring to the size of the data sets being small and the benchmarks therefore being insignificant, implying that the results may be due to random chance. It's true that the benchmarks should not only show that runtime increases with data set size, but it's a good sign when they do, because it indicates the results were not due to chance. Dismissing small data set sizes as insignificant is a slippery slope. There are several cases where one would like to run an algorithm, say DBSCAN, very many times, e.g. for parameter-tuning purposes. And as Grace Hopper would say, "Mind your nanoseconds!"
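As an illustration of that tuning scenario (a hedged sketch only; the data set, eps grid, and minPts are placeholders):

```r
## Sketch of the parameter-tuning scenario: repeated DBSCAN runs over a grid of
## eps values on a small data set. All values here are placeholders.
library(dbscan)

set.seed(7)
x <- matrix(runif(2000 * 2), ncol = 2)

kNNdistplot(x, k = 5)                        # common heuristic for choosing eps

eps_grid <- seq(0.01, 0.10, by = 0.005)
n_clusters <- vapply(eps_grid, function(e) {
  res <- dbscan(x, eps = e, minPts = 5)
  length(unique(res$cluster[res$cluster > 0]))   # clusters found, excluding noise (0)
}, integer(1))

data.frame(eps = eps_grid, clusters = n_clusters)
```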
Your benchmark results clearly won't transfer to other distances; that is worth noting, isn't it?
- It is stated up front, in nearly every description or documentation snippet of the dbscan package, that it relies on the ANN library, which is well known for supporting only Minkowski metrics. It's worth noting that the performance may not carry over to non-Euclidean cases, sure.
but I'd like people to understand what a difference an index vs. no index can make, and because of this use indexes more often (which is largely why the dbscan package exists, to be able to use ANN for data mining in R, isn't it?)
- Actually, no, the purpose of the dbscan package is to provide the R platform with relatively efficient and correct implementations of the DBSCAN family of algorithms (and related extensions). There are other packages that aim to provide specialty indexing structures for neighbor acceleration.
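That said, the ANN-backed neighbor search that the index discussion above revolves around is also exposed directly by the package. A small sketch with synthetic data and placeholder parameters:

```r
## Sketch: dbscan's ANN-backed neighbor search, the building block behind the
## indexed DBSCAN/OPTICS runs discussed above. Data and parameters are placeholders.
library(dbscan)

set.seed(3)
x <- matrix(runif(1000 * 2), ncol = 2)

nn <- kNN(x, k = 5)          # k nearest neighbors via the ANN k-d tree
fr <- frNN(x, eps = 0.05)    # fixed-radius neighbors (the DBSCAN region query)
nn$id[1, ]                   # neighbor ids of the first point
```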
Yes, in some situations multiple runs may be necessary. But for explorative methods like clustering, this is much less important than for supervised learning, where you can automatically evaluate the results to perform parameter tuning. In unsupervised learning, this will usually not be possible, so this shouldn't be your primary usage scenario; and if this is the (unexpected!) scenario, it should be made explicit. Either way, in these cases the results for JVM implementations will still be misleading, because for subsequent runs the JVM would already be warmed up, and likely much faster when run multiple times in the same JVM. A similar effect may happen with other JIT-compiled languages. For example, on the t4.8k data set on my laptop, if I set ELKI to run DBSCAN 5 times, the first iteration takes 380 ms, subsequent runs only 260 ms. At run times this short, the hotspot optimization does make a measurable difference. With a large enough data set, where the runtime is several seconds, this usually is much less of an issue. You can see a supposedly similar effect with Rclusterpp in our single-link benchmark. Until about 1000 instances it is one of the slowest because of about 1 second of not-accounted-for startup cost (we try to account for startup, library loading, and file IO cost, but this did not work for all cases); but for larger data sets it is the fastest. I wouldn't be surprised if this is a one-time penalty. The results at >10^4 instances (resp. >10 seconds) appear to be much more reliable; for the smaller data it may be necessary to use microbenchmarks because of these effects.
ELKI 0.8.0 will now auto-create indexes with some simple heuristics. So it would be appropriate if you update the figures with new results. Thank you.
Hi, the comparison was performed with ELKI version 0.7 and dbscan 0.9-8 (versions are stated in the document) and was published in JSS in 2019. Both programs have been updated since then. Please perform benchmarks with the new versions and publish them if necessary.
We already published fairer benchmark results in 2017 (in particular, Figure 3, which shows the change across versions): https://link.springer.com/article/10.1007/s10115-016-1004-2 (of course not yet including ELKI 0.8.0 back then), but the paper clearly emphasizes that performance can change with new versions and depends on indexing as well as parameterization. Nevertheless, I see people believe that the R version would be "much faster" than sklearn or ELKI. Likely because of this package's documentation: https://github.com/mhahsler/dbscan/blob/023add8ec2123a1205baff61182cb23653dbb42a/vignettes/dbscan.Rnw#L911-L920 which could use an update.
I see. I plan to change the vignette as follows:
In 2019, we performed a comparison of \pkg{dbscan} 0.9-8, \pkg{fpc} 2.1-10, ELKI 0.7, PyClustering 0.6.6, SPMF v2.10, WEKA 3.8.0, and SciKit-Learn 0.17.1 on a MacBook Pro equipped with a 2.5 GHz Intel Core i7 processor, running OS X El Capitan 10.11.6. Note that newer versions of all mentioned software packages have been released since then. Changes in data structures and added optimizations may result in significant improvements in runtime for the different packages.
This definitely is an improvement, thank you. I would suggest also mentioning that dimensionality, the distance function, data set size, and other data characteristics will have a substantial impact on runtime performance. That is often overlooked by beginners in data analysis, e.g., if they try to cluster high-dimensional word embeddings.
I have added the additional info. It is now in the development version on Github and will be part of the next release.
You do not give details on the indexes chosen in the comparison methods. From the runtime numbers, it appears that you did not enable an index in ELKI? Do the runtimes include JVM startup cost, R startup cost, or did you use internal runtime measurements (e.g. -time)? https://rdrr.io/cran/dbscan/f/inst/doc/dbscan.pdf
I suggest to make separate benchmarks with and without indexing, and experiment with different indexes such as the cover tree, which performed quite well in other studies. Benchmarking an implementation without an index vs. one with an index obviously is unfair. For OPTICS, I suggest you use a small epsilon whenever using an index, for obvious reasons.
Furthermore, benchmarks with such tiny data sets (<=8000 instances) are likely not very informative. Startup costs such as JVM hotspot compilation will contribute substantially. Here is one slightly larger (but still small!) real data set with 13k instances: https://cs.joensuu.fi/sipu/datasets/MopsiLocationsUntil2012-Finland.txt and with appropriate parameters this should work with both DBSCAN and OPTICS. For more meaningful results, I suggest using at minimum 30,000-50,000 instances of real data, e.g. coordinates from tweets (supposedly not shareable) or Flickr images (could be shareable?). With indexes, the data size should probably go up to a million points.
Also, the R dbscan package documentation should be more upfront about the fact that performance strongly depends on the use of Euclidean distance because of the k-d tree. The ANN library is great, but this is a substantial limitation. For a tweet-locations example, Haversine distance is much more appropriate.
Sorry that I have not yet prepared a technical report on the filtering step. It has some reasonable theory behind it, but it is too incremental for a major conference to publish, so there only exists a draft report.
Both will be very valuable for users, and therefore belong in the manual.