Support for parameter dependencies

I'm looking for a good tool to benchmark ELKI ( http://elki.dbs.ifi.lmu.de/ ) clustering performance across parameters.

The problem is that the parameters aren't as nicely uniform as in your examples, and they have strong interdependencies.

The most interesting parameter is obviously the clustering algorithm. Say I'm looking only at k-means and DBSCAN for this example (there are many more in ELKI, which is why I could use benchmarking tool support).

k-means has the key parameters "kmeans.k" (the number of clusters) and the initialization method. Randomized initialization methods will also have a seed parameter, to fix the random seed.
For DBSCAN, the key parameters are the distance function, the radius epsilon (which depends a lot on the distance function), and minPts, which interacts with the radius: a larger radius will need a larger minPts.

The big challenge here is the dependencies between the parameters. The simplest one is that the "k" parameter only exists for k-means, whereas for DBSCAN one needs to choose the distance function, minPts and epsilon. But then, there are also k-means initialization heuristics that have parameters of their own, such as the random seed...

Will 3X be able to handle such complex cases?

Thanks for your input.

Unfortunately, 3X does not explicitly support such nested/dependent parameters yet. However, I think there is a relatively simple way to emulate them for now without losing much of the tool's functionality. By defining all dependent parameters at the top level without any special structure, and assigning a special value (e.g., null or undef) to all irrelevant parameters, you can achieve an effect similar to having dependent parameters.

In your example of benchmarking ELKI, you could define an additional null value for all dependent input parameters:

algorithm
- kmeans
- DBSCAN
- ...
k
- null
- 3
- 4
- ...
distance
- null
- Euclidean
- ...
...

Because all input parameter values will be available to your program as environment variables, you can easily pick out the values of the relevant dependent parameters based on the value of algorithm (e.g., the value of k when algorithm=kmeans). Of course, since 3X won't rule out invalid input combinations (e.g., algorithm=DBSCAN k=3 distance=Euclidean or algorithm=kmeans k=null distance=null), you will need to take extra care when using some features, such as generating the full combination (cross product) of parameter values for planning runs. However, many features will still work well with such a flattened parameter space, such as charting an output metric across algorithms, or charting the effect of a dependent parameter for a particular algorithm.
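As a rough sketch of what the run program could do with the flattened parameters, here is a small Python helper. It is only an illustration: it assumes the environment variables carry the same names as the parameters above (algorithm, k, distance), that the special value is the string "null", and that each algorithm depends on just the parameters listed in the hypothetical RELEVANT table.

```python
import os

NULL = "null"  # assumed sentinel for "parameter does not apply"

# Hypothetical table: which dependent parameters each algorithm reads.
RELEVANT = {
    "kmeans": ["k"],
    "DBSCAN": ["distance"],
}

def relevant_parameters(env=os.environ):
    """Return only the parameters that matter for the chosen algorithm."""
    algorithm = env["algorithm"]
    if algorithm not in RELEVANT:
        raise ValueError(f"unknown algorithm: {algorithm}")
    params = {"algorithm": algorithm}
    for name in RELEVANT[algorithm]:
        value = env.get(name, NULL)
        if value == NULL:
            # A relevant parameter was left at the sentinel: invalid run.
            raise ValueError(f"{name} must be set when algorithm={algorithm}")
        params[name] = value
    # Parameters that do not apply to this algorithm are simply ignored.
    return params
```

Calling relevant_parameters() with no argument reads the real environment; passing a plain dict makes it easy to try out by hand.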

Our initial plan for this issue was to provide user-defined, general constraints over the input parameter space, so that 3X can automatically rule out invalid cases. However, I now see that supporting dependent (or hierarchical) parameters can be more intuitive, and has a lot of use cases in the data mining and machine learning domain.
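Until something like that exists in 3X, the same idea can be approximated outside the tool. The following Python sketch enumerates the cross product of the flattened parameters and keeps only the valid combinations. It is not a 3X feature; the value lists, the "null" sentinel, and the RELEVANT table are assumptions made up for this example.

```python
from itertools import product

NULL = "null"  # assumed sentinel for "parameter does not apply"

# Example value lists, following the flattened definition above.
VALUES = {
    "algorithm": ["kmeans", "DBSCAN"],
    "k": [NULL, "3", "4"],
    "distance": [NULL, "Euclidean"],
}

# Hypothetical table: which dependent parameters each algorithm reads.
RELEVANT = {
    "kmeans": {"k"},
    "DBSCAN": {"distance"},
}

def is_valid(combo):
    """Relevant parameters must be set; irrelevant ones must be null."""
    relevant = RELEVANT[combo["algorithm"]]
    for name, value in combo.items():
        if name == "algorithm":
            continue
        if (name in relevant) == (value == NULL):
            return False
    return True

def valid_combinations(values):
    """Yield every combination of the cross product that passes is_valid."""
    names = list(values)
    for chosen in product(*(values[name] for name in names)):
        combo = dict(zip(names, chosen))
        if is_valid(combo):
            yield combo

if __name__ == "__main__":
    for combo in valid_combinations(VALUES):
        print(" ".join(f"{name}={value}" for name, value in combo.items()))
```

With these example values, the 12 combinations of the full cross product shrink to 3: two for kmeans (k=3 and k=4) and one for DBSCAN (distance=Euclidean).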

I will keep this issue open to collect more concrete ideas until we make 3X handle dependent parameters natively.

netj / 3x

Support for parameter dependencies #1