A few thoughts so far:
You mean the streaming case for alphaMCD = 1.0? Why is that a good baseline? Seems fairly arbitrary to me.
What if there were no itemsets detected for that case? Does that mean we truly believe that there were no itemsets to detect for that workload, regardless of the choice of alpha?
Keep in mind that we're looking for relative comparisons.
In the case of alpha, 1.0 might be extreme, but, generally, a higher alpha (at least up to 0.5) means we are looking at more data in the estimate. Alpha = 0.5 should be a better estimator than alpha = 0.01.
In the case of reservoir sample sizes, larger is almost always better, as long as reservoir size << update rate.
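For context, here is a minimal sketch of a standard reservoir-sampling update (Algorithm R; not necessarily what MacroBase implements). The reservoir stays a uniform sample of everything seen so far, and a larger `k` simply retains more of the stream:

```python
import random

def reservoir_update(reservoir, k, item, n):
    """One Algorithm R step: keep a uniform random sample of size k over the n items seen so far."""
    if len(reservoir) < k:
        reservoir.append(item)
    else:
        # The n-th item replaces a random slot with probability k/n.
        j = random.randrange(n)
        if j < k:
            reservoir[j] = item

# Usage: maintain a 100-element uniform sample over a 10,000-item stream.
reservoir, k = [], 100
for n, item in enumerate(range(10_000), start=1):
    reservoir_update(reservoir, k, item, n)
```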
I guess.
We need a way to normalize across each dataset on the graph. Given that the streaming runs aren't an apples-to-apples comparison with the batch run, I think it makes more sense to declare one of the streaming runs as a baseline. Do you see an alternative?
Ideally, freeze a golden set of itemsets a priori, and then always evaluate against that. Given a workload, there is always a set of items that are associated with the "true" outliers but not with the "true" inliers. The problem I have with deciding the gold set on the fly is that outlier detection can be very wrong in a lot of these cases, so deriving your gold set from some wrong results seems flawed.
Comparing against the batch case seems like a somewhat-ok proxy for what I described above. (Note that I actually think comparing against what's produced by the baseline parameters in the batch case may be even better.) In general, I think moving the gold set for different parameter settings is very dangerous. The true answers shouldn't really vary according to which parameters you're evaluating.
The only caveat is that the streaming case is a different model; as you know, it uses exponentially damped windows to age out the older tuples. Is your proposal that we disable the decay?
It shouldn't matter that it's a different model. If you had an SVM and a neural network doing the same classification task, you would use the same set of gold labels when evaluating performance. I'm proposing we have some way of determining the gold set of labels (manual annotation, looking at the results from batching with a certain set of parameters, etc.) and then we stick to those labels for that workload through the entire process. Accuracy is not something that should be calculated relative to a baseline that isn't constant -- especially when we have no reason to believe that what we're comparing against is truly correct.

Now, you might say that the set of outliers produced would be different in the streaming and batch case. That is true; in fact, we are penalizing streaming for not being able to look at all the data. That's fine, though: we're trading accuracy for performance here; by not having to look at all the examples, we can achieve much higher throughputs.
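A minimal sketch of the fixed-gold-set evaluation being proposed here, assuming itemsets are represented as frozensets of attribute-value pairs (the attribute names below are made up):

```python
def precision_recall(predicted_itemsets, gold_itemsets):
    """Score any run (batch or streaming) against the same frozen gold set."""
    predicted = set(predicted_itemsets)
    gold = set(gold_itemsets)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Usage: the gold set is decided once per workload and reused for every configuration.
gold = {frozenset({("hw_model", "X1")}), frozenset({("firmware", "0.3"), ("hw_model", "X1")})}
pred = {frozenset({("hw_model", "X1")})}
print(precision_recall(pred, gold))  # (1.0, 0.5)
```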
I agree with what you're saying above. However, the actual "learning" task here is different in the streaming and batch case when decayRate < 1. If we want a real apples-to-apples comparison, we need to set decayRate to 1. Does that make sense?
By "model" I meant execution model, not ML model. The objective of the exponentially damped streaming setting is to prioritize recent points, whereas the batch model has no notion of recency.
Then, we could have a streaming run with whatever decayRate (I don't think we actually sweep decayRate currently) -- everything else would be default -- and use that as the baseline for all streaming experiments. I really don't like having a variable ground truth set across a bunch of experiments: it means that any two precision-recall graphs we get are completely incompatible.
Good -- I like this. We'll fix one configuration for the baseline for batch, and another for streaming.
Do you have any thoughts on "Can we record the entire runtime, excluding loading?"
Yeah, that's easy to do.
I like the idea of having a scoreboard of run times for each dataset that we use to track our optimizations from here on out: e.g., precision, recall, and tuples per second for each workload. It can become a table in the paper.
Ok, that sounds good.
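One way the runtime-excluding-loading measurement and a scoreboard row could fit together (load_fn and analyze_fn are placeholders, not actual MacroBase entry points):

```python
import time

def run_workload(workload_name, load_fn, analyze_fn):
    """Time only the analysis, not the data load, and emit one scoreboard row."""
    data = load_fn()  # loading is deliberately excluded from the timed region
    start = time.perf_counter()
    precision, recall, num_tuples = analyze_fn(data)
    elapsed = time.perf_counter() - start
    return {
        "workload": workload_name,
        "precision": precision,
        "recall": recall,
        "tuples_per_second": num_tuples / elapsed,
    }
```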
One additional thought: we can easily get ground truth via experiments with synthetic data.
Thinking through the plots/experiments for the paper:
Just read this again: "One additional thought: we can easily get ground truth via experiments with synthetic data." What do you mean by that? We could easily create a synthetic workload where it would be easy to know what the ground truth labels are? (by design)
Yep, we can create a synthetic dataset with complete control over the data. For example, if a class of outliers has support 0.01%, we should see a detector with a 1% minimum support threshold miss it.
Right, this is similar to what we do in the load_demo script. I like the idea of presenting these numbers in the paper too, since we'll have a very well-defined notion of correctness.
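A toy version of the synthetic-data idea: plant an outlier class at 0.01% support so the ground-truth labels are known by construction, and any summarizer with a 1% minimum support threshold should miss it (all names below are made up):

```python
import random

def make_synthetic(num_rows=1_000_000, outlier_support=0.0001, seed=0):
    """Rows are (attribute, is_outlier); the labels are ground truth by construction."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_rows):
        if rng.random() < outlier_support:
            rows.append(("rare_device", True))     # planted outlier class, ~0.01% support
        else:
            rows.append(("common_device", False))
    return rows

rows = make_synthetic()
planted = sum(1 for _, is_outlier in rows if is_outlier)
print(planted / len(rows))  # ~0.0001, far below a 1% support threshold
```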
Independent Variables
In rough order, we want to sweep each of the following variables:
To evaluate the model behavior:
To evaluate the summary behavior:
(Note: stoppingDelta will require a configuration file change.)
Dependent Variables
The figures of interest we want to measure (and eventually plot):
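Tying the two lists together, a sweep harness might look roughly like this (the grid values and run_config are placeholders, not the actual configuration keys or entry points):

```python
import itertools

# Hypothetical grid; the real independent variables are the ones listed above.
param_grid = {
    "alphaMCD": [0.01, 0.1, 0.5],
    "decayRate": [0.9, 0.95, 1.0],
}

def run_config(**params):
    # Placeholder: in reality this would launch a MacroBase run with the given
    # configuration and report the measured figures of interest.
    return {"precision": 0.0, "recall": 0.0, "tuples_per_second": 0.0}

keys = list(param_grid)
scoreboard = []
for values in itertools.product(*(param_grid[k] for k in keys)):
    params = dict(zip(keys, values))
    scoreboard.append({**params, **run_config(**params)})

for row in scoreboard:
    print(row)
```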