stanford-futuredata / macrobase

MacroBase: A Search Engine for Fast Data
http://macrobase.stanford.edu/
Apache License 2.0

Benchmark effect of model parameters #13

Closed pbailis closed 8 years ago

pbailis commented 8 years ago

Independent Variables

In rough order, we want to sweep each of the following variables:

To evaluate the model behavior:

To evaluate the summary behavior:

(Note: stoppingDelta will require a configuration file change.)

Dependent Variables

The figures of interest we want to measure (and eventually plot):

pbailis commented 8 years ago

A few thoughts so far:

deepakn94 commented 8 years ago

You mean the streaming case for alphaMCD = 1.0? Why is that a good baseline? Seems fairly arbitrary to me.

deepakn94 commented 8 years ago

What if there were no itemsets detected for that case? Does that mean we truly believe that there were no itemsets to detect for that workload, regardless of the choice of alpha?

pbailis commented 8 years ago

Keep in mind that we're looking for relative comparisons.

In the case of alpha, 1.0 might be extreme, but higher alpha (at least up to 0.5) generally means we are using more data in the estimate; alpha = 0.5 should be a better estimator than alpha = 0.01.

In the case of reservoir sample sizes, larger is almost always better, as long as the reservoir size remains much smaller than the update rate.
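
For concreteness, here is a minimal sketch of the kind of reservoir behavior being described, using standard Algorithm R; the class and method names are illustrative, not MacroBase's actual sampler API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative Algorithm R reservoir; not MacroBase's actual sampler class.
public class SimpleReservoir<T> {
    private final int capacity;
    private final List<T> samples;
    private long seen = 0;

    public SimpleReservoir(int capacity) {
        this.capacity = capacity;
        this.samples = new ArrayList<>(capacity);
    }

    public void insert(T item) {
        seen++;
        if (samples.size() < capacity) {
            samples.add(item);
        } else {
            // Each of the 'seen' items survives in the reservoir with probability capacity / seen.
            long j = ThreadLocalRandom.current().nextLong(seen);
            if (j < capacity) {
                samples.set((int) j, item);
            }
        }
    }

    public List<T> getSamples() {
        return samples;
    }
}
```

As long as the reservoir is much smaller than the arrival rate, a bigger reservoir simply gives a finer-grained picture of the same stream, which is why "larger is almost always better" holds until memory or per-item cost becomes the bottleneck.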

deepakn94 commented 8 years ago

I guess.

pbailis commented 8 years ago

We need a way to normalize across each dataset on the graph. Given that the streaming runs aren't an apples-to-apples comparison with the batch run, I think it makes more sense to declare one of the streaming runs as a baseline. Do you see an alternative?

deepakn94 commented 8 years ago

Ideally, freeze a golden set of itemsets a priori, and then always evaluate against that. Given a workload, there is always a set of items that are associated with the "true" outliers and not with the "true" inliers. The problem I have with deciding the gold set on the fly is that outlier detection can be very wrong in a lot of these cases, so scoring against a gold set derived from wrong results seems flawed.
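
As a concrete sketch of evaluating against a frozen gold set (the GoldSetEvaluator class and the string-encoded itemsets below are illustrative, not part of MacroBase):

```java
import java.util.Set;

// Hypothetical helper: precision/recall of reported itemsets against a frozen gold set.
// Itemsets are assumed to be canonicalized into comparable strings of attribute-value pairs.
public class GoldSetEvaluator {
    public static double precision(Set<String> reported, Set<String> gold) {
        if (reported.isEmpty()) return 0.0;
        long hits = reported.stream().filter(gold::contains).count();
        return (double) hits / reported.size();
    }

    public static double recall(Set<String> reported, Set<String> gold) {
        if (gold.isEmpty()) return 1.0;  // nothing to find
        long hits = reported.stream().filter(gold::contains).count();
        return (double) hits / gold.size();
    }
}
```

Because the gold set is frozen once per workload, every parameter sweep is scored against the same target, and precision-recall curves from different runs remain comparable.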

deepakn94 commented 8 years ago

Comparing against the batch case seems like a somewhat-OK proxy for what I described above. (Note that I actually think comparing against what's produced by the baseline parameters in the batch case may be even better.) In general, I think moving the gold set for different parameter settings is very dangerous: the true answers shouldn't vary according to which parameters you're evaluating.

pbailis commented 8 years ago

The only caveat is that the streaming case is a different model; as you know, it uses exponentially damped windows to age out the older tuples. Is your proposal that we disable the decay?

deepakn94 commented 8 years ago

It shouldn't matter that it's a different model. If you had an SVM and a neural network doing the same classification task, you would use the same set of gold labels when evaluating them. I'm proposing we have some way of determining the gold set of labels (manual annotation, the results of a batch run with a fixed set of parameters, etc.) and then stick to those labels for that workload through the entire process. Accuracy shouldn't be calculated relative to a moving target -- especially when we have no reason to believe that what we're comparing against is truly correct. Now, you might say that the set of outliers produced would be different in the streaming and batch cases. That's true; in effect, we are penalizing streaming for not being able to look at all the data. That's fine, though: we're trading accuracy for performance here; by not having to look at all the examples, we can achieve much higher throughputs.

pbailis commented 8 years ago

I agree with what you're saying above. However, the actual "learning" task here is different in the streaming and batch case when decayRate < 1. If we want a real apples-to-apples comparison, we need to set decayRate to 1. Does that make sense?

pbailis commented 8 years ago

By "model" I meant execution model, not ML model. The objective of the exponentially damped streaming setting is to prioritize recent points, whereas the batch model has no notion of recency.

deepakn94 commented 8 years ago

Then we could have a streaming run with some fixed decayRate (I don't think we actually sweep decayRate currently), everything else at defaults, and use that as the baseline for all streaming experiments. I really don't like having a variable ground-truth set across a bunch of experiments: it means that any two precision-recall graphs we produce are not comparable.

pbailis commented 8 years ago

Good -- I like this. We'll fix one configuration for the baseline for batch, and another for streaming.

Do you have any thoughts on "Can we record the entire runtime, excluding loading?"

deepakn94 commented 8 years ago

Yeah, that's easy to do.
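
Something like the following sketch would do it; loadData and runPipeline are placeholders standing in for the real entry points, not MacroBase method names:

```java
import java.util.List;

// Hypothetical timing harness: time the analysis separately from loading.
public class TimedRun {
    public static void main(String[] args) {
        long loadStart = System.nanoTime();
        List<double[]> data = loadData();    // excluded from the reported runtime
        long analysisStart = System.nanoTime();
        runPipeline(data);                   // scoring + summarization
        long analysisEnd = System.nanoTime();

        double analyzeSec = (analysisEnd - analysisStart) / 1e9;
        System.out.printf("load: %.2fs, analyze: %.2fs, throughput: %.0f tuples/s%n",
                (analysisStart - loadStart) / 1e9, analyzeSec, data.size() / analyzeSec);
    }

    // Placeholders for the real loading and analysis steps.
    private static List<double[]> loadData() { return List.of(new double[]{0.0}); }

    private static void runPipeline(List<double[]> data) { /* score + summarize */ }
}
```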

pbailis commented 8 years ago

I like the idea of having a scoreboard of runs for each dataset that we use to track our optimizations from here on out: e.g., precision, recall, and tuples per second for each workload. It could become a table in the paper.
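
A scoreboard along those lines could be as simple as appending one row per benchmarked run to a CSV; the file name and column set below are just a suggestion, not an existing MacroBase utility:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical scoreboard: one CSV row per (workload, configuration) run.
public class Scoreboard {
    public static void record(String workload, double precision, double recall,
                              double tuplesPerSecond) throws IOException {
        String row = String.format("%s,%.4f,%.4f,%.0f%n",
                workload, precision, recall, tuplesPerSecond);
        Files.writeString(Path.of("scoreboard.csv"), row,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```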

deepakn94 commented 8 years ago

Ok, that sounds good.

pbailis commented 8 years ago

One additional thought: we can easily get ground truth via experiments with synthetic data.

pbailis commented 8 years ago

Thinking through the plots/experiments for the paper:

deepakn94 commented 8 years ago

Just read this again: "One additional thought: we can easily get ground truth via experiments with synthetic data." What do you mean by that? That we could easily create a synthetic workload where, by design, it would be easy to know what the ground truth labels are?

pbailis commented 8 years ago

Yep, we can create a synthetic dataset with complete control over the data. For example, if a class of outliers has 0.01% support, we should see a detector configured with a 1% minimum support threshold miss it.
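
A minimal sketch of that kind of generator; the class, attribute values, and the inlier/outlier Gaussian mixture are assumptions for illustration, not the actual load_demo script:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative synthetic workload: inliers near 0, a rare outlier class shifted far away
// and tagged with a known attribute value, so ground-truth labels are known by construction.
public class SyntheticWorkload {
    public record Point(double value, String attribute, boolean isOutlier) {}

    public static List<Point> generate(int n, double outlierSupport, long seed) {
        Random rng = new Random(seed);
        List<Point> points = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            if (rng.nextDouble() < outlierSupport) {
                // e.g., outlierSupport = 0.0001 plants an outlier class at 0.01% support
                points.add(new Point(100.0 + rng.nextGaussian(), "version=bad_build", true));
            } else {
                points.add(new Point(rng.nextGaussian(), "version=ok_build", false));
            }
        }
        return points;
    }
}
```

A summarizer with a 1% minimum support threshold should then miss the planted 0.01% class, which is exactly the kind of controlled check described above.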


deepakn94 commented 8 years ago

Right, this is similar to what we do in the load_demo script. I like the idea of presenting these numbers in the paper too, since we'll have a very well-defined notion of correctness.