A few thoughts so far:
You mean the streaming case for alphaMCD = 1.0? Why is that a good baseline? Seems fairly arbitrary to me.
What if there were no itemsets detected for that case? Does that mean we truly believe that there were no itemsets to detect for that workload, regardless of the choice of alpha?
Keep in mind that we're looking for relative comparisons.
In the case of alpha, 1.0 might be extreme, but, generally, a higher alpha (at least up to 0.5) means we are looking at more data in the estimate. Alpha = 0.5 should be a better estimator than alpha = 0.01.
In the case of reservoir sample sizes, larger is almost always better, as long as reservoir size << update rate.
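For context, here is a minimal sketch of a standard reservoir-sampling update (Algorithm R; not necessarily what MacroBase implements). The reservoir stays a uniform sample of everything seen so far, and a larger `k` simply retains more of the stream:

```python
import random

def reservoir_update(reservoir, k, item, n):
    """One Algorithm R step: keep a uniform random sample of size k over the n items seen so far."""
    if len(reservoir) < k:
        reservoir.append(item)
    else:
        # The n-th item replaces a random slot with probability k/n.
        j = random.randrange(n)
        if j < k:
            reservoir[j] = item

# Usage: maintain a 100-element uniform sample over a 10,000-item stream.
reservoir, k = [], 100
for n, item in enumerate(range(10_000), start=1):
    reservoir_update(reservoir, k, item, n)
```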
I guess.
We need a way to normalize across each dataset on the graph. Given that the streaming runs aren't an apples-to-apples comparison with the batch run, I think it makes more sense to declare one of the streaming runs as a baseline. Do you see an alternative?
Ideally, freeze a golden set of itemsets a priori, and then always evaluate against that. Given a workload, there is always a set of items that are associated with the "true" outliers but not with the "true" inliers. The problem I have with deciding the gold set on the fly is that outlier detection can be very wrong in a lot of these cases, so deriving your gold set from some wrong results seems flawed.
Comparing against the batch case seems like a somewhat-ok proxy for what I described above. (Note that I actually think comparing against what's produced by the baseline parameters in the batch case may be even better.) In general, I think moving the gold set for different parameter settings is very dangerous. The true answers shouldn't really vary according to which parameters you're evaluating.
The only caveat is that the streaming case is a different model; as you know, it uses exponentially damped windows to age out the older tuples. Is your proposal that we disable the decay?
It shouldn't matter that it's a different model. If you had an SVM and a neural network doing the same classification task, you would use the same set of gold labels when evaluating performance. I'm proposing we have some way of determining the gold set of labels (manual annotation, looking at the results from batching with a certain set of parameters, etc.) and then we stick to those labels for that workload through the entire process. Accuracy is not something that should be calculated relative to a baseline that isn't constant -- especially when we have no reason to believe that what we're comparing against is truly correct.

Now, you might say that the set of outliers produced would be different in the streaming and batch case. That is true; in fact, we are penalizing streaming for not being able to look at all the data. That's fine, though: we're trading accuracy for performance here; by not having to look at all the examples, we can achieve much higher throughputs.
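A minimal sketch of the fixed-gold-set evaluation being proposed here, assuming itemsets are represented as frozensets of attribute-value pairs (the attribute names below are made up):

```python
def precision_recall(predicted_itemsets, gold_itemsets):
    """Score any run (batch or streaming) against the same frozen gold set."""
    predicted = set(predicted_itemsets)
    gold = set(gold_itemsets)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Usage: the gold set is decided once per workload and reused for every configuration.
gold = {frozenset({("hw_model", "X1")}), frozenset({("firmware", "0.3"), ("hw_model", "X1")})}
pred = {frozenset({("hw_model", "X1")})}
print(precision_recall(pred, gold))  # (1.0, 0.5)
```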
I agree with what you're saying above. However, the actual "learning" task here is different in the streaming and batch case when decayRate < 1. If we want a real apples-to-apples comparison, we need to set decayRate to 1. Does that make sense?
By "model" I meant execution model, not ML model. The objective of the exponentially damped streaming setting is to prioritize recent points, whereas the batch model has no notion of recency.
Then, we could have a streaming run with whatever decayRate (I don't think we actually sweep decayRate currently) -- everything else would be default -- and use that as the baseline for all streaming experiments. I really don't like having a variable ground truth set across a bunch of experiments: it means that any two precision-recall graphs we get are completely incompatible.
Good -- I like this. We'll fix one configuration for the baseline for batch, and another for streaming.
Do you have any thoughts on "Can we record the entire runtime, excluding loading?"
Yeah, that's easy to do.
I like the idea of having a scoreboard of run times for each dataset that we use to track our optimizations from here on out: e.g., precision, recall, and tuples per second for each workload. It can become a table in the paper.
Ok, that sounds good.
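One way the runtime-excluding-loading measurement and a scoreboard row could fit together (load_fn and analyze_fn are placeholders, not actual MacroBase entry points):

```python
import time

def run_workload(workload_name, load_fn, analyze_fn):
    """Time only the analysis, not the data load, and emit one scoreboard row."""
    data = load_fn()  # loading is deliberately excluded from the timed region
    start = time.perf_counter()
    precision, recall, num_tuples = analyze_fn(data)
    elapsed = time.perf_counter() - start
    return {
        "workload": workload_name,
        "precision": precision,
        "recall": recall,
        "tuples_per_second": num_tuples / elapsed,
    }
```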
One additional thought: we can easily get ground truth via experiments with synthetic data.
Thinking through the plots/experiments for the paper:
Just read this again: "One additional thought: we can easily get ground truth via experiments with synthetic data." What do you mean by that? We could easily create a synthetic workload where it would be easy to know what the ground truth labels are? (by design)
Yep, we can create a synthetic dataset with complete control over the data. For example, if a class of outliers has support 0.01%, we should see a detector with a 1% minimum support threshold miss it.
Right, this is similar to what we do in the load_demo script. I like the idea of presenting these numbers in the paper too, since we'll have a very well-defined notion of correctness.
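A toy version of the synthetic-data idea: plant an outlier class at 0.01% support so the ground-truth labels are known by construction, and any summarizer with a 1% minimum support threshold should miss it (all names below are made up):

```python
import random

def make_synthetic(num_rows=1_000_000, outlier_support=0.0001, seed=0):
    """Rows are (attribute, is_outlier); the labels are ground truth by construction."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_rows):
        if rng.random() < outlier_support:
            rows.append(("rare_device", True))     # planted outlier class, ~0.01% support
        else:
            rows.append(("common_device", False))
    return rows

rows = make_synthetic()
planted = sum(1 for _, is_outlier in rows if is_outlier)
print(planted / len(rows))  # ~0.0001, far below a 1% support threshold
```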
Independent Variables
In rough order, we want to sweep each of the following variables:
To evaluate the model behavior:
To evaluate the summary behavior:
(Note: stoppingDelta will require a configuration file change.)
Dependent Variables
The figures of interest we want to measure (and eventually plot):
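Tying the two lists together, a sweep harness might look roughly like this (the grid values and run_config are placeholders, not the actual configuration keys or entry points):

```python
import itertools

# Hypothetical grid; the real independent variables are the ones listed above.
param_grid = {
    "alphaMCD": [0.01, 0.1, 0.5],
    "decayRate": [0.9, 0.95, 1.0],
}

def run_config(**params):
    # Placeholder: in reality this would launch a MacroBase run with the given
    # configuration and report the measured figures of interest.
    return {"precision": 0.0, "recall": 0.0, "tuples_per_second": 0.0}

keys = list(param_grid)
scoreboard = []
for values in itertools.product(*(param_grid[k] for k in keys)):
    params = dict(zip(keys, values))
    scoreboard.append({**params, **run_config(**params)})

for row in scoreboard:
    print(row)
```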