jiahao / paper-benchmark

A short conference paper on benchmarking

Initial comments #6

Open · johnmyleswhite opened this issue 8 years ago

johnmyleswhite commented 8 years ago

Still working through the paper's details (which I really like), but here are some quick comments.

jrevels commented 8 years ago

Thanks for reading, and for the comments!

Your point about the minimum is well-taken. It's something we here at Julia Central have gone back and forth on quite a bit. I agree that the minimum is a problematic test/summary statistic, but I would argue that it is actually the correct estimand for the specific case we use it for.

Recall the motivation driving our estimator choice: we wish to estimate n, the number of benchmark executions per measurement required to overcome timer inaccuracy. For this purpose, the minimum is a better estimand (and estimator) than the mean/median/etc., because it ensures our choice of n will be high enough to thwart timer error even when a benchmark runs "faster than average" in a given experiment. If we instead chose the mean, our resulting choice of n might be too low for some benchmark runs.
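
To make that concrete, here's a minimal sketch of how such an estimate could work. This is illustrative only, not the paper's actual tuning procedure; the function and parameter names (`executions_per_sample`, `timer_resolution_ns`, `target_rel_error`) are invented for the example:

```julia
# Hypothetical sketch (not the paper's procedure): pick the number of executions
# per measurement, n, so that timer error stays below a target fraction of the
# total measured time, even for the fastest observed execution.
function executions_per_sample(times_ns::Vector{Float64};
                               timer_resolution_ns::Float64 = 10.0,  # assumed timer granularity
                               target_rel_error::Float64 = 0.01)     # assumed error budget
    t_min = minimum(times_ns)  # fastest observed single-execution time
    # Require n * t_min >= timer_resolution_ns / target_rel_error, so that even a
    # "faster than average" run keeps timer error below the target fraction.
    return max(1, ceil(Int, timer_resolution_ns / (target_rel_error * t_min)))
end
```

Sizing n off `minimum(times_ns)` is what makes the choice conservative: had we divided by the mean instead, a run faster than the mean could make n * t too small relative to the timer's granularity.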

Once again, I don't think the minimum is particularly suitable for summarization or hypothesis testing, and for those purposes I think your comment about tractability vs. value rings very true. I'm hoping that my exploration of non-i.i.d. resampling methods will prove fruitful, and will enable the use of more reasonable estimators for hypothesis testing/confidence interval calculation on these wacky benchmark samples. Of course, I'm not sure yet whether that will pan out, but there's always hope.
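
In case it helps make the direction concrete, here is a minimal sketch of one non-i.i.d. resampling scheme (a moving-block bootstrap percentile interval for the median). It's an illustration of the general idea, not the specific method under exploration, and every name in it is invented for the example:

```julia
using Statistics  # median, quantile

# Illustrative moving-block bootstrap: resample contiguous blocks of timings so
# that local (non-i.i.d.) correlation structure is preserved in each replicate.
function block_bootstrap_ci(times::Vector{Float64}; blocklen::Int = 50,
                            nboot::Int = 1000, alpha::Float64 = 0.05)
    n = length(times)
    @assert n >= blocklen "need at least one full block of observations"
    nblocks = cld(n, blocklen)  # enough blocks to cover the original sample size
    stats = Vector{Float64}(undef, nboot)
    for b in 1:nboot
        replicate = Float64[]
        for _ in 1:nblocks
            start = rand(1:(n - blocklen + 1))  # random block start
            append!(replicate, @view times[start:start + blocklen - 1])
        end
        stats[b] = median(replicate[1:n])  # statistic of interest on the replicate
    end
    return quantile(stats, alpha / 2), quantile(stats, 1 - alpha / 2)
end
```

The percentile interval is only one way to turn the replicates into a confidence interval; the point is just that resampling blocks, rather than individual timings, respects the serial dependence in benchmark samples.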