johnmyleswhite / SimpleAintEasy

A compendium of the pitfalls and problems that arise when using standard statistical methods

More information about what to cover #2

Open · CamDavidsonPilon opened this issue 10 years ago

CamDavidsonPilon commented 10 years ago

In the README, only basic summary statistics are mentioned: mean, median, mode, and I'm guessing std. dev., variance, correlations, etc. are also to be included. I'm curious how far out you're interested in extending the material. What about

I guess what I'm really asking is: who is the audience for the text? Are we interested in an intro text or an advanced text?

johnmyleswhite commented 10 years ago

I'd like to cover anything that's not so advanced it would be hard to follow for non-specialists. So linear regression and histograms are great, kNN is right inside the boundaries, and graphical models would be too far out.

As you can probably tell from the text I've written so far, the current treatment assumes people have a pretty solid mathematical background, although the material is still really simple.

robbymeals commented 10 years ago

Yeah this was my first question too. Actually I think it might be useful to more explicitly define what is meant by "counterexample".

I am interpreting this to mean: "As a practitioner or a consumer of statistical analysis with a reasonable level of sophistication and training, either formal or on-the-job, these are examples of things where your training and sophistication may fail you." Not "here is a weird edge case that is interesting but you will never see" or "here is a seeming paradox or counterintuitive idea that is derived from basic stat principles".

That about right?

johnmyleswhite commented 10 years ago

Personally, I'd like to focus on examples where you learn something about statistics by thinking through that example.

So examples that directly occur in real applications are really great, but less realistic examples are also really effective if they get you to immediately understand an important issue that you might otherwise be confused about. The confidence interval example from Berger and Wolpert that I want to write up is in the latter category: the example uses some very unrealistic assumptions in order to tell a story that's much simpler to understand than any other example I've ever seen about confidence intervals. It's kind of like a Kafka short story in that regard.

Maybe we should downplay the word "counterexample" and focus more on "illustrative examples" that illuminate a non-obvious property of statistical methods. To me, the important part is to supplement traditional abstract theorems with highly specific, easy-to-follow examples. And then use this to emphasize a point that I really care about: reminding people that there's no universal solution and that even the simplest methods break on well-constructed edge cases.

FWIW, the first medians example is a theoretical articulation of a problem that really happened at Facebook at one point.

StefanKarpinski commented 10 years ago

I think the book will be more interesting and helpful if it also provides solutions in the sense of not only giving you the negative – i.e. "here's why the obvious thing doesn't always work very well" – but also balances that with the positive – i.e. "here's the less obvious thing you can do that's better". My favorite example, linked above, is rather than using histograms, which are rife with problems, using empirical CDFs, which are almost universally better than histograms, except for being less obvious.
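
To make the contrast concrete, here's a minimal sketch (Python with numpy/matplotlib; the data is made up for illustration): the histogram's shape depends on an arbitrary bin count, while the ECDF involves no tuning at all.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: a two-bump mixture that histograms can mangle
# depending on how the bins happen to fall.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 0.5, 200)])

# Empirical CDF: sort the data and step from 0 to 1.
xs = np.sort(x)
ys = np.arange(1, len(xs) + 1) / len(xs)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(x, bins=10)            # the bin count is an arbitrary choice
ax1.set_title("Histogram (bins=10)")
ax2.step(xs, ys, where="post")  # no bin width or bandwidth to choose
ax2.set_title("Empirical CDF")
plt.show()
```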

johnmyleswhite commented 10 years ago

When possible, solutions are great. But I think sometimes the solution is to give up: for a bimodal distribution, for example, there's no single summary statistic that I find very useful. So I don't think we should require solutions, even if we encourage them.
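
As a quick illustration (hypothetical data, Python/numpy): for a well-separated mixture, both the mean and the median land in a region where there's almost no data, so neither summarizes the distribution in any useful way.

```python
import numpy as np

rng = np.random.default_rng(1)
# Equal mixture of two well-separated normals: modes near -3 and +3.
x = np.concatenate([rng.normal(-3, 0.5, 1000), rng.normal(3, 0.5, 1000)])

# Both "centers" fall near 0, where essentially no observations live.
print(np.mean(x), np.median(x))
```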

Are CDFs actually better? I had a discussion about that recently with my coworkers and didn't think CDFs were actually much more trustworthy than histograms. FWIW, there's a well-known result in statistics that KDEs have superior convergence rates relative to histograms.

StefanKarpinski commented 10 years ago

I would also be interested in cases where naïve approaches work better than expected. E.g., there are a lot of situations where it's really just fine to assume normality, even if you know that it's not the case. The effect of assuming a wrong model is often to make tests more sensitive than they should be. If you're trying to conclude whether an effect is present or not, this is devastating, since you may falsely conclude that there is an effect when, in fact, you just have an incorrect model.

In other situations, such as using statistical tests in control systems, it may be perfectly acceptable to be overly sensitive. For example, at some point I designed an online test to choose among alternative channels of communication using a t-test on packet round-trip times. They are not even remotely normally distributed (it's closer to log-normal), but it was ok, because the effect of this wrong model was to make the occasional very-high-latency packet really bump up the estimate of the mean, which was actually a desirable feature.

Addressing when assuming known-wrong models is safe and when it's unsafe is a pretty interesting general topic.
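
A toy version of that setup might look like this (Python/scipy; the parameters and channel names are made up, not from the actual system): the t-test's normality assumption is wrong for log-normal round-trip times, but the heavy tail inflates the sample mean of the bad channel, which is exactly the behavior you want when picking a channel.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Two hypothetical channels with log-normal round-trip times;
# channel B occasionally produces very-high-latency packets.
rtt_a = rng.lognormal(mean=3.0, sigma=0.5, size=500)
rtt_b = rng.lognormal(mean=3.0, sigma=1.2, size=500)

# The normality assumption is known-wrong here, but the occasional
# huge RTT bumps up channel B's mean, which is what we want to detect.
result = stats.ttest_ind(rtt_a, rtt_b, equal_var=False)
print(rtt_a.mean(), rtt_b.mean(), result.pvalue)
```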

StefanKarpinski commented 10 years ago

A histogram does smoothing by binning, while a KDE does smoothing with a smooth kernel. An ECDF does no smoothing – it just plots the data – so there's far less room for the visualization to be misleading. You can, of course, still draw incorrect conclusions from what you see, but at least the visualization itself can't lie to you. Instead of relying on smoothing, each point in an ECDF is buffered by all of the data, so if you draw an error envelope around the ECDF, it is much tighter than the error envelope around a histogram. KDEs are fine if you have a reason to suspect smooth underlying behavior in your data, but that's just not true a lot of the time, and you get drastically false smooth estimates from KDEs. I'm not even sure how to go about drawing error bounds on a KDE.
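
For the ECDF envelope, one standard construction is the Dvoretzky–Kiefer–Wolfowitz inequality, which gives a uniform confidence band with no smoothing assumptions. A minimal sketch (Python/numpy; the 95% level is just a conventional choice):

```python
import numpy as np

def ecdf_with_dkw_band(x, alpha=0.05):
    """ECDF plus the uniform confidence band implied by the
    Dvoretzky-Kiefer-Wolfowitz inequality:
    P(sup |F_n - F| > eps) <= 2 * exp(-2 * n * eps^2)."""
    n = len(x)
    xs = np.sort(x)
    ys = np.arange(1, n + 1) / n
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))  # shrinks like 1/sqrt(n)
    return xs, ys, np.clip(ys - eps, 0, 1), np.clip(ys + eps, 0, 1)

x = np.random.default_rng(3).normal(size=100)
xs, ys, lower, upper = ecdf_with_dkw_band(x)
```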

johnmyleswhite commented 10 years ago

I think your point about the robustness of the t-test to deviations from assumptions would be perfect for a series of examples where the t-test works well and also where it works poorly relative to a rank sum test or similar non-parametric test for differences.
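
A sketch of what one such pair of examples might look like (Python/scipy; the distributions are chosen purely for illustration): under normality the t-test is hard to beat, while under heavy tails the rank-sum test keeps its power and the t-test loses it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def power(sampler_a, sampler_b, test, n=50, reps=1000, alpha=0.05):
    """Fraction of simulated experiments in which `test` rejects."""
    hits = sum(test(sampler_a(n), sampler_b(n)).pvalue < alpha
               for _ in range(reps))
    return hits / reps

# Normal data with a mean shift: the t-test's home turf.
norm_a = lambda n: rng.normal(0.0, 1.0, n)
norm_b = lambda n: rng.normal(0.5, 1.0, n)
# Heavy-tailed (Cauchy) data with the same shift: the t-test's
# reliance on sample means falls apart; ranks don't care.
cauchy_a = lambda n: rng.standard_cauchy(n)
cauchy_b = lambda n: rng.standard_cauchy(n) + 0.5

ttest = lambda a, b: stats.ttest_ind(a, b, equal_var=False)
ranksum = lambda a, b: stats.mannwhitneyu(a, b, alternative="two-sided")

for name, a, b in [("normal", norm_a, norm_b), ("cauchy", cauchy_a, cauchy_b)]:
    print(name, power(a, b, ttest), power(a, b, ranksum))
```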

johnmyleswhite commented 10 years ago

Assuming that you're working with a smooth distribution, for both ECDFs and KDEs, you can calculate the mean integrated squared error for any given data set: http://en.wikipedia.org/wiki/Mean_integrated_squared_error

It would be interesting to see which is worse in a couple of reasonable settings.
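
One rough way to run that comparison (Python/scipy; the normal "truth" is assumed only because simulation makes the true curves known): approximate the integrated squared error of a KDE against the true density, and of the ECDF against the true CDF. The two numbers live on different scales (density vs. CDF), so this illustrates the calculation rather than settling the argument.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)  # the truth is known only because we simulated it
grid = np.linspace(-4, 4, 1001)

# KDE vs. true density: integrated squared error on the grid.
kde = stats.gaussian_kde(x)
ise_kde = trapezoid((kde(grid) - stats.norm.pdf(grid)) ** 2, grid)

# ECDF vs. true CDF: the analogous integrated squared error.
ecdf = np.searchsorted(np.sort(x), grid, side="right") / n
ise_ecdf = trapezoid((ecdf - stats.norm.cdf(grid)) ** 2, grid)

# Averaging either quantity over many simulated data sets estimates
# the mean integrated squared error (MISE).
print(ise_kde, ise_ecdf)
```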

StefanKarpinski commented 10 years ago

Ah, yes, well, I have had the dubious pleasure of working extensively with non-smooth, non-parametric data. Specifically, distributions of properties of network traffic data: packet sizes, inter-packet intervals, packets per flow, etc. Nothing is ever, ever nice in that world.