joshday / OnlineStats.jl

⚡ Single-pass algorithms for statistics
https://joshday.github.io/OnlineStats.jl/latest/
MIT License

Generalize KHistBin.count :: OnlineStat{T} instead of Int #153

Closed: kolia closed this issue 4 years ago

kolia commented 5 years ago

For the use-case where you want to use OnlineStats to compare two distributions, for example to select features by scoring them with a Kolmogorov-Smirnov (KS) or similar statistic, it is useful to do adaptive binning jointly over both distributions, but track counts for each distribution separately. (If you bin each distribution separately, you have trouble computing the KS statistic, because bins don't match.)
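
For concreteness, once both samples are counted over the same bin edges, the KS statistic reduces to the largest gap between the two binned empirical CDFs. A toy illustration (not OnlineStats code):

# KS statistic from two count vectors that share the same bin edges.
ks_from_counts(c1::Vector{Int}, c2::Vector{Int}) =
    maximum(abs.(cumsum(c1) ./ sum(c1) .- cumsum(c2) ./ sum(c2)))

ks_from_counts([5, 10, 20, 5], [20, 10, 5, 5])   # 0.375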

I have an implementation of this that was a fork of AdaptiveBins before the November 2018 rewrite of histograms.jl. Now that I'm looking at how to update that code, it seems like I could generalize KHistBin to something like:

struct KHistBin{T, S <: OnlineStat{T}}
    loc::T     # bin location ("center"), as in the current KHistBin
    count::S   # per-bin stat: Counter reproduces the current Int count, CountMap gives per-label counts
end

and use nobs whenever I need the overall count.

Then picking S = OnlineStatsBase.Counter should give you the current functionality, and picking S = OnlineStatsBase.CountMap should give you multi-distribution / multi-class counts.
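
As a rough sketch of the two choices using existing OnlineStatsBase types (the generalized KHistBin itself is only the proposal above):

using OnlineStatsBase

c = fit!(Counter(), [0.1, 0.4, 0.9])         # current behavior: one overall count per bin
m = fit!(CountMap(String), ["A", "A", "B"])  # multi-class: counts keyed by distribution label
nobs(c), nobs(m)                             # (3, 3), so nobs gives the overall count either way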

I'm planning to submit a PR. Any comments or suggestions? @joshday

joshday commented 5 years ago

I think I see what you're getting at, but there are a number of issues with trying to get two distributions to have the same bin locs with KHist. The main thing is that the bin locs would be incorrect.

I think a better approach would be a bivariate version of OnlineStats.ExpandingHist, which apparently I never got around to exporting.

kolia commented 5 years ago

> I think a better approach would be a bivariate version of OnlineStats.ExpandingHist, which apparently I never got around to exporting.

To be clear, I don't need a histogram of the joint distribution, I need 2 univariate histograms that happen to share histogram edges. I guess there could be a bivariate version of ExpandingHist similar to what you have in AverageShiftedHistograms, but that wouldn't fit my use-case.

For the 2 (or more) distributions case, I was thinking of passing in values of the form (x, label), where label selects the distribution that x was drawn from. I would have to adapt CountMap to count labels, not xs; not sure how yet. And I'd have to change the adaptive binning so that it is based on x, ignoring labels. Unless you have a better idea?
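
A hypothetical sketch of that rule (illustrative names, not OnlineStats internals): the bin location adapts on x alone, while the per-bin stat sees only the label.

using OnlineStatsBase

mutable struct LabeledBin{S <: OnlineStat}
    loc::Float64   # adapted from x values only
    count::S       # e.g. a CountMap over labels
end
LabeledBin(loc::Real) = LabeledBin(Float64(loc), CountMap(String))

function fitbin!(bin::LabeledBin, xy::Tuple)
    x, label = xy
    n = nobs(bin.count)
    bin.loc = (n * bin.loc + x) / (n + 1)   # running mean of x; labels play no role here
    fit!(bin.count, label)                  # only the label is counted
    return bin
end

b = LabeledBin(0.0)
fitbin!(b, (0.3, "A")); fitbin!(b, (0.7, "B"))
b.loc, value(b.count)                       # loc == 0.5; counts are "A" => 1, "B" => 1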

> The main thing is that the bin locs would be incorrect.

How so?

joshday commented 5 years ago

> To be clear, I don't need a histogram of the joint distribution, I need 2 univariate histograms that happen to share histogram edges

I'm with you there. In order to do it adaptively, you need changes in X's range to affect Y's range as well.

> How so?

Say I fit a new value x that goes to bin A, which is already storing the counts for several X and Y values. This adjusts the loc correctly with respect to X, but it's random with respect to Y. My guess is that everything works out in the limit with infinite iid uncorrelated samples, but it's hard to say what will happen when that doesn't hold.

Back to the ExpandingHist idea...

So KHist stores "bin centers" instead of edges. The data itself is used to adjust the centers. It doesn't make sense for X's data to change the bin center that Y's data should be informing.

ExpandingHist stores bin edges. Every time a data point is observed outside of the range that defines the edges, the range gets extended in that direction (with the same number of bins). Visualized here:

histogram                                               new data point
|---|---|---|---|                                         *

(every other bin merges)
|-------|-------|-------|-------|   (doubled, still doesn't include new point)

|---------------|---------------|---------------|---------------|  (doubled again, includes point)
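
A minimal sketch of that expansion step (hypothetical helper, not the actual OnlineStats internals); it assumes an even number of bins and only shows expansion to the right, with the left case being symmetric:

function expand_right!(counts::Vector{Int}, lo::Real, hi::Real, x::Real)
    while x > hi
        hi = lo + 2 * (hi - lo)                                  # double the covered range
        merged = [counts[i] + counts[i + 1] for i in 1:2:length(counts)]
        counts .= vcat(merged, zeros(Int, length(counts) ÷ 2))   # upper half starts empty
    end
    return lo, hi
end

counts = [1, 2, 3, 4]
expand_right!(counts, 0.0, 1.0, 3.7)   # returns (0.0, 4.0); counts ends as [10, 0, 0, 0]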

It's also very fast. You'll probably end up with a lot of empty bins in the tails (my heuristic is to use ~3 times as many bins as you actually want), but having a lot of bins doesn't slow things down like it does for KHist (here's 1000):

julia> using OnlineStats, BenchmarkTools

julia> y = randn(10^6);

julia> o = OnlineStats.ExpandingHist(0:.001:1);

julia> @btime fit!(o, y)
  48.026 ms (0 allocations: 0 bytes)
ExpandingHist: n=222000000 | value=-7.0160149123116655:0.018861167450069674:11.845152537758008

The two-sample version could keep both distributions' edges in sync without doing anything statistically dubious. You just need to know an acceptable starting range.
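
A hypothetical sketch of that two-sample version (illustrative names, not an OnlineStats type): both samples share one set of edges, so an out-of-range point from either sample doubles the shared range and merges both count vectors in lockstep.

mutable struct TwoSampleHist
    edges::AbstractRange{Float64}
    counts::NTuple{2, Vector{Int}}   # counts[1] for sample X, counts[2] for sample Y
end
TwoSampleHist(lo, hi, nbins) =
    TwoSampleHist(range(lo, hi, length = nbins + 1), (zeros(Int, nbins), zeros(Int, nbins)))

function fitone!(o::TwoSampleHist, x::Real, sample::Int)
    while x > last(o.edges)          # right expansion only; the left case is symmetric
        lo, hi = first(o.edges), last(o.edges)
        o.edges = range(lo, lo + 2 * (hi - lo), length = length(o.edges))
        for c in o.counts            # merge bins pairwise in *both* count vectors
            merged = [c[i] + c[i + 1] for i in 1:2:length(c)]
            c .= vcat(merged, zeros(Int, length(c) ÷ 2))
        end
    end
    i = clamp(searchsortedlast(o.edges, x), 1, length(o.counts[sample]))
    o.counts[sample][i] += 1
    return o
end

o = TwoSampleHist(0.0, 1.0, 8)
fitone!(o, 0.4, 1); fitone!(o, 3.7, 2)   # the second point doubles the shared range twice, to 0..4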

kolia commented 5 years ago

Thanks for the explanation! Point taken on ExpandingHist being faster than KHist; maybe that's the way to go. If the data has too long a tail, I can always take a log first...

I still don't see using KHist as being an issue, though; I think it's a matter of interpretation. I would interpret the multi-distribution KHist as constructing a histogram over the mixture of the various distributions; this is the mixture over xs you get by just ignoring labels. For my feature-selection use-case, this interpretation makes sense, since the data were originally just a bunch of xs drawn from a single distribution. Later on, someone attached labels to each data point based on additional info, and we're using CountMap on the labels in each bin to quantify how discriminative x is of the labels.

Sorry, it would have been clearer if I had formulated it this way to begin with, as a mixture.

joshday commented 5 years ago

Ah! Okay, so it's like a groupby! On the same page now.

Yes, I think that would work just fine. Maybe take a look at OnlineStatsBase.GroupBy for inspiration.
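
A rough illustration with the existing GroupBy stat, which keeps one copy of a stat per group (here the group plays the role of the label):

using OnlineStats

o = GroupBy(String, Mean())
fit!(o, zip(rand(["A", "B", "C"], 10^4), randn(10^4)))   # observations are (label, x) pairs
value(o)                                                  # one fitted Mean per label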

joshday commented 4 years ago

I tried out the suggestion in your first post and saw noticeable slowdowns, even when using Counter as the stat.

However, I've added (on master) KIndexedPartition along with some plot recipes:

julia> o = KIndexedPartition(Float64, () -> CountMap(String));

julia> fit!(o, zip(rand(10^6), rand(["A","B","C"], 10^6)));

julia> plot(plot(o), plot(o,type=2), plot(o,type=3))
[Screenshot: the three KIndexedPartition plots]