I think I see what you're getting at, but there are a number of issues with trying to get two distributions to share the same bin locations with KHist. The main issue is that the bin locations would be incorrect.

I think a better approach would be a bivariate version of OnlineStats.ExpandingHist, which apparently I never got around to exporting.
To be clear, I don't need a histogram of the joint distribution; I need 2 univariate histograms that happen to share histogram edges. I guess there could be a bivariate version of ExpandingHist similar to what you have in AverageShiftedHistograms, but that wouldn't fit my use-case.

For the 2 (or more) distributions case, I was thinking of passing in values of the form (x, label), where label selects the distribution that x was drawn from. I would have to adapt CountMap to count labels rather than x's, though I'm not sure how yet. And I'd have to change the adaptive binning so that it is based on x, ignoring labels. Unless you have a better idea (?)
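Just to make the shape of that concrete, here is a minimal sketch with fixed bins chosen up front (the adaptive-binning part is exactly what's being discussed, so it's left out; the edges and data below are made up for illustration):

# Minimal sketch of the (x, label) idea with *fixed* bins: bin on x only,
# then count labels within each bin. Not adaptive, just showing the shape.
using OnlineStats   # CountMap, fit!, value

edges = 0.0:0.1:1.0
labelcounts = [CountMap(String) for _ in 1:length(edges)-1]   # one CountMap per bin

data = [(rand(), rand(["A", "B"])) for _ in 1:10_000]         # (x, label) pairs

for (x, label) in data
    i = clamp(searchsortedlast(edges, x), 1, length(labelcounts))
    fit!(labelcounts[i], label)                               # count labels, not x's
end

value(labelcounts[1])   # e.g. Dict("A" => 507, "B" => 493) for the first bin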
"The main issue is that the bin locations would be incorrect."

How so?
"To be clear, I don't need a histogram of the joint distribution, I need 2 univariate histograms that happen to share histogram edges"

I'm with you there. In order to do it adaptively, you need changes in X's range to affect Y's range as well.
"How so?"

Say I fit a new value x which goes to bin A, a bin that is already storing the counts for several X and Y values. This adjusts the loc correctly with respect to X, but it's random with respect to Y. My guess is that everything works out in the limit with infinitely many iid, uncorrelated samples, but it's hard to say what will happen when that doesn't hold.
Back to the ExpandingHist idea...

So KHist stores "bin centers" instead of edges, and the data itself is used to adjust the centers. It doesn't make sense for X's data to change a bin center that Y's data should be informing.

ExpandingHist stores bin edges. Every time a data point is observed outside of the range that defines the edges, the range gets extended in that direction (keeping the same number of bins). Visualized here:
histogram                new data point
|---|---|---|---|              *
(every other bin merges)
|-------|-------|-------|-------| (doubled, still doesn't include new point)
|---------------|---------------|---------------|---------------| (doubled again, includes point)
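The doubling step in that picture could be written roughly like this; this is a standalone sketch, not the actual ExpandingHist code (it assumes an even number of bins and extends toward whichever side the new point fell on):

# Standalone sketch of the doubling pictured above (not the real implementation).
# Doubles the range, merging every other bin, until x falls inside the edges.
function expand_until_contains!(edges::Vector{Float64}, counts::Vector{Int}, x)
    nbins = length(counts)                     # assumed even
    while !(first(edges) <= x < last(edges))
        lo, hi = first(edges), last(edges)
        w = hi - lo
        merged = [counts[2i-1] + counts[2i] for i in 1:nbins ÷ 2]
        if x >= hi                             # extend right: [lo, lo + 2w]
            counts .= vcat(merged, zeros(Int, nbins ÷ 2))
            edges  .= range(lo, lo + 2w, length = nbins + 1)
        else                                   # extend left: [hi - 2w, hi]
            counts .= vcat(zeros(Int, nbins ÷ 2), merged)
            edges  .= range(hi - 2w, hi, length = nbins + 1)
        end
    end
    edges
end

edges  = collect(range(0.0, 1.0, length = 5))  # |---|---|---|---|
counts = [1, 2, 3, 4]
expand_until_contains!(edges, counts, 3.7)     # doubles twice, as in the picture
# edges now span [0.0, 4.0]; counts == [10, 0, 0, 0]. The caller then increments
# the bin that contains 3.7.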
ExpandingHist is also very fast. You'll probably end up with a lot of empty bins in the tails (my heuristic is to use ~3 times as many bins as you actually want), but having a lot of bins doesn't slow things down like it does for KHist (here's 1000 bins):
julia> using OnlineStats, BenchmarkTools
julia> y = randn(10^6);
julia> o = OnlineStats.ExpandingHist(0:.001:1);
julia> @btime fit!(o, y)
48.026 ms (0 allocations: 0 bytes)
ExpandingHist: n=222000000 | value=-7.0160149123116655:0.018861167450069674:11.845152537758008
The two-sample version could keep both distributions' edges in sync without doing anything statistically dubious. You just need to know an acceptable starting range.
Thanks for the explanation! Point taken on ExpandingHist being faster than KHist; maybe that's the way to go. If the data has too long a tail, I can always take a log first...

Still, I don't see using KHist being an issue; I think it's a matter of interpretation. I would interpret the multi-variate KHist as constructing a histogram over the mixture of the various distributions, i.e. the mixture over x's you get by just ignoring labels. For my feature-selection use-case, this interpretation makes sense, since the data were originally just a bunch of x's drawn from a single distribution. Later on, someone attached labels to each data point based on additional info, and we're trying to CountMap on labels in each bin to quantify how discriminative x is of the labels.

Sorry, it would have been clearer if I had formulated it this way to begin with, as a mixture.
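For what it's worth (this example isn't from the thread, just to make the use-case concrete): once both labels' counts live on the same bins, a KS-style score is just the largest gap between the two empirical CDFs built from those bins.

# Made-up per-bin counts for two labels that share the same bin edges.
countsA = [5, 20, 40, 25, 10]
countsB = [15, 30, 30, 15, 10]

cdfA = cumsum(countsA) ./ sum(countsA)
cdfB = cumsum(countsB) ./ sum(countsB)
ks = maximum(abs.(cdfA .- cdfB))   # 0.2 here; larger means x separates the labels better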
Ah! Okay, so it's like a groupby! On the same page now.

Yes, that would work just fine, I think. Maybe take a look at OnlineStatsBase.GroupBy for inspiration.
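If I recall the API correctly, GroupBy usage looks roughly like this (one independent stat per group key, which is why it's inspiration rather than a solution here: each group's histogram would get its own edges). Exact constructor details may differ between OnlineStats/OnlineStatsBase versions:

# Rough GroupBy usage: one stat per group key, fit with (group, value) pairs.
using OnlineStats

o = GroupBy(String, KHist(20))            # an independent KHist per label
labels = rand(["A", "B", "C"], 10^5)
xs = randn(10^5)
fit!(o, zip(labels, xs))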
I tried out the suggestion in your first post and saw noticeable slowdowns, even when using Counter as the stat.

However, I've added (on master) KIndexedPartition along with some plot recipes:
julia> using OnlineStats, Plots

julia> o = KIndexedPartition(Float64, () -> CountMap(String));
julia> fit!(o, zip(rand(10^6), rand(["A","B","C"], 10^6)));
julia> plot(plot(o), plot(o,type=2), plot(o,type=3))
For the use-case where you want to use OnlineStats to compare two distributions, for example to select features by scoring them with a Kolmogorov-Smirnov (KS) or similar statistic, it is useful to do adaptive binning jointly over both distributions, but track counts for each distribution separately. (If you bin each distribution separately, you have trouble computing the KS statistic, because the bins don't match.)

I have an implementation of this that was a fork of AdaptiveBins before the November 2018 rewrite of histograms.jl. Now, looking at how to update that code, it seems like I could generalize KHistBin to something like ... and use nobs whenever I need the overall count. Then picking S = OnlineStatsBase.Counter should give you the current functionality, and picking S = OnlineStatsBase.CountMap should give you multi-distribution / multi-class counts.

I'm planning to submit a PR. Any comments or suggestions? @joshday
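A hypothetical sketch of the kind of generalization described above (not the actual snippet from the issue; the names here are made up): a bin that carries an arbitrary stat S, with nobs recovering the overall count.

# Hypothetical sketch only: a histogram bin that carries an OnlineStat
# instead of a plain count.
using OnlineStats   # Counter, CountMap, fit!, merge!, nobs

struct StatBin{T,S}   # made-up name for the sketch
    loc::T
    stat::S
end

bincount(b::StatBin) = nobs(b.stat)   # "use nobs whenever I need the overall count"

# S = Counter recovers the current behavior (a plain count per bin);
# S = CountMap gives per-label counts per bin.
b1 = StatBin(0.5, fit!(CountMap(String), ["A", "A", "B"]))
b2 = StatBin(0.7, fit!(CountMap(String), ["B", "C"]))

# Merging two adaptive bins: nobs-weighted location, merged label counts.
function mergebins(a::StatBin, b::StatBin)
    n = bincount(a) + bincount(b)
    loc = (bincount(a) * a.loc + bincount(b) * b.loc) / n
    StatBin(loc, merge!(a.stat, b.stat))   # mutates a.stat; fine for a sketch
end

mergebins(b1, b2)   # loc ≈ 0.58; counts: "A" => 2, "B" => 2, "C" => 1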