joshday / OnlineStats.jl

⚡ Single-pass algorithms for statistics
https://joshday.github.io/OnlineStats.jl/latest/
MIT License
831 stars 62 forks source link

Best practices for high online statistics for many parameters #261

Closed vandenman closed 1 year ago

vandenman commented 1 year ago

Related to https://github.com/joshday/OnlineStats.jl/issues/158

I'm doing some MCMC where there are too many parameters to save all samples. So instead I figured, let's use OnlineStats to store only the few statistics that I'm interested in (e.g., Mean, Variance, AutoCorrelation). However, it is unclear to me how to do this properly. Right now, I have a function like so:

function initialize_online_statistics(online_statistics, p::Integer, k::Integer)

    ne_plus_diag = p * (p + 1) ÷ 2
    ne = ne_plus_diag - p

    # not sure about this structure
    return (
        a = [OnlineStats.Group([copy(online_statistics) for _ in 1:ne_plus_diag]) for _ in 1:k],
        b = OnlineStats.Group([copy(online_statistics)  for _ in 1:ne]),
        c = copy(online_statistics)
    )
end

The idea is that a user (just me for now) supplies a Series which gets passed to online_statistics. Internally, I run some MCMC algorith and after each iteration I update the statistics for a, b, and c. This way, a user can specify which statistics to track themselves and they're not baked into the code.

Example usage:

initialize_online_statistics(OnlineStats.Series(Mean(), Variance(), AutoCov(5)), 5, 10)

works fine. However, for a larger size, this same fails with a StackOverflowError:

initialize_online_statistics(OnlineStats.Series(Mean(), Variance(), AutoCov(5)), 116, 100)
ERROR: StackOverflowError:
 [1] promote_type(::Type, ::Type, ::Type, ::Vararg{Type}) (repeats 6161 times)
   @ Base ./promotion.jl:293
 [2] Group(stats::Vector{OnlineStatsBase.Series{Number, Tuple{Mean{Float64, EqualWeight}, Variance{Float64, Float64, EqualWeight}, AutoCov{Float64, Float64}}}})
   @ OnlineStatsBase ~/.julia/packages/OnlineStatsBase/4TwKN/src/stats.jl:368
 [3] #26
   @ ./none:0 [inlined]
 [4] iterate
   @ ./generator.jl:47 [inlined]
 [5] collect(itr::Base.Generator{UnitRange{Int64}, var"#26#29"{OnlineStatsBase.Series{Number, Tuple{Mean{Float64, EqualWeight}, Variance{Float64, Float64, EqualWeight}, AutoCov{Float64, Float64}}}, Int64}})
   @ Base ./array.jl:782
 [6] initialize_online_statistics(online_statistics::OnlineStatsBase.Series{Number, Tuple{Mean{Float64, EqualWeight}, Variance{Float64, Float64, EqualWeight}, AutoCov{Float64, Float64}}}, p::Int64, k::Int64)
 [7] top-level scope

I think the problem is that Group does not specialize when the contents are all the same. For example Group(Mean(), Mean()) has type Group{Tuple{Mean{Float64, EqualWeight}, Mean{Float64, EqualWeight}}, Union{Tuple{Number, Number}, NamedTuple{names, R} where R<:Tuple{Number, Number}, AbstractVector{<:Number}} where names} and Group(Mean(), Mean(), Mean()) has an additional , Mean{Float64, EqualWeight}. Eventually, a StackOverflow is reached.

For now, I can use a Vector{T} Where {T<:OnlineStat}, but I feel like I'm reinventing the idea behind Group. Perhaps there should be a specialized type for this case, like MonoGroup{T, U<:Int} where {T<:Union{Series, OnlineStat}}?

joshday commented 1 year ago

A few things:

1) Thanks for the report! I was able to make a simpler example:

julia> typeof(Group([Mean() for _ in 1:1375]...))
Group{NTuple{1375, Mean{Float64, EqualWeight}}, Union{NTuple{1375, Number}, NamedTuple{names, R} where R<:NTuple{1375, Number}, AbstractVector{<:Number}} where names}

julia> typeof(Group([Mean() for _ in 1:1376]...))
ERROR: StackOverflowError:
Stacktrace:
 [1] promote_type(::Type, ::Type, ::Type, ::Type, ::Vararg{Type}) (repeats 1368 times)
   @ Base ./promotion.jl:293
 [2] Group(stats::NTuple{1376, Mean{Float64, EqualWeight}})
   @ OnlineStatsBase ~/.julia/dev/OnlineStatsBase/src/stats.jl:368
 [3] Group(::Mean{Float64, EqualWeight}, ::Vararg{Mean{Float64, EqualWeight}})
   @ OnlineStatsBase ~/.julia/dev/OnlineStatsBase/src/stats.jl:372

I am at a loss as to why 1375 would be different than 1376. However, there is a one-line fix in OnlineStatsBase that I'll add:

# don't do this.  See below
julia> typeof(Group([Mean() for _ in 1:999999]...))
Group{NTuple{999999, Mean{Float64, EqualWeight}}, Union{Tuple{Number}, NamedTuple{names, R} where R<:Tuple{Number}, AbstractVector{<:Number}} where names}

2) For large numbers of stats in a Group, use a Vector:

julia> typeof(Group([Mean() for _ in 1:999999]))
Group{Vector{Mean{Float64, EqualWeight}}, Union{Tuple{Number}, NamedTuple{names, R} where R<:Tuple{Number}, AbstractVector{<:Number}} where names}

I'll have to do some benchmarking. There may be no benefit for using tuples even for a smaller number of stats.

3) If you already have a Variance, don't add in a Mean

A Variance needs to calculate the mean internally, so you're adding unnecessary compute by including a Mean as well. You can do:

o = fit!(Variance(), randn(100))

mean(o)
joshday commented 1 year ago

Eh, I lied. My "fix" broke other stuff. This may be some internal Julia limitation on tuple sizes. I'll look into it.

joshday commented 1 year ago

Figured it out. This is the culprit, which lives in Base:

promote_type(T, S, U, V...) = (@inline; promote_type(T, promote_type(S, U, V...)))

For a large number of arguments, e.g. promote_type(many_things...), this method is called over and over so we hit a stack overflow. Fortunately, changing some OnlineStatsBase code from promote_type(types...) to reduce(promote_type, types) fixes everything.


Re: Vectors vs. Tuples as the Group container: Tuples are faster for a small number of items.

joshday commented 1 year ago

new OnlineStatsBase release is pending.