joshday / OnlineStats.jl

⚡ Single-pass algorithms for statistics
https://joshday.github.io/OnlineStats.jl/latest/
MIT License
831 stars 62 forks source link

Allow counts argument in `fit!` #284

Closed adknudson closed 4 months ago

adknudson commented 5 months ago

It would be really convenient to be allowed to call fit!(o, y, k) as a shorthand way to fit observation y k times. Many stats would support this with little change, like the Hist stat for example. You would change lines 94-101

function _fit!(o::Hist, y)
    i = binindex(o.edges, y, o.left, o.closed)
    if 1 ≤ i < length(o.edges)
        o.counts[i] += 1
    else
        o.out[1 + (i > 0)] += 1
    end
end

into

function _fit!(o::Hist, y, k=1)
    i = binindex(o.edges, y, o.left, o.closed)
    if 1 ≤ i < length(o.edges)
        o.counts[i] += k
    else
        o.out[1 + (i > 0)] += k
    end
end

I'm not sure how much extra work this would be, but I would be happy to draft a pull request trying to implement this if there is interest.

adknudson commented 5 months ago

I see that CountMap has this implemented with the fit!(o, y=>n) notation. Is it reasonable to expand this to all stats?

joshday commented 5 months ago

It's a reasonable request for fit!(::OnlineStat{T}, x::Pair{T, Int}) methods. I'd be okay with a fallback method and specific types can implement their own methods as needed.

adknudson commented 5 months ago

I'm looking through OnlineStatsBase to get an understanding of what I would need to change to get this working. It looks like for OnlineStat{T}, T is the type of observation that the stat can work on. So for CountMap, it is defined as

 CountMap{T, A <: AbstractDict{T, Int}} <: OnlineStat{Union{T, Pair{<:T,<:Integer}}}

which means it can accept a T, or a Pair{T, Int}. Is it really necessary for Pair{T,Int} to be part of the type? I can go two ways with a PR:

  1. Add Pair{T,Int} to all stats and implement _fit!(o, xy::Pair) for each stat
  2. Define fit!(o::OnlineStat{T}, xy::Pair{<:T, <Integer}) and the fallback method _fit!(o::OnlineStat{T}, xy::Pair{<:T, <Integer}) which can be specialized for certain stats (e.g. Hist)

I'm inclined towards option 2 as then there is no ambiguity of if a stat can fit a pair-type observation. Generally the new code would be

fit!(o::OnlineStat{T}, xy::Pair{<:T, <:Integer}) where {T} = (_fit!(o, xy); return o)

function _fit!(o::OnlineStat{T}, xy::Pair{<:T, <:Integer}) where {T}
    x, y = xy
    for _ in 1:y
        _fit!(o, x)
    end
    return o
end

I think CountMap could either keep or drop the Pair{<:T,<:Integer} in the union type. I think it would be cleaner to drop it, but I'm not sure if that counts as a breaking change.

joshday commented 5 months ago

Is it really necessary for Pair{T,Int} to be part of the type?

So now I'm remembering why I haven't done this...

fit!(o::OnlineStat{T}, x::S) will iterate through x and fit! each element, so adding pair methods is a breaking change, e.g. fit!(Mean(), 1 => 5) would work differently. Previously this would result in value=3 and nobs=2. Adding this new method would result in value=1 and nobs=5.

I think we need a new function to indicate counts, fitn! or similar.

In retrospect, building fit! around iteration was a mistake. OnlineStats 2.0 should have a push!/append!-like approach to make things more explicit.

adknudson commented 5 months ago

fit!(Mean(), 1 => 5) feels more like a bug than anything. Instead of introducing fitn!, why not dispatch on fit!(o, x, k)? That shouldn't be breaking and is easy to fall back on.