joshday / OnlineStats.jl

⚡ Single-pass algorithms for statistics
MIT License
831 stars, 62 forks

Type consistency #256

Open Crown421 opened 1 year ago

Crown421 commented 1 year ago

I found this repo recently and, while integrating it into my code, noticed that a lot of type information is lost.


> eltype(Mean)

which is surprising, given that Mean has a <:Number type parameter. I personally would expect that

> eltype(Mean(Float32))

Surprisingly, other objects like FitNormal don't accept a type argument, even though FitNormal is parametrized with V<:Variance, so one might expect something like

> tmp = FitNormal(Float32)
FitNormal{Variance{Float32, Float32, EqualWeight}}: n=0 | value=(0.0, 1.0)
> eltype(tmp)

to work.

I am not sure when I would have time to work on something like this, but I first wanted to open this issue, and see if the above would be a desired behaviour.

joshday commented 1 year ago

I'm not sure what you mean by "type info is lost". eltype is used primarily for iteration, which isn't defined here (e.g. for i in Mean() ... is an error).

To your second point, FitNormal(Variance(Float32)) works, but I suppose the shorter FitNormal(T) would be nice to have.

Crown421 commented 1 year ago

In my specific use case I am using EnsembleProblem from SciML and reducing the results with OnlineStats, as I want to compute a lot of trajectories in a way that doesn't blow up my RAM.

My current implementation returns a Vector{<:OnlineStat} for the trajectory (which may or may not be the best option, but we will see)

However, when constructing the solution object, an eltype(eltype(T)) call happens, which leaves the solution parametrized with Any, which is not great.
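A minimal sketch of why the nested eltype call ends up as Any, and what opting in would look like. ToyMean is a hypothetical stand-in for a parametric stat and is not part of OnlineStats; the only library behaviour assumed is Base's generic eltype fallback.

```julia
# Hypothetical parametric type standing in for a stat such as Mean;
# it is NOT the OnlineStats implementation.
struct ToyMean{T<:Number}
    μ::T
    n::Int
end

# Without a custom method, Base's generic fallback applies.
fallback = eltype(ToyMean{Float32})                        # Any
nested_before = eltype(eltype(Vector{ToyMean{Float32}}))   # Any

# Opting in: report the type parameter as the element type.
Base.eltype(::Type{ToyMean{T}}) where {T<:Number} = T

specific = eltype(ToyMean{Float32})                        # Float32
nested_after = eltype(eltype(Vector{ToyMean{Float32}}))    # Float32
from_instance = eltype(ToyMean(1.0f0, 0))                  # Float32
```

The instance form works because Base defines eltype(x) as eltype(typeof(x)), so only the method on the type needs to be added.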

Long story short, I had


as reference for the behaviour I had been expecting, and was hence surprised.

joshday commented 1 year ago

Hmm, okay.

Where is the eltype(eltype(T)) happening/why is that necessary? I'm trying to understand the use case since OnlineStats aren't iterable to begin with.

I'm not sure what a "trajectory" is in this context, but maybe you want to use value.(trajectory) instead of the stats directly?

Crown421 commented 1 year ago

A trajectory in the ODE/dynamical system sense, where one might have m states, each with dimension d. This could be a scalar ODE, so each state would be a Number, or something higher dimensional, in which case each state is a Vector{<:Number}. The whole trajectory is then a Vector{<:Number} or a Vector{Vector{<:Number}}.

Now, for something like an SDE, each solution might be slightly different, and one wants summary statistics over a (large) collection of trajectories for the distribution of states at each time step.

The way I went about this is to have a Vector{<:OnlineStat}, i.e. by doing [FitNormal() for _ in 1:m] and add trajectories via broadcasting. Once the simulation is done, I can nicely get the values out by broadcasting mean.(..), cov.(...) or similar.
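The pattern can be sketched without the package as follows. ToyNormal, toyfit!, toymean and toyvar are hypothetical names that stand in for FitNormal, fit!, mean and var; the update is a plain Welford recurrence and is only assumed to resemble what OnlineStats does internally.

```julia
# Hypothetical stand-in for FitNormal with an explicit element type parameter.
mutable struct ToyNormal{T<:AbstractFloat}
    n::Int
    μ::T
    m2::T   # running sum of squared deviations from the mean (Welford)
end

ToyNormal(::Type{T}) where {T<:AbstractFloat} = ToyNormal{T}(0, zero(T), zero(T))
ToyNormal() = ToyNormal(Float64)

# Single-observation update; the observation is converted to T so the
# accumulator never changes type.
function toyfit!(o::ToyNormal{T}, y::Real) where {T}
    x = convert(T, y)
    o.n += 1
    δ = x - o.μ
    o.μ += δ / o.n
    o.m2 += δ * (x - o.μ)
    return o
end

toymean(o::ToyNormal) = o.μ
toyvar(o::ToyNormal{T}) where {T} = o.n > 1 ? o.m2 / (o.n - 1) : T(NaN)

m = 4                                        # number of time steps
stats = [ToyNormal(Float32) for _ in 1:m]    # one stat per time step

for _ in 1:100
    trajectory = randn(Float32, m)           # one simulated scalar trajectory
    toyfit!.(stats, trajectory)              # broadcast the update across time steps
end

means = toymean.(stats)                      # Vector{Float32}
vars = toyvar.(stats)                        # Vector{Float32}
```

Memory use stays at m accumulators regardless of how many trajectories are fed in, which is the point of reducing with online statistics.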

I suppose I could do this via Group, but it does not seem like there is a great constructor for large groups (but I might have missed something). Even then, if I do something like

> g = Group(FitNormal(), FitNormal())
> fit!(g, rand(2))

I can't get the means out as easily, since neither mean.(g) nor mean(g) works, so I have to go via value.

Further, even though Group is iterable, we again get

> eltype(g)

This is sensible, since a group could contain anything, but in a case like this, where all stats in the group are the same, one might expect a more specific eltype.

Also comparing to Distributions:

> eltype(Distributions.Normal(2.f0))
> eltype(Distributions.MvNormal([2.f0, 3.f0]))

Given that FitNormal and Normal otherwise function quite similarly, it is again surprising to see a difference here.

I think that eltypes are quite useful beyond iterating to indicate what kind of data is wrapped in an object.

joshday commented 1 year ago

Thanks for the info!

I'll have to mull this over a bit since I'd rather not add methods to the OnlineStatsBase interface if I can avoid it.

Crown421 commented 1 year ago

I just took a stab at creating a convenience constructor (see #258), but stumbled over additional surprising behaviour. First, the internal type of FitMvNormal is fixed to CovMatrix{Float64}, and second, the fallback value does not incorporate type information even when it can be specified (i.e. for FitNormal).

julia> m = FitNormal(Variance(Float32))
FitNormal: n=0 | value=(0.0, 1.0)

julia> typeof(value(m))
Tuple{Float64, Float64}

julia> for _ in 1:3
       fit!(m, rand(Float32))
       end

julia> m
FitNormal: n=3 | value=(0.482926, 0.478244)

julia> typeof(value(m))
Tuple{Float32, Float32}
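One way such a type switch can arise is a fallback written with Float64 literals. The sketch below uses a hypothetical ToyVar type and two hypothetical value functions; it is an assumption about the kind of cause, not a description of the actual OnlineStats source.

```julia
# Hypothetical stat holding a mean and a variance of element type T.
struct ToyVar{T<:AbstractFloat}
    n::Int
    μ::T
    σ2::T
end

# Fallback written with Float64 literals: ignores T when there is no data.
value_literal(o::ToyVar) = o.n == 0 ? (0.0, 1.0) : (o.μ, sqrt(o.σ2))

# Fallback built from T: the returned tuple has the same type before and after fitting.
value_typed(o::ToyVar{T}) where {T} = o.n == 0 ? (zero(T), one(T)) : (o.μ, sqrt(o.σ2))

unfit = ToyVar{Float32}(0, 0.0f0, 0.0f0)
fitted = ToyVar{Float32}(3, 0.5f0, 0.25f0)

t_literal = typeof(value_literal(unfit))   # Tuple{Float64, Float64}
t_typed = typeof(value_typed(unfit))       # Tuple{Float32, Float32}
t_fitted = typeof(value_typed(fitted))     # Tuple{Float32, Float32}
```

Using zero(T) and one(T) also keeps the fallback valid for user-defined number types that implement those two functions.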

I also note that

julia> typeof(m.v)
Variance{Float32, Float32, EqualWeight}

which suggests that it is possible to have a Float32 mean and a Float64 variance?

I have made an attempt to fix the above, let me know what you think.

On that note, I am using Float32/Float64 as placeholders; they could equally be replaced with any user-defined type NewScalarNumberType <: Real. This might be quite interesting.