joshday / OnlineStats.jl

⚡ Single-pass algorithms for statistics
https://joshday.github.io/OnlineStats.jl/latest/
MIT License

Defining Group size programmatically #289

Closed: DarioSarra closed this issue 3 months ago

DarioSarra commented 3 months ago

I am using Group to calculate the mean of each column of a Matrix. However, the number of columns in the Matrix varies, and I can't find a way to construct the Group so that each Mean() is computed separately unless I hardcode the number of columns. In the example below, the hardcoded Group g1 works correctly, while the Group g2, built from a collection, clumps all the data together.

using OnlineStats, LinearAlgebra
mat = rand(100,4)
g1 = 4Mean()
fit!(g1, LinearAlgebra.eachrow(mat))

Group
├─ Mean: n=100 | value=0.47693
├─ Mean: n=100 | value=0.517347
├─ Mean: n=100 | value=0.445965
└─ Mean: n=100 | value=0.515843

g2 = Group(fill(Mean(), 4))
fit!(g2, LinearAlgebra.eachrow(mat))

Group
├─ Mean: n=400 | value=0.489021
├─ Mean: n=400 | value=0.489021
├─ Mean: n=400 | value=0.489021
└─ Mean: n=400 | value=0.489021

Subquestion: Eventually, I would like to calculate the means of a 3D tensor along the third dimension. I'd be grateful if someone could help with that, too.

DarioSarra commented 3 months ago

After some investigation (the confusion was probably due to my unfamiliarity with Julia's syntax), I realized that 4Mean() is shorthand for the multiplication 4 * Mean(). Building the Group from fill(Mean(), 4), on the other hand, was the reason every stat received all of the data. So, the solution is:

mat = rand(100,4);
g1 = 4Mean();
g2 = Group(fill(Mean(), 4));
n = 4;
g3 = n * Mean();

fit!(g1, LinearAlgebra.eachrow(mat));
Group
├─ Mean: n=100 | value=0.533386
├─ Mean: n=100 | value=0.503356
├─ Mean: n=100 | value=0.469497
└─ Mean: n=100 | value=0.435189

fit!(g2, LinearAlgebra.eachrow(mat));
Group
├─ Mean: n=400 | value=0.485357
├─ Mean: n=400 | value=0.485357
├─ Mean: n=400 | value=0.485357
└─ Mean: n=400 | value=0.485357

fit!(g3, LinearAlgebra.eachrow(mat))
Group
├─ Mean: n=100 | value=0.533386
├─ Mean: n=100 | value=0.503356
├─ Mean: n=100 | value=0.469497
└─ Mean: n=100 | value=0.435189

g1 == g3
true
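
A note on why g2 shows n=400 with identical values: `fill(x, n)` stores the same object in every slot, so `fill(Mean(), 4)` yields four references to one `Mean`, and every row updates that single stat four times. The sketch below contrasts this with a comprehension, which calls `Mean()` once per element. It assumes that `Group` accepts a vector of stats (as g2 above does) and that the collection is reachable through the `stats` field; the names `shared`, `separate`, and `g` are illustrative only.

```julia
using OnlineStats, Statistics

mat = rand(100, 4)
n = size(mat, 2)                   # number of columns, known only at run time

shared   = fill(Mean(), n)         # n references to ONE Mean object
separate = [Mean() for _ in 1:n]   # n distinct Mean objects

shared[1] === shared[2]            # true: same object
separate[1] === separate[2]        # false: independent objects

# Group built from independent stats: one Mean per column
g = Group(separate)
fit!(g, eachrow(mat))

# Assumption: the stats are stored in the `stats` field of the Group
value.(g.stats) ≈ vec(mean(mat, dims=1))
```

This gives the same per-column behavior as n * Mean() without relying on the multiplication shorthand.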

It might be worth adding an example of this construction method to the docs for Group.

joshday commented 3 months ago

I actually thought I had removed the n * stat syntax to create a Group. I know that I meant to, which is why it isn't in the docstring.

I will, however, add an example of passing a collection to Group.

DarioSarra commented 3 months ago

If you are planning to remove this method, note that the example that led me to use it is in the docs section Details of Updating (fit!).

This might not be the right place to ask a question, but I think it's related. My final goal is to compute N separate Means after vectorizing a CircularBuffer matrix, and to update the means after a certain number of steps of the circular buffer. However, this doesn't seem to work either:

using OnlineStats, LinearAlgebra, DataStructures

mat = rand(10,5)
resh = reshape(mat, (size(mat,1) * size(mat,2), 1))
g = length(resh) * Mean()
fit!(g, LinearAlgebra.eachrow(resh))

In this example, each Mean() ends up taking 50 inputs instead of 1.
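
One possible reading of the goal, offered only as a sketch: if each of the 50 Means should receive exactly one value per update, then the whole vectorized matrix is a single observation for the Group, rather than 50 one-element rows. The code below assumes that interpretation, leaves out the circular buffer entirely, and builds the Group from a comprehension instead of the length(resh) * Mean() shorthand. The names `v` and `g` are illustrative, and access through the `stats` field is an assumption about the Group layout.

```julia
using OnlineStats

mat = rand(10, 5)
v = vec(mat)                              # length-50 vector, column-major order

# One independent Mean per element of the vectorized matrix
g = Group([Mean() for _ in 1:length(v)])

# A single AbstractVector is treated as ONE observation of the Group,
# so each Mean is updated once with its own element of v
fit!(g, v)

# Assumption: the stats are stored in the `stats` field of the Group
nobs.(g.stats)                            # each Mean has seen 1 value
```

Calling fit!(g, vec(new_mat)) again with a later snapshot of the same size would add one more observation to each Mean.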

joshday commented 3 months ago

Ah, thanks for the pointer to that example.


I would expect the fit! in your example to be an error. I'm not entirely sure what you're trying to do there, but if there's a bug, let's put it in a new issue.