johnmyleswhite / StreamStats.jl

Compute statistics over data streams in pure Julia

Notice: StreamStats.jl has been deprecated in favor of OnlineStats.jl. OnlineStats has a superset of the features available in StreamStats and development is active.



Compute statistics from a stream of data. Useful when:

Example Usage

Every statistic is constructed as a mutable object that updates state with each new observation:

using StreamStats

var_x = StreamStats.Var()
var_y = StreamStats.Var()
cov_xy = StreamStats.Cov()

xs = randn(10)
ys = 3.1 * xs + randn(10)

for (x, y) in zip(xs, ys)
    update!(var_x, x)
    update!(var_y, y)
    update!(cov_xy, x, y)
    @printf("Estimated covariance: %f\n", state(cov_xy))
end

state(var_x), var(var_x), std(var_x)
state(cov_xy), cov(cov_xy), cor(cov_xy)

As you can see, you update a statistic with the update! function and extract the current estimate with the state function, or with a statistic-specific accessor such as var, std, cov, or cor.
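
To make the update!/state pattern concrete, here is a minimal standalone sketch of a streaming variance, written for Julia 1.x. It does not use StreamStats and is not the package's implementation: the RunningVar type and the local update! and state functions are hypothetical names chosen to mirror the interface, and the update rule is Welford's one-pass algorithm, assumed here purely for illustration.

```julia
# Hypothetical illustration of the update!/state pattern.
# This is NOT StreamStats.jl's code; it is a self-contained sketch.
mutable struct RunningVar
    n::Int          # number of observations seen so far
    mean::Float64   # running mean
    m2::Float64     # running sum of squared deviations from the mean
end

RunningVar() = RunningVar(0, 0.0, 0.0)

# Fold one observation into the running state (Welford's update).
function update!(s::RunningVar, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    return s
end

# Current unbiased variance estimate; NaN until two observations have arrived.
state(s::RunningVar) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

s = RunningVar()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    update!(s, x)
end
println(state(s))  # 32/7, roughly 4.571
```

The object keeps only three numbers regardless of how many observations it has seen, which is what makes this style of statistic suitable for streams.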

Available Statistics

Available Bivariate Statistics

Available Multivariate Statistics


It is also possible to estimate confidence intervals for online statistics using online bootstrap methods:

using StreamStats

stat = StreamStats.Cov()
ci1 = StreamStats.BootstrapBernoulli(stat, 1_000, 0.05)
ci2 = StreamStats.BootstrapPoisson(stat, 1_000, 0.05)

xs = randn(100)
ys = randn(100)

for (x, y) in zip(xs, ys)
    update!(stat, x, y)
    update!(ci1, x, y)
    update!(ci2, x, y)
end

state(stat), state(ci1), state(ci2)

Given any other statistic object, you can use the BootstrapBernoulli or BootstrapPoisson types to estimate a confidence interval. Both types require the number of bootstrap replicates (1_000 in the example above) and the error rate for the nominal coverage of the confidence interval (0.05 in the example above, i.e. a nominal 95% interval).
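
As a rough picture of how an online Poisson bootstrap can work, the standalone sketch below (Julia 1.x) keeps one running mean per replicate and feeds each new observation to each replicate a Poisson(1)-distributed number of times. The interval is then read off the quantiles of the replicate estimates. Everything here is an assumption made for illustration: RunningMean, PoissonBootstrapSketch, and rand_poisson1 are hypothetical names, the statistic is hard-coded to a mean rather than passed in, and StreamStats' BootstrapPoisson may differ in its details.

```julia
using Random
using Statistics

# Hypothetical running mean used as the bootstrapped statistic.
mutable struct RunningMean
    n::Int
    mean::Float64
end
RunningMean() = RunningMean(0, 0.0)

function update!(s::RunningMean, x::Real)
    s.n += 1
    s.mean += (x - s.mean) / s.n
    return s
end
state(s::RunningMean) = s.mean

# Draw from Poisson(1) with Knuth's multiplication method.
function rand_poisson1(rng::AbstractRNG)
    limit = exp(-1.0)
    k = 0
    p = rand(rng)
    while p > limit
        k += 1
        p *= rand(rng)
    end
    return k
end

# Hypothetical sketch type, not the StreamStats BootstrapPoisson type.
struct PoissonBootstrapSketch
    replicates::Vector{RunningMean}
    alpha::Float64
    rng::MersenneTwister
end

PoissonBootstrapSketch(r::Int, alpha::Float64; seed::Int = 1) =
    PoissonBootstrapSketch([RunningMean() for _ in 1:r], alpha, MersenneTwister(seed))

# Each replicate sees the observation a Poisson(1) number of times.
function update!(b::PoissonBootstrapSketch, x::Real)
    for rep in b.replicates
        for _ in 1:rand_poisson1(b.rng)
            update!(rep, x)
        end
    end
    return b
end

# Percentile interval over replicates that have received at least one observation.
function state(b::PoissonBootstrapSketch)
    estimates = [state(rep) for rep in b.replicates if rep.n > 0]
    return (quantile(estimates, b.alpha / 2), quantile(estimates, 1 - b.alpha / 2))
end

ci = PoissonBootstrapSketch(1_000, 0.05)
for x in randn(MersenneTwister(42), 200)
    update!(ci, x)
end
lower, upper = state(ci)
println((lower, upper))
```

Poisson(1) weights approximate resampling with replacement without needing to know the stream length in advance, which is why this variant suits online use.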


The code for computing moments from a stream is derived from John D. Cook's code for computing the skewness and kurtosis of a data stream online.
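
For readers unfamiliar with that approach, the sketch below shows the widely used one-pass update for the first four central moments, from which skewness and excess kurtosis follow. It is a standalone Julia 1.x illustration, not the StreamStats source: the RunningMoments type and the local update!, skewness, and kurtosis functions are hypothetical names.

```julia
# Hypothetical one-pass accumulator for the first four central moments.
# A sketch of the standard update formulas, not StreamStats.jl's code.
mutable struct RunningMoments
    n::Int
    m1::Float64   # running mean
    m2::Float64   # sum of squared deviations from the mean
    m3::Float64   # sum of cubed deviations from the mean
    m4::Float64   # sum of fourth-power deviations from the mean
end

RunningMoments() = RunningMoments(0, 0.0, 0.0, 0.0, 0.0)

function update!(s::RunningMoments, x::Real)
    n1 = s.n
    s.n += 1
    n = s.n
    delta = x - s.m1
    delta_n = delta / n
    delta_n2 = delta_n^2
    term1 = delta * delta_n * n1
    s.m1 += delta_n
    # m4 and m3 must be updated before m2, since they use the old m2 and m3.
    s.m4 += term1 * delta_n2 * (n * n - 3n + 3) + 6 * delta_n2 * s.m2 - 4 * delta_n * s.m3
    s.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * s.m2
    s.m2 += term1
    return s
end

# Sample skewness and excess kurtosis (population-style normalisation).
skewness(s::RunningMoments) = sqrt(s.n) * s.m3 / s.m2^1.5
kurtosis(s::RunningMoments) = s.n * s.m4 / (s.m2 * s.m2) - 3.0

s = RunningMoments()
for x in [1.0, 2.0, 3.0, 4.0, 5.0]
    update!(s, x)
end
println((skewness(s), kurtosis(s)))  # symmetric data: skewness near 0, excess kurtosis near -1.3
```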