johnmyleswhite / StreamStats.jl

Compute statistics over data streams in pure Julia

Merge StreamStats.jl into OnlineStats.jl? #24

Open tbreloff opened 9 years ago

tbreloff commented 9 years ago

It's my understanding that this repo has been inactive for a while and that the functionality could be merged into OnlineStats.jl, which has lots of awesome online algorithms thanks to Josh Day's PhD work (and it should probably be included in JuliaStats).

John (or anyone else willing to help): assuming this makes sense to you, what are the steps required for a merge? What exists here that doesn't exist in OnlineStats? Do you need/want help with any part of it?

johnmyleswhite commented 9 years ago

Hi Tom,

Nice meeting you at JuliaCon. I think the main things we'd need to merge are the HyperLogLog implementation, the online bootstrap and the AdaGrad-based SGD OLS/logit code. I'd love help porting all of that.

I'd also like to move OnlineStats.jl into JuliaStats so that we can all collaborate more effectively.

tbreloff commented 9 years ago

Nice meeting you too! (and I'm impressed by the immediate response). When I have a few minutes to dive into your code again, I can try to make a first pass merge. @joshday may be able to help as well, since he'll probably have better knowledge about the implementation details.

I'd love for the package to get more visibility in the hope that others can help add online variations to lots of algorithms. Also, as a side note, Josh and I discussed your desire to allow various user-defined weightings, and that should be pretty straightforward given the design... equal, exponential, and stochastic weightings are already supported.
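To make the three weightings concrete, here is a minimal sketch of one way they could drive a running mean. It is an illustration under assumptions, not the OnlineStats implementation: the function names (`equal_weight`, `exponential_weight`, `stochastic_weight`, `weighted_mean`) are hypothetical, and so are the specific formulas (1/n for equal, a constant `lambda` for exponential, n^(-r) for stochastic), which are one common reading of those terms.

```julia
# Hypothetical weight rules; names and formulas are illustrative assumptions,
# not the OnlineStats types. Each rule maps the observation count n to a weight.
equal_weight() = n -> 1 / n                                  # all observations count equally
exponential_weight(lambda) = n -> (n == 1 ? 1.0 : lambda)    # constant weight, recent data dominates
stochastic_weight(r) = n -> float(n)^(-r)                    # decays more slowly than 1/n when r < 1

# Single pass over xs, updating mu <- (1 - w) * mu + w * x with w = weight(n).
function weighted_mean(xs, weight)
    mu = 0.0
    for (n, x) in enumerate(xs)
        w = weight(n)
        mu = (1 - w) * mu + w * x
    end
    return mu
end

xs = [1.0, 2.0, 3.0, 4.0]
m_equal = weighted_mean(xs, equal_weight())            # ordinary arithmetic mean
m_exp   = weighted_mean(xs, exponential_weight(0.5))   # pulled toward the latest values
m_stoch = weighted_mean(xs, stochastic_weight(0.7))
```

Because the update rule only sees the weight, a user-defined weighting in this sketch is just another function of `n`.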

johnmyleswhite commented 9 years ago

I think the really awesome thing here would be showing how to use this stuff with the aggregation UDFs in SQLite that Jacob Quinn's been working on.

tbreloff commented 9 years ago

Agreed, and on that note, I was brainstorming yesterday (thanks to comments from @quinnj) on how to build an interface to this stuff that makes it easier to compose/chain various online stats and data flows without explicitly updating each object. This is very similar to Reactive, which might be the right package to use. It would be nice to write something like:

demeaned = x - Mean(x)

Here x is some stream in a Reactive pipeline, and Mean(x) implicitly produces an OnlineStat which lifts from the stream. Now every time there's a new data point, Mean(x) is updated, which is then lifted to update demeaned, all without calling update!. This may be trickier than expected considering all the different ways data can come in, but it could drive some really nice data analytics.
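A rough sketch of that idea, written without Reactive so that it is self-contained. Everything here is hypothetical: `Stream`, `push_value!`, `lift`, and `running_mean` are illustrative names and not the Reactive.jl or OnlineStats API. The sketch also assumes listeners fire in registration order, which a real pipeline would have to guarantee some other way.

```julia
# Hypothetical minimal push-based stream; this is not the Reactive.jl API.
mutable struct Stream
    value::Float64
    listeners::Vector{Function}
end
Stream() = Stream(NaN, Function[])

# Store a new value and notify every dependent stream.
function push_value!(s::Stream, x::Real)
    s.value = x
    for f in s.listeners
        f(s.value)
    end
    return s
end

# Derive a new stream by applying f to each value that arrives on src.
function lift(f, src::Stream)
    out = Stream()
    push!(src.listeners, x -> push_value!(out, f(x)))
    return out
end

# Running mean as a derived stream; the state lives in the closure.
function running_mean(src::Stream)
    n = 0
    mu = 0.0
    return lift(src) do x
        n += 1
        mu += (x - mu) / n
        mu
    end
end

x = Stream()
m = running_mean(x)                    # registered first, so it updates first
demeaned = lift(v -> v - m.value, x)   # reads the already-updated mean

for v in (1.0, 2.0, 6.0)
    push_value!(x, v)                  # no explicit update! call on m or demeaned
end
```

After the three pushes, `m.value` holds the mean of the values seen so far and `demeaned.value` holds the last value minus that mean.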

joshday commented 9 years ago

I'm happy to help to get things rolling. Sounds like I'll need to look a bit more into SQLite and Reactive as well. I like where this is headed.

OnlineStats issue for adding StreamStats functionality

joshday commented 8 years ago

@johnmyleswhite After seeing some of the comments on the PRs here, would you be receptive to a PR that adds a note in the README which directs people to OnlineStats.jl?

johnmyleswhite commented 8 years ago

Yes, I definitely would be open to that.

joshday commented 8 years ago

Excellent. Thanks!

meggart commented 8 years ago

I am trying to move my code from StreamStats to OnlineStats. StreamStats overloads Base.merge to merge two StreamStats objects, which is very useful for parallel applications. Is there similar functionality in OnlineStats?
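For context, the pattern being asked about looks roughly like the sketch below: each worker fits a statistic on its own chunk, and `Base.merge` combines the partial results. The `RunningVar` type and the `update`, `variance`, and `fit_chunk` helpers are hypothetical stand-ins, not the StreamStats or OnlineStats types; the merge step uses the standard pairwise formula for combining means and sums of squared deviations.

```julia
# Hypothetical accumulator for mean and variance (not a StreamStats/OnlineStats type).
struct RunningVar
    n::Int        # number of observations
    mu::Float64   # running mean
    m2::Float64   # sum of squared deviations from the mean
end
RunningVar() = RunningVar(0, 0.0, 0.0)

# Welford update with a single observation; returns a new accumulator.
function update(s::RunningVar, x::Real)
    n = s.n + 1
    delta = x - s.mu
    mu = s.mu + delta / n
    m2 = s.m2 + delta * (x - mu)
    return RunningVar(n, mu, m2)
end

# Combine two accumulators that were fit on disjoint chunks of data.
function Base.merge(a::RunningVar, b::RunningVar)
    n = a.n + b.n
    n == 0 && return RunningVar()
    delta = b.mu - a.mu
    mu = a.mu + delta * b.n / n
    m2 = a.m2 + b.m2 + delta^2 * a.n * b.n / n
    return RunningVar(n, mu, m2)
end

variance(s::RunningVar) = s.m2 / (s.n - 1)   # sample variance

fit_chunk(xs) = foldl(update, xs; init = RunningVar())

data = collect(1.0:10.0)
whole = fit_chunk(data)                                    # single pass over everything
parts = merge(fit_chunk(data[1:4]), fit_chunk(data[5:10])) # two chunks, then merge
```

In a parallel setting the two `fit_chunk` calls would run on different workers; the merged result should agree with the single-pass result up to floating-point error.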

joshday commented 8 years ago

There used to be merge methods for most OnlineStat types. They've been temporarily lost in some rewrites. I'll make it a priority to get them back in.

meggart commented 8 years ago

Great, thanks for your work.