johnmyleswhite / StreamStats.jl

Compute statistics over data streams in pure Julia
Other
48 stars 7 forks source link

[WIP] Nuke state methods #12

Closed eytan closed 9 years ago

eytan commented 9 years ago

There are a number of online stats objects that have multiple states (or derivations of states) that are useful. This is a proposed change in which we use standard Julia method names for certain functionality (e.g., std(), mean(), var(), maximum()), and have object names that correspond to the statistical concepts (e.g., an online Variance object, from which you can calculate variances, means, and standard deviations).

Other notable changes include:

I still have yet to port over the moments, and I don’t understand the other stats well enough to know what the canonical method names should be.

StefanKarpinski commented 9 years ago

Sticking function objects inside of things strikes me as a bit of a code smell – it's not a very Julian way to do things. It would make more sense to use subtyping and allow extension via generic methods.

This patch combines a significant functional change with a number of orthogonal and unrelated cosmetic changes, which makes it much harder to review those changes separately. I suspect it will be much more likely to get merged if you separate the changes into separate pull requests.

johnmyleswhite commented 9 years ago

Sticking function objects inside of things strikes me as a bit of a code smell – it's not a very Julian way to do things. It would make more sense to use subtyping and allow extension via generic methods.

@StefanKarpinski: The problem is that there isn't any good generic function here. You can use a function like state, but that's problematic because (a) you end up having to reproduce a lot of code (e.g. you're forced to have both a Cov and a Cor object, although one is perfectly derived from the other) and (b) you end up having to precommit to the "true state" associated with an object like the bootstrap stat, which really is something that provides lots of distinct methods.

All that said, I agree that having a function inside an object is very un-Julian. But I don't see any better to do this.

There's also another design problem here I don't think we can solve cleanly: Julia doesn't yet provide the ability to parameterize bootstrap in such a way that you can know what the type of the state of its inner stat is. You want to have Bootstrap{Mean} contain a field that stores lots of means, but you want to have Bootstrap{ApproxOLS} contain an array of coefficient vectors.

This patch combines a significant functional change with a number of orthogonal and unrelated cosmetic changes, which makes it much harder to review those changes separately. I suspect it will be much more likely to get merged if you separate the changes into separate pull requests.

@eytan is on my team at work, which pretty radically increases the odds I merge one of this pull requests.

StefanKarpinski commented 9 years ago

@eytan is on my team at work, which pretty radically increases the odds I merge one of this pull requests.

The ability to walk over and bug someone in person does tend to have that effect :-)

johnmyleswhite commented 9 years ago

As @StefanKarpinski noted, there's a problem with putting a function inside of the bootstrap objects. This makes the function a total black box to type inference, which means that performance is going to go down. I think we need to add some benchmarks here before we can safely say how much performance is going to decrease.

eytan commented 9 years ago

Sorry guys, I should have provided more context -- John and I are building some software on top of StreamStats and we have been hitting some limitations / awkwardness, especially with respect to the bootstrap. This change was an idea we had been passing around and I just wanted to give it a shot. If this seems like a terrible idea we can scrap it.

Re: embedding a function object: an advantage of the custom function is that it allows you to bootstrap transformations of primitive Stats objects. I see how this could be problematic for performance though, so we should definitely benchmark things before going forward with this...

StefanKarpinski commented 9 years ago

The type could be parameterized on the return-type of the bootstrap functions and then you can use that parameter for annotating the return type when the function is called. That should get most of the performance back at the cost of complicating the type a bit.

johnmyleswhite commented 9 years ago

Let's chat today and resolve things. I like a lot of the changes in this PR, but it needs to match all the features in #16 to put us in a good state.