Code cleanup - Githubissues

tbreloff commented 9 years ago

There's a bunch of places where there's unnecessary parametric types/signatures and wasteful work . For example... Base.push!{T <: ScalarStat}(df::DataFrame, obj::T) = append!(df, DataFrame(obj)) might be better as: Base.push!(df::DataFrame, obj::ScalarStat) = push!(df, value(obj))

Note that:

No need for a parametric method here... T is never used. It's no slower to give the type as obj::ScalarStat.. it's still compiled to the same method (i.e. Base.push!(df::DataFrame, obj::Var) = ...) You only want the parametric if you need T in the RHS of the assignment or if you need types to match in the signature (i.e. add{T<:Integer}(x::T, y::T) = ... will make sure x and y are the same type of integer) or if you want an abstract as the parameter of another type (i.e. Base.push!{T<:Integer}(df::DataFrame, obj::Vector{T}) = ... is appropriate)
It's inefficient to create a new one-row DataFrame just to then merge it with an existing DataFrame. You can push!() any iterable of row values to a DataFrame and it will just push!()` each of those values to the columns.

I'll make changes like these as I see them... If you have any questions on why I made a certain change, don't hesitate to talk it through here.

joshday commented 9 years ago

Ah, there is probably a few places where I do that needlessly. I'll fix them when I see it.
A ScalarStat doesn't necessarily make a one-row DataFrame (only Mean does).
- I use ScalarStat subtypes for "possibly multiple parameters that are all scalars". For instance, Moments tracks mean, variance, skewness, and kurtosis.

It looks like push!(df, array) only works if array is a single row. If this is more efficient than the other version, I'm all for it:

function Base.push!(df::DataFrame, obj::ScalarStat)
    st = state(obj)
    mat = [state_names(obj) st fill(nobs(obj), length(st))]
    for i in 1:length(st)
        push!(df, mat[i, :])
    end
end

tbreloff commented 9 years ago

i just pushed up an implementation that we should discuss. Look at the Mean/Var types, as well as common.jl, weighting.jl, and the mean/var tests.

Note: now that update!(o, y::Vector) just loops through and calls update!(o, f::Float64) for each v in vector, you can't test for equality in some places... need to test for approximate equality. If the "update one at a time" model doesn't have enough numerical stability, we can still implement batch versions using lines like: o.mean = smooth(o.mean, mean(y), weight(o))

joshday commented 9 years ago

Looks great. I think approximate equality will do for now. We can try adding some stability tests. A few random comments below.

I've added this to weighting.jl (I'm working in 0.3.7, which uses float64 instead of Float64. You must be using 0.4.0?

if VERSION < v"0.4.0-dev"
    weight(w::EqualWeighting, n1::Int, n2::Int) = n1 > 0 || n2 > 0 ? float64(n2 / (n1 + n2)) : 1.0
else
    weight(w::EqualWeighting, n1::Int, n2::Int) = n1 > 0 || n2 > 0 ? Float64(n2 / (n1 + n2)) : 1.0
end

My original intention for DataFrame(o) was to make something easy to plot, but I think your version is more expected.

df_A (for calling Gadfly.plot(df_A, x=:nobs, y=:value, color=:variable))

# variable      |   value    |  nobs
# ---------------------------------
# μ             |   1.0      | 5 
# σ²            |   2.0      | 5
# μ             |   1.5      | 10
# σ²            |   2.1      | 10
# ...

vs. df_B

# μ    |   σ²   |  nobs
# ---------------------
# 1.0  |   2.0  |  5
# 1.5  |   2.1  |  10
# 1.6  |   2.2  |  15

I think version B is more expected behavior and version A is one step away (df_A = melt(df_B, :nobs)), so df_B is what we should go with.

tbreloff commented 9 years ago

In regards to the julia version. Yes I'm using version 0.4.0-dev+4311 at the moment, which is about 11 days old. I think the correct way to handle differences between 0.3 and 0.4 is to use the Compat package. Correct syntax is something like @compat weight(w::EqualWeighting, n1::Int, n2::Int) = n1 > 0 || n2 > 0 ? Float64(n2 / (n1 + n2)) : 1.0 which would automatically do the correct conversion for 0.3.7. I haven't actually had to use Compat yet, but we should probably try to get in good habits. I'll look into it more. The good news is that we can test each others code so it'll naturally work for anyone else that needs to use it, and it'll work for you if you want to switch to 0.4 at some point. :)

joshday commented 9 years ago

I was unaware of Compat. That looks much more slick! I'll try it out.

tbreloff commented 9 years ago

I just checked the github page for compat... it looks like correct syntax might be: weight(w::EqualWeighting, n1::Int, n2::Int) = n1 > 0 || n2 > 0 ? @compat Float64(n2 / (n1 + n2)) : 1.0

joshday commented 9 years ago

Cool, trying it now.

joshday commented 9 years ago

Julia didn't like that. This seems to work:

@compat weight(w::EqualWeighting, n1::Int, n2::Int)...

tbreloff commented 9 years ago

I wonder if @compat(Float64(x)) would work, using the parenthetical version of the macro. Either way if it works correctly at the beginning of the line I think that's easier to read.

On Apr 28, 2015, at 2:27 PM, Josh Day notifications@github.com wrote:

Julia didn't like that. This seems to work:

@compat weight(w::EqualWeighting, n1::Int, n2::Int)... — Reply to this email directly or view it on GitHub.

tbreloff commented 9 years ago

Hey Josh... I was just looking through the commit history to see what you've been doing and prepare for a merge... it looks like you picked up most/all of my recent changes, but gitk doesn't show a merge. Are you calling git merge origin/tom or something similar to bring in my changes, or are you adding them manually? Just want to know if I'm seeing the correct history.

I've been writing out (on actual paper!) the update! function for the FLS algo, and wanted to sync up code before I start implementing.

joshday commented 9 years ago

I've been doing git checkout tom -- path/to/file. What is the preferred way?

tbreloff commented 9 years ago

Ok that makes sense. I think that method is ok for occasional, one-off, small changes... but normally you want to do a git merge to actually pull in my changes in a systematic way without overwriting the whole file. (If you edited it and I never picked up your change, your change would be overwritten)

The flow that you could try:

git commit -am "pre-merge commit"
git fetch
git diff origin/tom
git merge origin/tom
<fix conflicts and commit>
git push

First you commit any local changes you made with git commit
Next you git fetch to pull the branches from github into your local repo. You'll (likely) have remote branches origin/master, origin/josh, origin/tom, etc
Next git diff origin/tom will give you a report of how your currently checked out local branch compares to my branch that I've pushed up to github
Next `git merge origin/tom' will automatically combine all changes in your local history and in my branch since the last time we merged
If there are any conflicts (i.e. we changed the same code in a file), then that file will have sections demarcated by <============ and similar which will wrap the differences from both branches. You have to go through the files and delete/change whatever you need to, then save the file. (this is called fixing merge conflicts) After you're done fixing all the conflicts, do another git commit -am "fixed merge"
Finally you can git push to send your branch to github

Every once in a while when the code is stable and just after you've done the merge (so that you have all of our changes), you can:

git checkout master
git merge josh
<shouldn't be conflicts?>
git push
git checkout josh

and this should set master to be a snapshot of your branch at that moment (of course assuming you've merged in all changes to that point)

Hopefully this is good enough for 99% of cases... let me know if you have any questions. It might seem complicated right now, but it shouldn't be too bad if we're mostly working on different things.

joshday commented 9 years ago

That is super helpful. Seriously, thanks. That will be my new workflow.

tbreloff commented 9 years ago

Sure thing. I just merged in your latest, and had a bunch of conflicts (many of which weren't really conflicts but just a by-product of overwriting whole files). I fixed them, mostly by taking your version. Do you want to do a merge (as described above)? I'll recreate the tom branch from that commit.

joshday commented 9 years ago

I'll do that as soon as I'm back at my computer (probably 2 hours)

Sent from my iPhone

On Apr 28, 2015, at 5:49 PM, Tom Breloff notifications@github.com wrote:

Sure thing. I just merged in your latest, and had a bunch of conflicts (many of which weren't really conflicts but just a by-product of overwriting whole files). I fixed them, mostly by taking your version. Do you want to do a merge (as described above)? I'll recreate the tom branch from that commit.

— Reply to this email directly or view it on GitHub.

joshday commented 9 years ago

I merged your branch into mine and then mine into master.

joshday / OnlineStats.jl

Code cleanup #6