Reorganization/rewrite - Githubissues

joshday commented 8 years ago

I started toying around with changes to OnlineStats in a separate package, https://github.com/joshday/OnlineStatistics.jl. (It exists as a separate package only to run comparisons between the old and the new. It is not, nor will it be, a replacement.) What was meant to be just a few performance tests turned into a major rewrite (details on the changes are below). I've managed some performance improvements, some marginal and some orders of magnitude:

julia> include("test/performance.jl")
WARNING: replacing module Performance

  =======================================
  Performance on 10 million observations
  =======================================

                Mean new :  0.040254 seconds (5 allocations: 192 bytes)
                Mean old :  0.042746 seconds (5 allocations: 192 bytes)

        Mean (batch) new :  0.005795 seconds (6 allocations: 224 bytes)
        Mean (batch) old :  0.005977 seconds (6 allocations: 224 bytes)

            Variance new :  0.042452 seconds (5 allocations: 208 bytes)
            Variance old :  0.060829 seconds (5 allocations: 192 bytes)

             Extrema new :  0.049575 seconds (4 allocations: 160 bytes)
             Extrema old :  0.051199 seconds (4 allocations: 160 bytes)

         QuantileSGD new :  0.618786 seconds (8 allocations: 448 bytes)
         QuantileSGD old :  1.510063 seconds (94 allocations: 76.310 MB, 0.25% gc time)

          QuantileMM new :  0.772390 seconds (10 allocations: 656 bytes)
          QuantileMM old :  1.720187 seconds (96 allocations: 76.310 MB, 1.33% gc time)

             Moments new :  0.099977 seconds (6 allocations: 288 bytes)
             Moments old :  0.071233 seconds (5 allocations: 208 bytes)

  ============================================
  Performance on .2 million × 500 observations
  ============================================

               Means new :  0.010688 seconds (17 allocations: 2.469 KB)
        Means old (VERY SLOW) :

       Means (batch) new :  0.009797 seconds (16 allocations: 1.578 KB)
       Means (batch) old :  0.009701 seconds (17 allocations: 1.609 KB)

           Variances new :  0.041235 seconds (40 allocations: 7.766 KB)
    Variances old (VERY SLOW) :

   Variances (batch) new :  0.042492 seconds (37 allocations: 5.109 KB)
   Variances (batch) old :  0.037979 seconds (38 allocations: 5.156 KB)

           CovMatrix new :  2.819448 seconds (200.00 k allocations: 9.155 MB)
    CovMatrix old (VERY SLOW) :

   CovMatrix (batch) new :  0.072263 seconds (23 allocations: 237.063 KB)
   CovMatrix (batch) old :  0.065741 seconds (19 allocations: 80.625 KB)

  ===========================================
  Performance on 1 million × 5 design matrix
  ===========================================

              LinReg new :  0.033935 seconds (35 allocations: 45.779 MB, 9.72% gc time)
              LinReg old :  0.053115 seconds (37 allocations: 45.779 MB, 43.14% gc time)
           SparseReg old :  0.092843 seconds (33 allocations: 45.779 MB, 71.34% gc time)

Changes:

- Remove state(o) and statenames(o); replace them with value(o)
  - value(o) returns only the statistic. Nonessential information (e.g. the Vector of quantiles) should show up in Base.show methods
- Change Weighting to Weight, EqualWeighting to EqualWeight, etc.
  - Add the field nup (number of updates) to each OnlineStat, which is useful for LearningRate or similar stochastic weighting mechanisms to be added in the future
  - weight!(o, n2) updates the n and nup fields and returns the weight. Take a look at https://github.com/joshday/OnlineStatistics.jl/blob/master/src/summary.jl to see how this changes update! methods
- Change update! to StatsBase.fit!
- Faster and easier-to-understand sweep! operator, with an additional method that takes a placeholder vector to avoid gc. sweep! now also stores the upper triangular matrix rather than the lower.
- Rather than giving each Distribution its own type, all distribution fitting can be done through the FitDistribution and FitMvDistribution types
- Remove SparseReg; its functionality (coefficients from penalized likelihood) is now handled by LinReg
- Cleanup
  - I had too many files floating around. For example, everything in summary/ is now in one file, summary.jl.
- Rename StochasticModel to StatLearn. I think hinting at statistical learning is better than "algorithms based on a stochastic subgradient". If anyone has a better name for something that incorporates SVMs and linear, logistic, Poisson, Huber, L1 loss, and quantile regression, I'm all ears.
- Probably other things I forgot
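
For readers unfamiliar with the sweep operator mentioned in the list, here is a minimal sketch of a symmetric sweep that reads and writes only the upper triangle and accepts a preallocated work vector. It is not the package's implementation: the name `sweep_upper!`, the argument order, and the sign convention (the pivot becomes `-1/d`) are assumptions made for illustration, and it is written for current Julia rather than the version in use when this issue was opened.

```julia
# Hypothetical sketch, not the OnlineStats source.
# Sweeps the symmetric matrix stored in the upper triangle of A on pivot k.
# v is a work vector of length size(A, 1); passing one in avoids allocating per call.
function sweep_upper!(A::AbstractMatrix{Float64}, k::Integer,
                      v::Vector{Float64} = zeros(size(A, 1)))
    p = size(A, 1)
    d = A[k, k]
    # Copy row/column k of the symmetric matrix out of the upper triangle.
    for i in 1:k
        v[i] = A[i, k]
    end
    for j in (k + 1):p
        v[j] = A[k, j]
    end
    # Rank-one update of the upper triangle: A[i, j] -= A[i, k] * A[k, j] / d.
    for j in 1:p, i in 1:j
        A[i, j] -= v[i] * v[j] / d
    end
    # Overwrite the pivot row and column, then the pivot itself.
    for i in 1:(k - 1)
        A[i, k] = v[i] / d
    end
    for j in (k + 1):p
        A[k, j] = v[j] / d
    end
    A[k, k] = -1 / d
    return A
end

# Usage: regress y on X by sweeping the cross-product matrix of [X y].
X = [1.0 1.0; 1.0 2.0; 1.0 3.0]
y = [5.0, 8.0, 11.0]          # exactly 2 + 3x, so the residual sum of squares is 0
Z = [X y]
A = Z' * Z
work = zeros(size(A, 1))      # reused across both sweeps
sweep_upper!(A, 1, work)
sweep_upper!(A, 2, work)
coefficients = A[1:2, 3]      # last column of the upper triangle holds the coefficients
rss = A[3, 3]                 # bottom-right entry holds the residual sum of squares
```

Under this convention the top-left block of the upper triangle ends up holding the negated inverse of X'X. Entries below the diagonal are never touched, so they stay stale and should not be read.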

I wanted to get this out in the open before I start moving things over to OnlineStats. The change with the biggest impact is how weights are handled.
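
To make the weight handling concrete, here is a rough sketch of how a running mean might look under the scheme described in the list of changes: a `Weight` type, `n` and `nup` fields, a `weight!(o, n2)` that updates both counters and returns the weight, and `value(o)` returning only the statistic. The type `MeanSketch`, its field layout, and the local `fit!` are assumptions for illustration (the real package extends `StatsBase.fit!`; see the linked summary.jl for the actual code).

```julia
# Hypothetical sketch, not the OnlineStats source. Written for current Julia.
abstract type Weight end

# Every observation counts equally, so a batch of n2 observations gets weight n2 / n.
struct EqualWeight <: Weight end

mutable struct MeanSketch{W <: Weight}
    value::Float64   # current estimate of the mean
    weight::W
    n::Int           # number of observations seen
    nup::Int         # number of updates (calls to fit!)
end
MeanSketch(w::Weight = EqualWeight()) = MeanSketch(0.0, w, 0, 0)

# Update the n and nup fields, then return the weight for the incoming n2 observations.
function weight!(o::MeanSketch{EqualWeight}, n2::Int)
    o.n += n2
    o.nup += 1
    return n2 / o.n
end

# Single observation: move the estimate toward y by the returned weight.
function fit!(o::MeanSketch, y::Real)
    γ = weight!(o, 1)
    o.value += γ * (y - o.value)
    return o
end

# Batch of observations: one update, weighted by the batch size.
function fit!(o::MeanSketch, y::AbstractVector{<:Real})
    γ = weight!(o, length(y))
    o.value += γ * (sum(y) / length(y) - o.value)
    return o
end

# Return only the statistic itself.
value(o::MeanSketch) = o.value

# Usage: three single updates followed by one batch update.
o = MeanSketch()
for y in [1.0, 2.0, 3.0]
    fit!(o, y)
end
fit!(o, [4.0, 5.0])
```

Keeping `nup` separate from `n` matters for batch updates: after the calls above, five observations have been seen but only four updates have happened, and a learning-rate style weight would presumably depend on the latter.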

tbreloff commented 8 years ago

It'll take me a little while to go through this in detail, but just based on your summary: :+1:

joshday commented 8 years ago

These changes are now in master. New docs coming soon.

joshday commented 8 years ago

Changes are now in METADATA. Docs mostly moved to README to make them easier to maintain.

joshday / OnlineStats.jl

Reorganization/rewrite #48
