joshday / OnlineStats.jl

⚡ Single-pass algorithms for statistics
https://joshday.github.io/OnlineStats.jl/latest/
MIT License

Reorganization/rewrite #48

Closed by joshday 8 years ago

joshday commented 8 years ago

I started toying around with changes to OnlineStats in a separate package, https://github.com/joshday/OnlineStatistics.jl. It exists as a separate package only so that I can run comparisons between the old and the new code; it is not, nor will it be, a replacement. What was meant to be just a few performance tests turned into a major rewrite (details on changes below). I've managed some performance improvements (some marginal, some orders of magnitude):

julia> include("test/performance.jl")
WARNING: replacing module Performance

  =======================================
  Performance on 10 million observations
  =======================================

                Mean new :  0.040254 seconds (5 allocations: 192 bytes)
                Mean old :  0.042746 seconds (5 allocations: 192 bytes)

        Mean (batch) new :  0.005795 seconds (6 allocations: 224 bytes)
        Mean (batch) old :  0.005977 seconds (6 allocations: 224 bytes)

            Variance new :  0.042452 seconds (5 allocations: 208 bytes)
            Variance old :  0.060829 seconds (5 allocations: 192 bytes)

             Extrema new :  0.049575 seconds (4 allocations: 160 bytes)
             Extrema old :  0.051199 seconds (4 allocations: 160 bytes)

         QuantileSGD new :  0.618786 seconds (8 allocations: 448 bytes)
         QuantileSGD old :  1.510063 seconds (94 allocations: 76.310 MB, 0.25% gc time)

          QuantileMM new :  0.772390 seconds (10 allocations: 656 bytes)
          QuantileMM old :  1.720187 seconds (96 allocations: 76.310 MB, 1.33% gc time)

             Moments new :  0.099977 seconds (6 allocations: 288 bytes)
             Moments old :  0.071233 seconds (5 allocations: 208 bytes)

  ============================================
  Performance on .2 million × 500 observations
  ============================================

               Means new :  0.010688 seconds (17 allocations: 2.469 KB)
        Means old (VERY SLOW) :

       Means (batch) new :  0.009797 seconds (16 allocations: 1.578 KB)
       Means (batch) old :  0.009701 seconds (17 allocations: 1.609 KB)

           Variances new :  0.041235 seconds (40 allocations: 7.766 KB)
    Variances old (VERY SLOW) :

   Variances (batch) new :  0.042492 seconds (37 allocations: 5.109 KB)
   Variances (batch) old :  0.037979 seconds (38 allocations: 5.156 KB)

           CovMatrix new :  2.819448 seconds (200.00 k allocations: 9.155 MB)
    CovMatrix old (VERY SLOW) :

   CovMatrix (batch) new :  0.072263 seconds (23 allocations: 237.063 KB)
   CovMatrix (batch) old :  0.065741 seconds (19 allocations: 80.625 KB)

  ===========================================
  Performance on 1 million × 5 design matrix
  ===========================================

              LinReg new :  0.033935 seconds (35 allocations: 45.779 MB, 9.72% gc time)
              LinReg old :  0.053115 seconds (37 allocations: 45.779 MB, 43.14% gc time)
           SparseReg old :  0.092843 seconds (33 allocations: 45.779 MB, 71.34% gc time)
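
For readers unfamiliar with what the non-batch Mean and Variance rows above are timing, here is a minimal sketch of a single-pass update, one observation at a time, using Welford's method. The `RunningVar` type and the `update!` and `variance` functions are hypothetical names chosen for illustration; they are not the OnlineStats API and not necessarily the algorithm either package uses. The sketch is written in current Julia syntax.

```julia
# Hypothetical type, not part of OnlineStats: running mean and variance
# updated one observation at a time (Welford's method).
mutable struct RunningVar
    n::Int         # observations seen so far
    mean::Float64  # running mean
    m2::Float64    # running sum of squared deviations from the mean
end

RunningVar() = RunningVar(0, 0.0, 0.0)

# Fold a single observation into the running state.
function update!(s::RunningVar, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n           # new mean
    s.m2 += delta * (x - s.mean)    # uses both the old and the new mean
    return s
end

# Unbiased sample variance; NaN until at least two observations are seen.
variance(s::RunningVar) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

s = RunningVar()
for x in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
    update!(s, x)
end
```

The state is three numbers regardless of how many observations have been seen, so nothing needs to be allocated per update. A batch update would instead fold a whole vector of observations into the state in one call, which is what the "(batch)" rows measure.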

Changes:

I wanted to get this out in the open before I start moving things over to OnlineStats. The change with the biggest impact is how weightings are handled.

tbreloff commented 8 years ago

It'll take me a little while to go through this in detail, but just based on your summary: :+1:

joshday commented 8 years ago

These changes are now in master. New docs coming soon.

joshday commented 8 years ago

Changes are now in METADATA. The docs have mostly moved to the README to make them easier to maintain.