taoensso / timbre

Pure Clojure/Script logging library
https://www.taoensso.com/timbre
Eclipse Public License 1.0

Limit length of `times` arrays in profiling `*pdata*` #57

Closed kyptin closed 10 years ago

kyptin commented 10 years ago

The time that a labeled section of code takes during profiling is stored in a vector of times in *pdata*. Currently, all of the times are accumulated first, and the summary statistics (min, max, mean, etc.) are computed after the code has finished.

This approach works but can lead to OutOfMemory errors for code that is repeated heavily, e.g. the function being mapped over a large sequence.

An alternate approach is to compute the stats using a "streaming" (or "one-pass") approach, in which the times themselves are not saved, but are used merely to update the current values of the summary statistics. The stats are amenable to such an approach—even the MAD, so long as M stands for "mean" and not "median". One problem that could be introduced by this approach is a loss of precision, unless fractions and big-int math were used, which could get unwieldy after millions of iterations.
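
For illustration, a minimal sketch of what a one-pass update could look like (hypothetical names, not Timbre's internals). Note that the MAD here is only approximate, since each absolute deviation is taken against the mean-so-far rather than the final mean:

(defn update-stats
  "Folds one new time `t` into the running stats; no times are retained."
  [{:keys [n sum t-min t-max mad-sum]
    :or   {n 0, sum 0, mad-sum 0, t-min Double/POSITIVE_INFINITY, t-max 0}} t]
  (let [n'       (inc n)
        sum'     (+ sum t)
        mean'    (/ sum' n')
        mad-sum' (+ mad-sum (Math/abs (double (- t mean'))))]
    {:n n' :sum sum' :mean mean'
     :t-min (min t-min t) :t-max (max t-max t)
     :mad-sum mad-sum' :mad (/ mad-sum' n')}))

;; (reduce update-stats {} [12 15 11 300 14]) ; => running summary, no stored times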

Another approach, suggested by @ptaoussanis, is to let the times accumulate to a point, then summarize them into the statistics. For example, for the first 10,000 times, simply accumulate them in the vector. Then, use the next time as a trigger point for summarizing the first 10k times, and add the 10,001st time as the first time in a new accumulator vector. At the 20,001st time, times 10,001 through 20,000 are folded into the summary statistics, and the accumulator vector is emptied.
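
Roughly, the trigger could look something like this (a sketch only, with a hypothetical `summarize-into` helper and chunk size; the concrete implementation that grew out of this thread appears further down as `merge-times>stats!`):

(def ^:private chunk-size 10000)

(defn add-time
  "Accumulates `t` until `chunk-size` times are buffered, then folds the buffer
  into `stats` (via the caller-supplied `summarize-into`) and starts a fresh
  buffer seeded with `t`."
  [{:keys [times stats] :or {times [] stats {}}} summarize-into t]
  (if (< (count times) chunk-size)
    {:times (conj times t) :stats stats}
    {:times [t] :stats (summarize-into stats times)}))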

A significant challenge faced by both approaches is how to do the calculation apart from the profiled code. In other words, how can you prevent the statistics calculations from falsely inflating the amount of time reported on the 10,001st time? I don't have a good solution to this obstacle. If this is a deal-breaker, I suggest documenting the limitation somewhere.

ptaoussanis commented 10 years ago

Oh, parallel replies :-) Just followed-up on the other thread: https://github.com/ptaoussanis/timbre/issues/56#issuecomment-37756335

A significant challenge faced by both approaches is how to do the calculation apart from the profiled code. In other words, how can you prevent the statistics calculations from falsely inflating the amount of time reported on the 10,001st time? I don't have a good solution to this obstacle. If this is a deal-breaker, I suggest documenting the limitation somewhere.

This is a good observation but I wouldn't worry about it too much. 1. We can do the stats aggregation in another thread. 2. The stats aggregation is pretty cheap, and would be done rarely. The amortised time impact should be low (if not negligible) compared to the (presumably quite expensive) profiled calls.

The biggest counter-argument to this change IMO is: is the extra code complication worth it if no one's ever going to encounter a problem in realistic use cases?

(This is a really well-formulated issue btw - thanks for that!)

kyptin commented 10 years ago

Makes sense that you could do the calculation in another thread.

Yeah, I'm on the fence about this, too. I could see it being an uncommon need to go beyond 10m iterations. But then, when you need it, you need it. I don't really know. I wonder what a well established profiling library, maybe in the Java world, does about this…

Personally, I'd be satisfied with a note somewhere in the documentation that mentions the limitation and possible workarounds.

ptaoussanis commented 10 years ago

But then, when you need it, you need it.

That's fair enough :-)

Okay, let's do this - let's keep the issue open. I'll continue to think about it. Feel free to add any extra thoughts if you have them.

Probably what this will come down to - someone needs to take a stab at implementing the change to see what it'd look like in practice. If it's not too unwieldy, we can pick it up just to be on the safe side. If it does turn out to be unwieldy, then we'll need to decide whether it's worth it :-)

I've got 15 mins now - will see what I can come up with in that time.

kyptin commented 10 years ago

That makes a lot of sense on all accounts. Thanks!

kyptin commented 10 years ago

Wow, I hit this limitation earlier than I thought. I now have a sequence of about 50 million items and 6 function calls per item to profile. That's 50M items * 6 function calls per item * 8 bytes = about 2.2 GB of times. That's a lot of profiling overhead. The chunked statistics approach would have about 10k times * 6 functions * 8 bytes = about 1/2 MB of overhead.

ptaoussanis commented 10 years ago

That's a lot of profiling overhead.

That's a lot of profiling! :-) Even without the memory overhead, that's going to incur a significant performance hit. Is there a reason you need/want to profile such high-rep targets - instead of, say, a smaller but representative test case?

Anyway, let me finish up the experiment from earlier and let's compare - need another ~10-15 mins.

ptaoussanis commented 10 years ago

Okay, this could serve as a proof-of-concept: https://github.com/ptaoussanis/timbre/compare/dev?expand=1 (EDIT: Updated URL to reflect most recent dev commit)

A few notes:

Feedback welcome!

kyptin commented 10 years ago

True: it is a lot of profiling! Good question. The main reason is that the smaller but representative test case is hard to put together. I'm constrained by needing to keep real customer data out of the code base, which means I would need to anonymize the real user ids (in a consistent way, not just stripping them out). And in reality, I would probably need to do a bit of information auditing to be sure no other information should be considered privileged. Which is not to say I couldn't do all that, just that it's a hassle. :-)

Haven't looked carefully at your solution yet, but the approach seems sound to me. Let me try it out on my dataset and report back.

kyptin commented 10 years ago

I looked through the code, and it looks good for a first pass. I think that, with some care, we can shift some things around to make the code as legible as before. So that source of complexity doesn't seem problematic.

The bigger problem in my mind is that the stats merge is included in the reported times. That will, of course, affect the stats, which means that we've traded one kind of overhead for another. (I guess, in theory, some amount of overhead is necessary.) Anyway, the problem (Clojure's STM preferring the shorter swap!s) is clear to me. I'll be thinking about how it could be addressed.

kyptin commented 10 years ago

This worked great on my dataset—thanks!

One bug I noticed is that the MAD is not being calculated correctly. For any label with >= 50k calls, the MAD is always 0ns. I haven't looked to see why that is, but it occurs to me that we'll need to take a different approach for the MAD. Rather than accumulating the MAD, I think we'll need to keep the signs as we accumulate the residuals, so that adjustments to the mean can be accommodated. More specifically, I propose that we keep the MAD in the output, but accumulate the sum of the residuals (i.e. sum(X_i - X_mean for each i)) as we go.

In other words, if the count-so-far n is 100k, the accumulated mean (of the first 50k measurements) m1 is 0.5, and the new mean m2 (including all 100k measurements) is 0.4, we'll need to recalculate the sum-of-residuals number to take the new mean into account. I think this recalculation should amount to adding 0.1 * 50k to the sum-of-residuals number.
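
Just to check the arithmetic of the proposed adjustment with the numbers above (a sketch only; as the discussion below shows, the plain signed residual sum needs more care than this):

(let [n1 50000 ; measurements already folded into the stats
      m1 0.5   ; their accumulated mean
      m2 0.4]  ; new overall mean after the next 50k measurements
  ;; each old residual (x - m1) becomes (x - m2) = (x - m1) + (m1 - m2),
  ;; so the stored sum-of-residuals shifts by n1 * (m1 - m2):
  (* n1 (- m1 m2))) ; => ~5000.0, i.e. 0.1 * 50k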

Tell me if this isn't clear; I can explain further.

ptaoussanis commented 10 years ago

One bug I noticed is that the MAD is not being calculated correctly.

Ahh, forgot a multiplication - thanks!

we'll need to recalculate the sum-of-residuals number to take the new mean into account.

That's interesting - hadn't considered that. Figured we'd just accept a loss of accuracy proportional to the variance between timing group means. I'll play around with that tonight - nice idea!

That will, of course, affect the stats, which means that we've traded one kind of overhead for another.

That's an inherent trade-off with accumulating stats unfortunately. This'd be true (due to STM contention) even if we could run the merge op off the main thread. In this case we're trading a small, measurable perf impact against the potential for unbounded memory use.

I've bumped the GC limit to kick in only on jobs targeting >100k reps. Even a million would be reasonable I think (~10MB of memory). Really, we're doing this just to account for the rare instances of multi-million rep targets.

Will get back to you if I have a chance to play with an improved MAD a bit later. Cheers :-)

ptaoussanis commented 10 years ago

Thought about this a little more; I think accounting for mean variance is going to be tough. A naive sum (sum(X_i - X_mean for each i)) will always be zero without an absolute diff.
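
Spelled out, with X_mean = sum(X_i)/n:

sum(X_i - X_mean) = sum(X_i) - n * X_mean = n * X_mean - n * X_mean = 0

so the plain signed sum carries no information about spread.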

But then we have to carry info for the distribution of signed differences, and somehow reconstitute that when recalculating residuals with the new population mean. It's not immediately obvious to me how we'd approach that (any ideas?).

A more obvious alternative might be some kind of streaming variance statistic - I recall there are some with nice properties that are very simple to implement. Not a big fan of variance in this context though.

Have limited time to look into this myself atm - are you familiar with any stream-capable MAD methods?

Otherwise - practically speaking: the current naive approach produces an error proportional to the variance between timing group means. Given that the groups are large (~100k timings) - it seems reasonable to conclude that group mean variance will be low in most real-world applications. It's not ideal, but it might be good enough for our purposes here if no obvious alternatives present?

kyptin commented 10 years ago

A naive sum (sum(X_i - X_mean for each i)) will always be zero without an absolute diff.

Ha, good point! I missed that.

But then we have to carry info for the distribution of signed differences, and somehow reconstitute that when recalculating residuals with the new population mean. It's not immediately obvious to me how we'd approach that (any ideas?).

No...

A more obvious alternative might be some kind of streaming variance statistic - I recall that there's some with nice properties that are very simple to implement. Not a big fan of variance in this context though.

Have limited time to look into this myself atm - are you familiar with any stream-capable MAD methods?

No, I'm not, but I did find this SO answer (http://stackoverflow.com/a/3907612/202292), which appears to do the trick.

Otherwise - practically speaking: the current naive approach produces an error proportional to the variance between timing group means. Given that the groups are large (~100k timings) - it seems reasonable to conclude that group mean variance will be low in most real-world applications. It's not ideal, but it might be good enough for our purposes here if no obvious alternatives present?

I think this assumes that the underlying mean doesn't change as the program runs, which I think is not a good assumption in the real world. For example, as the heap fills up, it takes longer to create and access objects. So I think this approach could end up hiding some pretty significant errors--and not just in unusual cases, either.

I agree with your preference for MAD over standard deviation, but if the above SO answer doesn't work, the Welford algorithm (http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm) might be a good option to compute the standard deviation online.
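
For reference, a minimal Clojure sketch of Welford's online algorithm (the textbook formulation, not Timbre code; names are illustrative):

(defn welford-step
  "Folds one sample `x` into running {:n :mean :m2} state (Welford's update)."
  [{:keys [n mean m2] :or {n 0, mean 0.0, m2 0.0}} x]
  (let [n'     (inc n)
        delta  (- x mean)   ; deviation from the old mean
        mean'  (+ mean (/ delta n'))
        delta2 (- x mean')] ; deviation from the new mean
    {:n n' :mean mean' :m2 (+ m2 (* delta delta2))}))

(defn population-variance [xs]
  (let [{:keys [n m2]} (reduce welford-step {} xs)]
    (when (pos? n) (/ m2 n))))

;; (population-variance [2.0 4.0 4.0 4.0 5.0 5.0 7.0 9.0]) ; => ~4.0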

ptaoussanis commented 10 years ago

the Welford Algorithm might be a good option to compute the standard deviation online.

Yeah, that seems to be the one that comes up. (Note that the earlier SO answer is also actually the Welford/Knuth algo).

Interestingly, the naive approach (with error proportional to between-group variance) seems to produce a better approximation than Welford/Knuth. This could be explained by the fact that Welford/Knuth was designed to operate on inputs 1-at-a-time whereas the naive approach we're using has batches of n>>1 inputs (100k currently). I'm surprised by that - figured there'd be some more sophisticated technique.

Your original suggestion got me thinking though, and I ended up experimenting with a way of correcting for inaccurate prior means by bucketing deltas into -ive and +ive groups:

(defn merge-times>stats!
  ([] ; -> {<pid> <merged-stats>}
   (reduce (fn [m pid] (assoc m pid (merge-times>stats! pid)))
    {} (keys (or @*pdata* {}))))

  ([id] ; -> <merged-stats>
   (->
    (swap! *pdata*
      (fn [m]
        (let [{:as   m-id
               :keys [times ntimes stats]
               :or   {times  []
                      ntimes 0
                      stats  {}}} (get m id {})]
          (if (empty? times) m
            (let [ts-count   (max 1 ntimes)
                  ts-time    (reduce + times)
                  ts-mean    (/ ts-time ts-count)
                  ;; Non-streaming Mean Absolute Deviation = (sum of |t - ts-mean|) / n.
                  ;; To get streaming we instead collect deltas separated by
                  ;; sign. This seems to be more accurate than Knuth/Welford,
                  ;; Ref. http://goo.gl/mx5eSK, http://goo.gl/QLSfOc - probably
                  ;; due to larger batch sizes.
                  [ts-mad-s+ ts-mad-s-]
                  (reduce (fn [[s+ s-] t]
                            (let [delta (- t ts-mean)]
                              (if (>= delta 0) [(+ s+ delta) s-]
                                               [s+ (+ s- delta)])))
                          [0 0] times)
                  ;;
                  s-count   (+ (:count stats 0) ts-count)
                  s-time    (+ (:time  stats 0) ts-time)
                  s-mean    (/ s-time s-count)
                  ;;
                  s-mad-s+  (if-not (:count stats) ts-mad-s+
                              (+ (:mad-s+ stats) ts-mad-s+
                                 (* (:count stats) (- s-mean (:mean stats)))))
                  s-mad-s-  (if-not (:count stats) ts-mad-s-
                              (+ (:mad-s- stats) ts-mad-s-
                                 (* (:count stats) (- s-mean (:mean stats)))))
                  s-mad-sum (+ (Math/abs (long s-mad-s+))
                               (Math/abs (long s-mad-s-)))
                  s-mad     (/ s-mad-sum s-count)
                  ;;
                  s-min (apply min (:min stats Double/POSITIVE_INFINITY) times)
                  s-max (apply max (:max stats 0)                        times)]
              (assoc m id
                (assoc m-id
                  :times []
                  :ntimes 0
                  :stats {:count   s-count
                          :min     s-min
                          :max     s-max
                          :mean    s-mean
                          :mad-s+  s-mad-s+
                          :mad-s-  s-mad-s-
                          :mad     s-mad
                          :time    s-time})))))))
    (get-in [id :stats]))))

Even with pathological times (very high between-group variances), this seems to approximate the true MAD quite well. Didn't have a lot of time to play with this today, but I'll push what I have now & you can experiment.

ADDENDUM: After noting that Welford/Knuth is generally considered pretty good, and our naive approach is already its equal or superior - am inclined to suggest we call it a day and go with the less convoluted approach. What do you think?

The naive approach, for comparison:

(defn merge-times>stats!
  ([] ; -> {<pid> <merged-stats>}
   (reduce (fn [m pid] (assoc m pid (merge-times>stats! pid)))
    {} (keys (or @*pdata* {}))))

  ([id] ; -> <merged-stats>
   (->
    (swap! *pdata*
      (fn [m]
        (let [{:as   m-id
               :keys [times ntimes stats]
               :or   {times  []
                      ntimes 0
                      stats  {}}} (get m id {})]
          (if (empty? times) m
            (let [ts-count   (max 1 ntimes)
                  ts-time    (reduce + times)
                  ts-mean    (long (/ ts-time ts-count))
                  ts-mad-sum (long (reduce + (map #(Math/abs (long (- % ts-mean)))
                                                  times))) ; Mean absolute deviation
                  ;;
                  s-count   (+ (:count stats 0) ts-count)
                  s-time    (+ (:time  stats 0) ts-time)
                  s-mean    (long (/ s-time s-count))
                  s-mad-sum (long (+ (:mad-sum stats 0) ts-mad-sum))
                  s-mad     (long (/ s-mad-sum s-count))
                  s-min (apply min (:min stats Double/POSITIVE_INFINITY) times)
                  s-max (apply max (:max stats 0)                        times)]
              (assoc m id
                (assoc m-id
                  :times []
                  :ntimes 0
                  :stats {:count   s-count
                          :min     s-min
                          :max     s-max
                          :mean    s-mean
                          :mad-sum s-mad-sum
                          :mad     s-mad
                          :time    s-time})))))))
    (get-in [id :stats]))))

kyptin commented 10 years ago

Wow, good catch about the SO answer being an adaptation of the Welford algorithm. I didn't notice that.

I don't really understand the purpose of bucketing the deltas. But in any case, yes, the relatively naive approach sounds good to me, at least for now. Nice work. :-)

At some point in the future, we could set up an experiment to try out various techniques. We could run some program that will produce several million timing measurements, with heap contention towards the end of the run in increasing severity (maybe ending with an OutOfMemory error), and instead of summarizing the stats online, just log them to a file. Then we could try out various algorithms on the measurements, offline. Just an idea for later.

ptaoussanis commented 10 years ago

I don't really understand the purpose of bucketing the deltas.

By definition the sum of the (signed) differences from the mean will always be zero. You'd normally get around this by squaring or taking the absolute value of the differences to get a summable sequence. That kills our ability to compensate for prior inaccuracies in the mean though - so an alternative would be just to keep the +ive and -ive signed differences separate and only apply the sign elimination at the very end of the process.

There's still a loss of accuracy this way (depends on the size of the mu compensation) - so it's not 100% accurate, but it can be a lot more accurate than the naive/Welford approach.

Anyway, I think in practice our naive approach (basically Welford with large groups instead of individual samples) will be quite resistant to most kinds of real-world behaviour.

with heap contention towards the end

This is a good observation. Since our group sizes are large though (currently 100k), it's unlikely that between-group variances will be large. Usually times would start creeping up gradually, so the effect would be continuous between groups. What would trip us up is some kind of large discrete step-wise jump: like the first 100k samples take 100ms each and the next 100k take 1000ms each.

If you're curious, I ran some tests on this already - they're at the bottom of the profiling ns:

(let [;; MAD = 154.0ms, natural:
      ;; n->s {0 10 1 100 2 50 3 500 4 8 5 300 6 32 7 433 8 213 9 48}
      ;; MAD = 236.0ms, pathological:
      n->s {0 10 1 11 2 5 3 18 4 7 5 2 6 300 7 400 8 600 9 700}]
  (with-redefs [stats-gc-n 3]
    (profile :info :high-sigma (dotimes [n 10] (p :sleep (Thread/sleep (n->s n)))))))

So this is a synthetic test that represents a kind of upper bound on how bad our approximation could be.

The (large group) naive method gets a 36ms MAD (85% error) with both pathological samples and pathological group sizes (3 samples per group in this case).

The compensated method (bucketed signed differences) gets a ~260ms MAD (~10% error) if I recall correctly.

So there's a big difference in this synthetic case, but I feel the difference in real world use will be much smaller due to: continuous rather than discrete mean changes, much larger sample groups (100k vs 3).

EDIT: I didn't test the 1-by-1 Welford, but it'll be strictly worse than the 85% (large group) error.

and instead of summarizing the stats online, just log them to a file

Heh heh, I never considered that :-) Sure we could just log (all) times to disk every 100k iterations, then read them back on job completion to generate stats. No approximation would be necessary - we'd have all the exact times. That'd have been easier :-) This is probably (much) faster though, so probably still a fair tradeoff.
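
For what it's worth, a throwaway sketch of that log-to-disk idea (hypothetical file format and helpers, not anything Timbre does): append each chunk of times to a file as it fills up, then read everything back at the end and compute exact stats with no approximation:

(require '[clojure.java.io :as io])

(defn append-times!
  "Appends a chunk of nanosecond times to file `f`, one per line."
  [f times]
  (with-open [w (io/writer f :append true)]
    (doseq [t times] (.write w (str t "\n")))))

(defn exact-stats
  "Reads all logged times back from `f` and computes exact summary stats."
  [f]
  (with-open [r (io/reader f)]
    (let [ts   (mapv #(Long/parseLong %) (line-seq r))
          n    (count ts)
          mean (/ (reduce + ts) n)]
      {:count n
       :mean  mean
       :min   (reduce min ts)
       :max   (reduce max ts)
       :mad   (/ (reduce + (map #(Math/abs (double (- % mean))) ts)) n)})))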

kyptin commented 10 years ago

...and only apply the sign elimination at the very end of the process.

OK, I think that makes sense. I'm still confused about one part, though. Say we have 1M times, and a stats calculation interval of 100k. Are you talking about keeping 1M residuals around until the final calculation at the very end? Or are you talking about only keeping 100k residuals? If the former, then we are still using as much memory as if we just accumulated the times, like you were doing initially. If the latter, I think that means we can't correct anything except the immediately prior bin. But maybe I'm still not getting the genius of your approach. :-)

Since our group sizes are large (currently 100k), it's unlikely that between-group variances will be large. Usually they'd start creeping up gradually, so the effect would be continuous between groups. What would trip us up is some kind of large discrete step-wise jump: like the first 100k samples take 100ms each and the next 100k take 1000ms each.

I may just be dense here--apologies if so!--but I still don't see why between-group variances won't be large. I think in the heap contention example (e.g. with a retained head of a large sequence as previously discussed, which eventually throws an OutOfMemory error), the group means would follow, I think, something like an exponential curve. In that case, I think you'd have a pretty high variance of the group means, which would yield a high error in the MAD.

If you're curious, I ran some tests on this already - they're at the bottom of the profiling ns:

Nice. :-)

Sure we could just log (all) times to disk every 100k iterations, then read them back on job completion to generate stats.

Sorry, I probably wasn't clear. I wasn't proposing logging as a matter of policy. You're right: that would be pretty slow, not to mention messy. I just mean as an experiment. For example, rig up a simple head-retention example, log times to disk, then try out various approaches to statistics and whatnot on the dataset that we got from that one example. Then we could know exactly what the error is for various stats calculations in a large, real-world, pessimistic use case.

kyptin commented 10 years ago

I created a Gist to dump run-times to a file in a head-retention scenario, then run different variance calculations on it:

https://gist.github.com/kyptin/9714908

In a REPL with -Xms512m and -Xmx512m, this ran for maybe 10 minutes and produced 16,204,256 run-times before throwing an OutOfMemory error.

Calling -main produced the following output:

 method  |     param  |  variance
  exact  |            |  1.558484131e+12
welford  |            |  1.558484131e+12
chunked  |      1000  |  2.640064959e+12
chunked  |     10000  |  2.235416534e+12
chunked  |    100000  |  1.278604825e+13
chunked  |   1000000  |  5.869736299e+12

I'm pretty sure that I haven't gotten your method (which I'm calling chunked here) correct. Care to fix and recalculate, to see how close to the exact answer you get?

ptaoussanis commented 10 years ago

Hi Jeff, sorry - won't have much time for the next couple days so only skimmed this briefly (apologies for any errors!).

Looks like your chunked definition may be off. Intuitively: think about how you'd modify the Welford algo if you were receiving pairs (or n-sized groups) of times on each iteration instead of individual times. Would it be optimal to just ignore the groups and reduce on the times individually? If not, why not?

As a degenerate case, imagine if n = the population size. Unless I'm missing something (quite possible!), Welford with chunks should always be >= Welford w/o chunks since you're performing the same operation but with more accurate information.

Also, if I understand your data correctly - it looks like your times are uniform from start to finish? That's a perfect case for Welford/chunked since the variance is stable. To better test their limitations, try times that grow rapidly with each iteration.

Both should strain. The test I ran with partitioned signed differences should be in the commit history if you're curious to compare against that. (It performed much better in the tests I ran, but I didn't test it thoroughly or think deeply about possible problems).

Welford (chunked or individual) & similar algos are optimised for the limit case where the mean sample value eventually converges. In these cases Welford & co. will converge quickly to the exact value. Does that make sense?

Anyway, again, please take my commentary with a grain of salt - it's very possible I'm misunderstanding something. Have a good weekend, cheers :-)

kyptin commented 10 years ago

Thanks for your comments, Peter. And sorry for the late response. (I generally don't have much time to work on this project except on the weekends.)

Sorry, I don't quite get Welford well enough to follow your logic. But, I believe you. :-)

And yes, I was surprised to see that my times are uniform. I thought that heap contention would make my times increase as the program ran because of garbage collection. But I think the garbage collection doesn't ever happen in the function being mapped (* in my example), but in the cons-ing that happens as a part of for. So, back to the drawing board.

However, for now, I yield: please feel free to implement your algorithm, and I'll get out of your way and cheer you on. :-)

ptaoussanis commented 10 years ago

No problem Jeff, this turned out to be a fun problem to hack on :-) The current implementation's just pending the next Timbre release to hit master.

Closing for now, please feel free to reopen whenever if you have any further thoughts. Enjoy the rest of your weekend, cheers :-)