munificent / craftinginterpreters

Repository for the book "Crafting Interpreters"
http://www.craftinginterpreters.com/

Hash table probing optimization #895

Closed invertego closed 3 years ago

invertego commented 3 years ago

The chapter on optimization begins with an excellent preface on the importance of measurement in the process of optimizing any piece of code. It leads into a powerful yet simple optimization to hash table probing, based on the empirical observation that modulo operations (or more generally, division) have much, much higher latency than most other types of arithmetic. It explains how this can be remedied by replacing the modulo with a bitwise AND.
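A minimal sketch of that replacement (this is not the book's exact code; it just illustrates the identity the optimization relies on, which holds only when the capacity is a power of two):

```c
#include <stdint.h>

// When capacity is a power of two, hash % capacity and
// hash & (capacity - 1) produce the same bucket index,
// but the AND avoids the high-latency integer division.
static uint32_t indexByMod(uint32_t hash, uint32_t capacity) {
    return hash % capacity;
}

static uint32_t indexByMask(uint32_t hash, uint32_t capacity) {
    return hash & (capacity - 1);
}
```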

It then spends a couple pages demonstrating how to cache the result of an integer decrement to the table capacity. Compared to the modulo replacement, this optimization feels unmotivated by empirical observation (or at least, no such motivation is provided). In contrast to division, subtraction by one should be very cheap on most architectures, and I would be very curious to see a Lox benchmark that shows a measurable difference before and after the latter optimization.
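For context, the two variants under discussion look roughly like this (a hypothetical sketch; the struct and field names are illustrative, not the chapter's exact code):

```c
#include <stdint.h>

// Variant A: store the capacity and compute the mask on every probe.
// The "- 1" is a single-cycle decrement.
typedef struct {
    int capacity; // always a power of two
    // Entry* entries; ...
} TableA;

static uint32_t probeA(TableA* table, uint32_t hash) {
    return hash & (uint32_t)(table->capacity - 1);
}

// Variant B: cache capacity - 1 as a mask, the second optimization
// the chapter originally presented.
typedef struct {
    int capacityMask; // capacity - 1, kept in sync on every resize
    // Entry* entries; ...
} TableB;

static uint32_t probeB(TableB* table, uint32_t hash) {
    return hash & (uint32_t)table->capacityMask;
}
```

Keeping `capacityMask` in sync everywhere the table grows is exactly where the off-by-one risk mentioned below comes from.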

The way that these two optimizations are presented, as two halves of the same section, might lead readers to walk away with the perception that both are providing a similar benefit, when in fact the speedup can be attributed mostly (if not entirely) to the replacement of the modulo operation. I would suggest touching upon this in an aside or a postscript that challenges the reader to predict the relative merits of these optimizations and then to take measurements and compare them against those predictions.

Thank you again for the amazing book! It took me over a year, but I'm glad to have made it through to the very end. 😃

cm1776 commented 3 years ago

Hello @invertego,

The amount of code that one has to change in order to cache the result of an integer decrement to the table capacity is, in my opinion, such as to invite off-by-one bugs. Consequently, I did not bother to cache the decrement operation in my own implementation.

I believe that section 30.3.7 (https://craftinginterpreters.com/optimization.html#evaluating-performance) answers the concerns you raised in the paragraph that begins with The way that these two optimizations are presented, as two halves of the same section, might lead readers to walk away ...

Cheers

invertego commented 3 years ago

@cm1776 RE: the potential for off-by-one bugs (which Bob also calls out in the chapter), I am inclined to agree that the cost in code clarity may outweigh the benefit in performance.

As for the section you linked, it's full of good advice, but it's also specifically about the difficulty of benchmarking the NaN boxing optimization and only mentions the hash table optimization in passing.

munificent commented 3 years ago

This is a good point. You're right that the mod is certainly almost, if not all, of the optimization. I confess I didn't do any profiling to see whether storing the capacity and calculating the bit mask each time was noticeably slower. It seemed like an easy calculation to avoid doing each time, so I didn't see any reason not to.

But you certainly piqued my curiosity. I went ahead and measured the performance when storing the bit mask (and thus avoiding the - 1) versus continuing to store the capacity. The number of batches executed in five runs for each are:

storing bit mask:
6749
6762
6815
6835
6846

storing capacity:
6757
6835
6843
6845
6851

You're exactly right. There's no win at all to caching the - 1. In retrospect, that makes sense. There's likely a pipeline stall somewhere else in there so we can slot a single-cycle DEC in there without costing us much of anything.

I went ahead and removed this optimization which shortens the chapter and avoids a bunch of pointless code churn. Thank you for the suggestion!