mirage / irmin

Irmin is a distributed database that follows the same design principles as Git
https://irmin.org

irmin-pack: unify LRUs and add max memory config #2254

Closed · metanivek closed this 1 year ago

metanivek commented 1 year ago

At a high-level, this PR:

  1. Unifies the previous three LRUs in irmin-pack into one
  2. Adds a configuration option to bound the LRU by memory used instead of entry count (a usage sketch follows this list)
  3. Maintains the previous cap on "large" contents objects under both the entry-based and memory-based caps
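
As a rough usage sketch of the new memory bound (the parameter name lru_max_memory here is an assumption based on this PR's intent, not necessarily the final API; see the diff for the exact config key):

```ocaml
(* Hypothetical usage: cap the LRU at ~40MB, as in the "40mb-optim"
   benchmark run below. Parameter name is illustrative. *)
let config =
  Irmin_pack.config ~lru_max_memory:(40 * 1024 * 1024) "/path/to/store"
```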

When deciding how to count the memory usage of objects in the LRU, two paths were evaluated: Obj.reachable_words and repr's size function. The former was too slow in replay benchmarks, so the latter is used. The size is used as-is for commits and contents, and a correction factor, based on benchmark observations, is applied to inode sizes.
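
In sketch form, the two candidate weighing strategies look roughly like this (size stands in for repr's size function, and the inode correction shown is a placeholder, not the benchmarked factor):

```ocaml
(* Path 1: dynamically count every heap word reachable from the value.
   Precise, but it walks the whole object graph, which proved too slow
   in replay benchmarks. *)
let weight_reachable v = Obj.reachable_words (Obj.repr v) * (Sys.word_size / 8)

(* Path 2: a type-directed size (as computed by repr), used as-is for
   commits and contents, with an empirical correction for inodes. *)
let inode_correction = 2 (* placeholder; the real factor comes from benchmark observations *)

let weight_sized ~size ~is_inode v =
  let s = size v in
  if is_inode then s * inode_correction else s
```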

The definition of "large" for contents objects is 20kB. This seems like a reasonable value based on previous analysis of object sizes, which indicated that less than 0.1% of contents objects are larger than 512B. Note: the previous weight-based limit would allow any object under ~500kB into the LRU (based on Octez's default configuration of a 5000-entry cap), but lowering the per-object cap leaves room for more objects in the cache.
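
The guard itself is simple; a minimal sketch (add and weight stand in for the LRU's actual interface):

```ocaml
(* Contents objects above 20kB bypass the LRU entirely, under both the
   entry-based and the memory-based cap. *)
let large_contents_threshold = 20 * 1024

let maybe_cache ~add ~weight key v =
  if weight v <= large_contents_threshold then add key v
```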

Conclusions based on benchmarks (to be pushed to the benchmarks repo once I finish tidying the analysis and running a final bench):

  1. The entry-based cap still slightly outperforms the memory-based cap. My assumption is that the difference comes from the overhead of the memory-usage calculation for inodes.
  2. Both entry-based and memory-based caps can outperform 3.7.1 while using only slightly more memory.
  3. Past a certain size increase, the LRU starts hurting benchmark performance. More could be investigated here in the future. I did not use cachecache's LRU for this PR, but with some modifications to its code, it may provide better scaling; I ran its benchmarks against the modified irmin LRU from this PR, and cachecache does seem to scale better.

The final benchmark I want to run is an 80mb one, but here is some high-level analysis in the meantime.

Snippet of benchmark stats (percentages are relative to 3.7.1):

 |                          |   3.7.1   |   entry-15k    |   entry-30k    |   40mb-optim   |   80mb-optim
 | --                       | --        | --             | --             | --             | --
 | -- main metrics --       |           |                |                |                | 
 | CPU time elapsed         |   104m07s |    99m13s  95% |    97m38s  94% |   100m07s  96% |    98m20s  94%
 | Wall time elapsed        |   104m28s |    99m33s  95% |    97m56s  94% |   100m25s  96% |    98m38s  94%
 | TZ-transactions per sec  |   736.349 |   772.810 105% |   785.217 107% |   765.767 104% |   779.649 106%
 | TZ-operations per sec    |  4809.664 |  5047.817 105% |  5128.860 107% |  5001.817 104% |  5092.489 106%
 | Context.add per sec      | 14121.692 | 14820.934 105% | 15058.886 107% | 14685.872 104% | 14952.096 106%
 | tail latency (1)         |   0.224 s |   0.222 s  99% |   0.244 s 109% |   0.245 s 109% |   0.240 s 107%
 |                          |           |                |                |                | 
 | -- resource usage --     |           |                |                |                | 
 | disk IO (total)          |           |                |                |                | 
 |   IOPS (op/sec)          |   180_572 |   142_980  79% |   119_928  66% |   145_343  80% |   122_530  68%
 |   throughput (bytes/sec) |  18.060 M |  16.541 M  92% |  15.280 M  85% |  16.609 M  92% |  15.378 M  85%
 |   total (bytes)          | 112.826 G |  98.463 G  87% |  89.517 G  79% |  99.772 G  88% |  90.737 G  80%
 | disk IO (read)           |           |                |                |                | 
 |   IOPS (op/sec)          |   180_475 |   142_878  79% |   119_825  66% |   145_242  80% |   122_427  68%
 |   throughput (bytes/sec) |  10.639 M |   8.753 M  82% |   7.367 M  69% |   8.891 M  84% |   7.521 M  71%
 |   total (bytes)          |  66.466 G |  52.103 G  78% |  43.157 G  65% |  53.412 G  80% |  44.376 G  67%
 | disk IO (write)          |           |                |                |                | 
 |   IOPS (op/sec)          |        97 |       102 105% |       103 107% |       101 104% |       103 106%
 |   throughput (bytes/sec) |   7.421 M |   7.788 M 105% |   7.913 M 107% |   7.717 M 104% |   7.857 M 106%
 |   total (bytes)          |  46.360 G |  46.360 G 100% |  46.360 G 100% |  46.360 G 100% |  46.360 G 100%
 |                          |           |                |                |                | 
 | max memory usage (bytes) |   0.394 G |   0.401 G 102% |   0.428 G 108% |   0.416 G 106% |   0.469 G 119%
 | mean CPU usage           |      100% |      100%      |      100%      |      100%      |      100%     

Here are the memory usage comparisons. You can see:

  1. entry-100k and 500mb regress in performance (not shown is entry-60k, but it also regresses)
  2. entry-30k has the best performance (but 80mb is almost the same in both performance and memory)
  3. entry-15k has almost the same memory usage as 3.7.1 but better performance

[figure: mem-used, memory usage comparison across the benchmarked configurations]

art-w commented 1 year ago

To expand a bit on my comments: did you consider keeping the hashtables separate (since they cause type issues) but sharing the Q.t list across the different LRUs, so that they can make room in the other hashtables when needed?

https://github.com/mirage/irmin/blob/3222d26a3504bc6518dbc34a067be846aeeb36b7/src/irmin/lru.ml#L78-L84

The type q : (key * 'a) Q.t is an issue, but I believe it could be replaced by q : (unit -> unit) Q.t instead (a closure that frees an element from whichever hashtable it was in and subtracts the removed weight from the total used space).
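
As a rough sketch of the idea, using the stdlib Queue for brevity (the actual Q in lru.ml also supports promotion on access, which a plain FIFO cannot express):

```ocaml
(* One shared eviction queue of thunks across several typed hashtables;
   a sketch, not this PR's implementation. *)
type shared = {
  q : (unit -> unit) Queue.t; (* each thunk evicts one entry from its own table *)
  mutable used : int;         (* total weight across all tables *)
  capacity : int;
}

let make capacity = { q = Queue.create (); used = 0; capacity }

(* Register an entry from any typed LRU: [remove] must delete it from
   its own hashtable. Then pop the oldest entries, from whichever table
   they live in, until we are back under capacity. *)
let add t ~weight ~remove =
  t.used <- t.used + weight;
  Queue.push (fun () -> remove (); t.used <- t.used - weight) t.q;
  while t.used > t.capacity && not (Queue.is_empty t.q) do
    (Queue.pop t.q) ()
  done
```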

(Note that this is just a thought that popped into my head! I don't think there's a strong incentive to do things differently as the current solution works nicely :) )

metanivek commented 1 year ago

Ah, thanks for sharing your extra thoughts! I didn't consider changing the core LRU implementation much, since I wanted to keep the option of switching it out in the future (maybe a modified cachecache or some lock-free version, for instance). If it looks too complicated to bring the extensible type into the LRU, I think we should push this out to a future time when we evaluate the actual LRU implementation.