status-im / nimbus-eth2

Nim implementation of the Ethereum Beacon Chain
https://nimbus.guide

Significantly reduce memory usage from state caching #979

Closed tersec closed 4 years ago

tersec commented 4 years ago

The target is around 500 MB for beacon_node.

mratsim commented 4 years ago

linked #869

tersec commented 4 years ago

The most obvious, basic start will be to remove the 1-to-2 epoch storage and store only a few slots back, which in practice is all that's necessary at the moment.
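
For illustration, here is a minimal sketch of that approach (hypothetical types and API, not the actual beacon_node code): a small bounded cache that retains only the last few slots' states and silently drops older ones.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded cache keeping only the states of the last few
/// slots, instead of one to two epochs' worth. Anything older is dropped
/// and would have to be recomputed if ever needed.
struct RecentStateCache<State> {
    max_slots: usize,
    states: VecDeque<(u64, State)>, // (slot, state), oldest first
}

impl<State> RecentStateCache<State> {
    fn new(max_slots: usize) -> Self {
        Self { max_slots, states: VecDeque::new() }
    }

    /// Cache the state for `slot`, evicting the oldest entry if full.
    fn put(&mut self, slot: u64, state: State) {
        if self.states.len() == self.max_slots {
            self.states.pop_front();
        }
        self.states.push_back((slot, state));
    }

    /// Look up a slot's state, if it is still recent enough to be cached.
    fn get(&self, slot: u64) -> Option<&State> {
        self.states.iter().find(|(s, _)| *s == slot).map(|(_, state)| state)
    }
}
```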

mratsim commented 4 years ago

A tradeoff can be considered between storing intermediate states and recomputing them on demand.

The techniques come from the optimization of DAGs for deep learning / automatic differentiation.

I have detailed the optimization here; there are more complex ones, but it can be a start: https://github.com/mratsim/blocksmith/blob/9751250ed03bd510b358be83976ef0722594df81/hotdb.md#optimization

Due to both memory and computation constraints, we want memory use and CPU usage to be at worst O(log n).

How?

We can use a CTZ-based (count-trailing-zeros) optimization. It comes from a paper by Griewank (quoted below) and is used to optimize the storage and computation of directed acyclic graphs for deep learning.

Quoting https://rufflewind.com/2016-12-30/reverse-mode-automatic-differentiation

Saving memory via a CTZ-based strategy

OK, this section is not really part of the tutorial, but more of a discussion regarding a particular optimization strategy that I felt was interesting enough to deserve some elaboration (it was briefly explained in a paper by Griewank).

So far, we have resigned ourselves to the fact that reverse-mode AD requires storage proportional to the number of intermediate variables.

However, this is not entirely true. If we're willing to repeat some intermediate calculations, we can make do with quite a bit less storage.

Suppose we have an expression graph that is more or less a straight line from input to output, with N intermediate variables lying in between. So this is not so much an expression graph anymore, but a chain. In the naive solution, we would require O(N) storage space for this very long expression chain.

Now, instead of caching all the intermediate variables, we construct a hierarchy of caches and maintain this hierarchy throughout the reverse sweep:

  • cache_0 stores the initial value
  • cache_1 stores the result halfway down the chain
  • cache_2 stores the result 3/4 of the way down the chain
  • cache_3 stores the result 7/8 of the way down the chain
  • cache_4 stores the result 15/16 of the way down the chain
  • ...

Notice that the storage requirement is reduced to O(log(N)) because we never have more than log2(N) + 1 values cached.

During the forward sweep, maintaining such a hierarchy would require evicting older cache entries at an index determined by a [formula that involves the count-trailing-zeros function](https://github.com/Rufflewind/revad/blob/de509269fe878bc9d564775abc25c4fa663d8a5e/src/chain.rs#L96-L118).
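
As an aside, the eviction rule is compact enough to sketch directly. The following is a hypothetical Rust rendition of the forward-sweep bookkeeping as described (made-up names; not the linked chain.rs code):

```rust
/// Cache the value computed at 1-based position `pos`, first evicting
/// ctz(pos) of the most recently cached checkpoints. After n steps the
/// stack holds popcount(n) + 1 entries, never more than log2(n) + 1.
fn cache_step<T>(checkpoints: &mut Vec<(u64, T)>, pos: u64, value: T) {
    for _ in 0..pos.trailing_zeros() {
        checkpoints.pop();
    }
    checkpoints.push((pos, value));
}
```

Seeding the stack with a checkpoint at position 0 and calling this for positions 1 through 15 leaves checkpoints at exactly 0, 8, c, e, and f, which is where the walkthrough below starts.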

The easiest way to understand the CTZ-based strategy is to look at an example. Let's say we have a chain of 16 operations, where 0 is the initial input and f is the final output:

 0 1 2 3 4 5 6 7 8 9 a b c d e f

Suppose we have already finished the forward sweep from 0 to f. In doing so, we have cached 0, 8, c, e, and f:

 0 1 2 3 4 5 6 7 8 9 a b c d e f
                                ^
 X---------------X-------X---X-X

The X symbol indicates that the result is cached, while ^ indicates the status of our reverse sweep. Now let's start moving backward. Both e and f are available so we can move past e without issue:

 0 1 2 3 4 5 6 7 8 9 a b c d e f
                            ^
 X---------------X-------X---X-X

Now we hit the first problem: we are missing d. So we recompute d from c:

 0 1 2 3 4 5 6 7 8 9 a b c d e f
                            ^
 X---------------X-------X---X-X
                         |
                         +-X

We then march on past c.

 0 1 2 3 4 5 6 7 8 9 a b c d e f
                        ^
 X---------------X-------X---X-X
                         |
                         +-X

Now we're missing b. So we recompute starting at 8, but in doing so we also cache a:

 0 1 2 3 4 5 6 7 8 9 a b c d e f
                        ^
 X---------------X-------X---X-X
                 |       |
                 +---X-X +-X

We continue on past a:

 0 1 2 3 4 5 6 7 8 9 a b c d e f
                    ^
 X---------------X-------X---X-X
                 |       |
                 +---X-X +-X

Now 9 is missing, so recompute it from 8:

 0 1 2 3 4 5 6 7 8 9 a b c d e f
                    ^
 X---------------X-------X---X-X
                 |       |
                 +---X-X +-X
                 |
                 +-X

Then we move past 8:

 0 1 2 3 4 5 6 7 8 9 a b c d e f
                ^
 X---------------X-------X---X-X
                 |       |
                 +---X-X +-X
                 |
                 +-X

To get 7, we recompute starting from 0, but in doing so we also keep 4 and 6:

 0 1 2 3 4 5 6 7 8 9 a b c d e f
                ^
 X---------------X-------X---X-X
 |               |       |
 +-------X---X-X +---X-X +-X
                 |
                 +-X

By now you can probably see the pattern. Here are the next few steps:

 0 1 2 3 4 5 6 7 8 9 a b c d e f
            ^
 X---------------X-------X---X-X
 |               |       |
 +-------X---X-X +---X-X +-X
         |       |
         +-X     +-X

 0 1 2 3 4 5 6 7 8 9 a b c d e f
        ^
 X---------------X-------X---X-X
 |               |       |
 +-------X---X-X +---X-X +-X
         |       |
         +-X     +-X

 0 1 2 3 4 5 6 7 8 9 a b c d e f
        ^
 X---------------X-------X---X-X
 |               |       |
 +-------X---X-X +---X-X +-X
 |       |       |
 +---X-X +-X     +-X

 0 1 2 3 4 5 6 7 8 9 a b c d e f
    ^
 X---------------X-------X---X-X
 |               |       |
 +-------X---X-X +---X-X +-X
 |       |       |
 +---X-X +-X     +-X

 0 1 2 3 4 5 6 7 8 9 a b c d e f
    ^
 X---------------X-------X---X-X
 |               |       |
 +-------X---X-X +---X-X +-X
 |       |       |
 +---X-X +-X     +-X
 |
 +-X

 0 1 2 3 4 5 6 7 8 9 a b c d e f
^
 X---------------X-------X---X-X
 |               |       |
 +-------X---X-X +---X-X +-X
 |       |       |
 +---X-X +-X     +-X
 |
 +-X

From here it's fairly evident that the number of times the calculations get repeated is bounded by O(log(N)), since the diagrams above are just flattened binary trees and their height is bounded logarithmically.

Here is a demonstration of the CTZ-based chaining strategy.
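
Since only a sketch can be given here, the following illustrative Rust program (made-up names; not the post's demonstration and not nimbus-eth2 code) runs the 16-step example forward and then sweeps it in reverse:

```rust
struct CtzChain<T, F: Fn(&T) -> T> {
    step: F,
    /// Checkpoint stack of (position, value), sorted by position; after n
    /// forward steps it holds popcount(n) + 1 entries, i.e. O(log N).
    stack: Vec<(u64, T)>,
}

impl<T, F: Fn(&T) -> T> CtzChain<T, F> {
    fn new(initial: T, step: F) -> Self {
        CtzChain { step, stack: vec![(0, initial)] }
    }

    /// Forward sweep: compute the next value, evict ctz(pos) checkpoints,
    /// then cache the new value.
    fn advance(&mut self) {
        let pos = self.stack.last().unwrap().0 + 1;
        let next = (self.step)(&self.stack.last().unwrap().1);
        for _ in 0..pos.trailing_zeros() {
            self.stack.pop();
        }
        self.stack.push((pos, next));
    }

    /// Reverse sweep: drop the current position and return the value one
    /// step earlier, recomputing from the nearest checkpoint when it is
    /// missing. Recomputed values are re-cached by the same CTZ rule,
    /// counting positions relative to the checkpoint we restart from.
    fn retreat(&mut self) -> &T {
        let (pos, _) = self.stack.pop().expect("empty chain");
        assert!(pos > 0, "retreated past the start of the chain");
        let target = pos - 1;
        let mut local: u64 = 0; // steps taken since the restart checkpoint
        while self.stack.last().unwrap().0 < target {
            local += 1;
            let next_pos = self.stack.last().unwrap().0 + 1;
            let next = (self.step)(&self.stack.last().unwrap().1);
            for _ in 0..local.trailing_zeros() {
                self.stack.pop(); // evicts only entries from this restart
            }
            self.stack.push((next_pos, next));
        }
        &self.stack.last().unwrap().1
    }
}

fn main() {
    // Toy 16-entry chain where the value at position p is simply p.
    let mut chain = CtzChain::new(0u64, |x: &u64| x + 1);
    for _ in 0..15 {
        chain.advance();
    }
    // Cached positions now match the diagram above: 0, 8, c, e, f.
    let positions: Vec<u64> = chain.stack.iter().map(|e| e.0).collect();
    assert_eq!(positions, [0, 8, 12, 14, 15]);
    // Walk back to the start; missing values are recomputed on demand,
    // reproducing the recomputation steps shown in the diagrams.
    for expected in (0..15u64).rev() {
        assert_eq!(*chain.retreat(), expected);
    }
}
```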

As Griewank noted, this strategy is not the optimal one, but it does have the advantage of being quite simple to implement, especially when the number of calculation steps is not known a priori. There are other strategies that you might find interesting in his paper.

arnetheduck commented 4 years ago

oops, this is overdoing things a bit - effectively, we don't ever need anything older than finalized states - finalized states are capped at whatever time period it takes to eject all validators (~2 weeks at some point, don't remember the latest figure).

applying state transformations for non-epoch slots is "free", meaning we don't really need to cache them (probably) - there's a (much lower) upper bound on the number of states we'll ever actually see and need (for example, validating an attestation can be done with the latest epoch state on the branch of the block it points to, not necessarily the block state itself).
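
For instance, a hedged sketch of that shape (hypothetical types; `process_slot` stands in for the real per-slot transition, and nothing here is actual nimbus-eth2 code) would cache only epoch-boundary states and replay the cheap slot transitions on demand:

```rust
/// Slots per epoch; the exact value doesn't matter for the sketch.
const SLOTS_PER_EPOCH: u64 = 32;

/// Stub standing in for the real beacon state.
#[derive(Clone)]
struct BeaconStateStub {
    slot: u64,
    // ... the rest of the state would live here
}

/// Stand-in for the real (cheap) non-epoch slot transition.
fn process_slot(state: &mut BeaconStateStub) {
    state.slot += 1;
}

/// Rebuild the state at `slot` from the cached epoch-boundary state that
/// precedes it, rather than having cached every intermediate slot.
fn state_at_slot(boundary: &BeaconStateStub, slot: u64) -> BeaconStateStub {
    assert_eq!(boundary.slot % SLOTS_PER_EPOCH, 0);
    assert!(slot >= boundary.slot && slot < boundary.slot + SLOTS_PER_EPOCH);
    let mut state = boundary.clone();
    while state.slot < slot {
        process_slot(&mut state); // "free" relative to epoch processing
    }
    state
}
```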

let's reboot the analysis with these caps in mind instead - focusing also on the detailed use cases.

tersec commented 4 years ago

The actual goal of this issue has been achieved. The CTZ and other discussions will nonetheless serve as useful reference material later.