optakt / flow-dps

Flow Data Provisioning Service
Apache License 2.0
29 stars 13 forks source link

Optimize trie structure to use less memory #507

Open Ullaakut opened 2 years ago

Ullaakut commented 2 years ago

We dump the entire values with contracts and everything into the trie, while we only need hashes of those values since we only use the trie for checking whether or not values exist in it, so this would reduce both memory usage and most likely increase indexing speed. Also, branch nodes currently take a lot of memory which could be freed up by using extension nodes instead.

Ullaakut commented 2 years ago

Status report:

For now, I've had to duplicate the code of the following flow-go packages into our own repository and use our implementation instead.

Currently, our forest package contains a mix of our previous forest and the forest type used by FlowGo, which we also use when reading from checkpoints at the moment. I'm working on merging those together to have a forest which contains tries with only hashes and no payloads, so that it takes less space in memory, and a map of paths/payloads which gets cleaned up whenever we call forest.Reset() like for our current implementation.

What is slightly challenging about it is that all of the code from FlowGo relies on reading payloads from the nodes directly rather than the forest, so unfortunately making this work requires quite a bit of refactoring/redesigning. I'm also taking the opportunity to strip down unnecessary elements from their package for our use-case.

The current implementation on master, when indexing the snapshot on my machine, takes 6.3GB of memory. Thhe memory usage just increases linearly as the indexing process moves forward, since the tree gets larger and larger and we never flush the payloads from the nodes. Hopefully the new implementation will consequently reduce the memory usage this brings.

Ideally once we get this working, we can even replace most of the trie implementation with Max's to improve performance further and simplify the code.

Ullaakut commented 2 years ago

Removing the payloads from leaf nodes results in a substantial memory usage reduction:

Flow Go Implementation Memory Usage

Original

Our Implementation Memory Usage

Improved

Further improvements can be made and new benchmarks will be added when this is done.