optakt / flow-dps

Flow Data Provisioning Service
Apache License 2.0

Improve trie structure to reduce memory usage #513

Closed Ullaakut closed 2 years ago

Ullaakut commented 2 years ago

Goal of this PR

This PR implements #507

Checklist

Todo

Tests

All covered.

Does the PR require documentation to be added or updated?

Maybe in the future.

Misc

Ullaakut commented 2 years ago

Status Report

My current implementation uses a single forest type that covers both what we need in the mapper, namely tracing the steps between multiple state commitments and their respective tries, and the features Flow Go needs from a Forest.

All node/trie/forest flattening is also reimplemented in our package, in a way that fits our use case. The problem is that the Flow Go checkpoints contain nodes encoded in their format, not ours. Their checkpoints have no extension nodes or anything like that; they just contain standard tries full of branches.
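To make this more concrete, here is a minimal sketch of what such a forest type could look like: a structure keyed by state commitment that hands back the corresponding trie, so the mapper can step from one commitment to the next. The names (`Forest`, `Commit`, `Trie`) are illustrative placeholders, not the actual types in this PR.

```go
package forest

import "fmt"

// Commit is a state commitment identifying one trie in the forest.
type Commit [32]byte

// Trie stands in for the trie implementation used by the package.
type Trie struct {
	// root node, payload references, etc. omitted in this sketch
}

// Forest indexes every known trie by its state commitment, which is the
// lookup the mapper needs to step from one commitment to the next.
type Forest struct {
	tries map[Commit]*Trie
}

// New returns an empty forest.
func New() *Forest {
	return &Forest{tries: make(map[Commit]*Trie)}
}

// Add registers a trie under its state commitment.
func (f *Forest) Add(commit Commit, t *Trie) {
	f.tries[commit] = t
}

// Trie returns the trie for a given state commitment.
func (f *Forest) Trie(commit Commit) (*Trie, error) {
	t, ok := f.tries[commit]
	if !ok {
		return nil, fmt.Errorf("no trie for commit %x", commit)
	}
	return t, nil
}
```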

Potential solutions include:

* Build the trie as-is from the checkpoint, and therefore save no memory (this is a bad solution, but it's there) 🎉
* Only look at leaves when reading the checkpoint, and reinsert them one by one into a new trie so that the new trie has the optimal structure ❤️
* Build the trie as-is from the checkpoint, and optimize it afterwards by trimming unnecessary branches (a somewhat complex operation, but it could theoretically be done progressively while the trie is in use) 🚀

If you don't want to phrase your opinion but want to "vote" on your favorite solution, feel free to do so using the GH reaction that corresponds to the outlined proposal.

ricardomgoncalves commented 2 years ago

> Status Report
>
> My current implementation uses a single forest type that covers both what we need in the mapper, namely tracing the steps between multiple state commitments and their respective tries, and the features Flow Go needs from a Forest.
>
> All node/trie/forest flattening is also reimplemented in our package, in a way that fits our use case. The problem is that the Flow Go checkpoints contain nodes encoded in their format, not ours. Their checkpoints have no extension nodes or anything like that; they just contain standard tries full of branches.
>
> Potential solutions include:
>
> * Build the trie as-is from the checkpoint, and therefore save no memory (this is a bad solution, but it's there) 🎉
> * Only look at leaves when reading the checkpoint, and reinsert them one by one into a new trie so that the new trie has the optimal structure ❤️
> * Build the trie as-is from the checkpoint, and optimize it afterwards by trimming unnecessary branches (a somewhat complex operation, but it could theoretically be done progressively while the trie is in use) 🚀
>
> If you don't want to phrase your opinion but want to "vote" on your favorite solution, feel free to do so using the GH reaction that corresponds to the outlined proposal.

From what I read, I think the question comes down to whether a long loading phase is acceptable: if it is, ❤️; otherwise, 🚀. I think ❤️ would be simple enough and would bring more benefits.

Maelkum commented 2 years ago

Currently I'm leaning towards the second option. Using the third option would basically mean that, in total, the memory requirements would remain the same as they are now, right? Even though the memory usage would drop eventually, one would still need to have that memory available for the initial phase?

What are the downsides of ❤️ ? (This is the closest I've ever been to writing a country song 😄 )

Ullaakut commented 2 years ago

> Currently I'm leaning towards the second option. Using the third option would basically mean that, in total, the memory requirements would remain the same as they are now, right? Even though the memory usage would drop eventually, one would still need to have that memory available for the initial phase?
>
> What are the downsides of ❤️ ? (This is the closest I've ever been to writing a country song 😄 )

It would mean that initially we load a large trie and progressively trim it down. Basically, the trade-off is more memory usage in exchange for startup speed. For the second solution, the downside is that it might be a bit slower to start up, since instead of just loading nodes we need to recreate the trie completely. We don't need to recompute hashes, however, so it might not be that bad.

Another issue with solution #2 is that checkpoints contain a forest of tries. If there are multiple tries, they are stored as a slice of tries and a slice of nodes, and currently the only way to know which nodes belong to which trie is to build the tries: nodes carry indices pointing to their left and right children, and each trie only has the index of its root node in the node slice. If we want to read only the leaves and insert them, we need to figure out a way to know which trie they belong to. This is what I'm planning on trying to figure out this morning :p
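For illustration, here is a rough sketch of how leaves could be attributed to their trie in such a flattened layout: walk each trie from its root index through the child indices and collect the leaves. The type names and the "index 0 means no child" convention are assumptions for this sketch, not the actual flow-go encoding.

```go
package checkpoint

// FlatNode is one encoded node in the checkpoint's node slice. In this sketch
// index 0 stands for "no child", which is an assumption about the layout.
type FlatNode struct {
	LIndex  uint64 // index of the left child in the node slice
	RIndex  uint64 // index of the right child in the node slice
	Path    []byte // only set on leaves
	Payload []byte // only set on leaves
}

// FlatTrie only knows the index of its root node in the shared node slice.
type FlatTrie struct {
	RootIndex uint64
}

// LeavesOf walks a single trie from its root index and collects its leaves,
// which is one way to know which leaves belong to which trie before
// reinserting them into a freshly built, optimally structured trie.
func LeavesOf(nodes []FlatNode, t FlatTrie) []FlatNode {
	var leaves []FlatNode
	stack := []uint64{t.RootIndex}
	for len(stack) > 0 {
		index := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if index == 0 {
			continue // no child at this position
		}
		node := nodes[index]
		if node.LIndex == 0 && node.RIndex == 0 {
			leaves = append(leaves, node)
			continue
		}
		stack = append(stack, node.LIndex, node.RIndex)
	}
	return leaves
}
```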

Ullaakut commented 2 years ago

Status Report

In the end, the chosen solution was to read all of the nodes and then, one by one, read the leaves from each trie and rebuild a new trie from those leaves. This required the following:

This seems to work; however, another issue came up: creating a new trie from those leaves without having their payloads in the payload store means that, if new leaves are added in the future and we need to recompute the hash of a moved node, we might not be able to, which would result in a fatal error. Because of this, we need to populate our store while we read the checkpoint. Encoded leaves contain payloads, which can be inserted into the store using the leaf's hash as the key.
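As a rough sketch of what populating the store during checkpoint reading could look like: as each leaf is decoded, save its payload under the leaf's hash. The `Store` interface and `Leaf` type below are assumptions made for illustration only.

```go
package checkpoint

// Store is an assumed interface for the payload store keyed by leaf hash.
type Store interface {
	Save(hash [32]byte, payload []byte) error
}

// Leaf is an assumed decoded leaf carrying its hash and payload.
type Leaf struct {
	Hash    [32]byte
	Payload []byte
}

// populateWhileReading drains leaves as the checkpoint reader decodes them,
// saving each payload under the leaf's hash so that recomputing the hash of
// a moved node later can always find the payload it needs.
func populateWhileReading(store Store, leaves <-chan Leaf) error {
	for leaf := range leaves {
		if err := store.Save(leaf.Hash, leaf.Payload); err != nil {
			return err
		}
	}
	return nil
}
```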

This makes processing a checkpoint an order of magnitude or two slower than before. When we didn't process the payloads, processing a 30 GB checkpoint took about 20 minutes on my machine, but now that we need to write payloads (in memory at first, then periodically/on eviction to disk), the process takes over 8 hours (I say over 8 because I stopped it at that point; it was almost done processing the nodes and inserting things on disk, so I assume it would have taken 9-10 hours at most).

This, plus the fact that it takes quite a lot of memory to load all of the nodes from the checkpoint, led me to believe that my workflow's bottleneck is now that the checkpoint I have is too large. I need a smaller one to process, which still contains realistic payloads and a complete trie. Otherwise, I'll have to wait 10 hours between each test run of my changes, which isn't great since what I'm currently working on is what happens after we process all of the nodes, the longest part of the process.

In order to speed things up, I'll try bundling multiple writes together in a single transaction, the same way we do in the Flow DPS. This creates a potentially fatal edge case that will need to be dealt with, however: if a payload is part of a hanging transaction, waiting to be written to disk, it is temporarily no longer accessible, and if a new insertion requires it, it might not find this payload and therefore not be able to recompute the node hash, which would result in a crash. We could prevent this in many ways.
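One possible shape for this, sketched below under the assumption that the store is backed by Badger: buffer payloads in an in-memory pending map, flush them in a single transaction, and have reads check the pending map before hitting disk, so a payload that is part of a hanging batch stays visible to new insertions. Names and structure are an assumption about how the batching could look, not the actual DPS code.

```go
package store

import (
	"fmt"
	"sync"

	"github.com/dgraph-io/badger/v2"
)

// BatchedStore buffers payload writes and flushes them in one transaction,
// while keeping pending payloads readable from memory.
type BatchedStore struct {
	db      *badger.DB
	mu      sync.RWMutex
	pending map[[32]byte][]byte // payloads buffered until the next flush
}

// NewBatchedStore wraps an already opened Badger database.
func NewBatchedStore(db *badger.DB) *BatchedStore {
	return &BatchedStore{db: db, pending: make(map[[32]byte][]byte)}
}

// Save buffers a payload under the leaf hash; it is written on the next Flush.
func (s *BatchedStore) Save(hash [32]byte, payload []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pending[hash] = payload
}

// Retrieve checks the pending batch before the database, so a payload that is
// part of a hanging (not yet committed) batch remains visible to insertions.
func (s *BatchedStore) Retrieve(hash [32]byte) ([]byte, error) {
	s.mu.RLock()
	if payload, ok := s.pending[hash]; ok {
		s.mu.RUnlock()
		return payload, nil
	}
	s.mu.RUnlock()

	var payload []byte
	err := s.db.View(func(txn *badger.Txn) error {
		item, err := txn.Get(hash[:])
		if err != nil {
			return fmt.Errorf("could not get payload: %w", err)
		}
		payload, err = item.ValueCopy(nil)
		return err
	})
	return payload, err
}

// Flush writes every buffered payload in a single transaction and clears the
// pending batch on success.
func (s *BatchedStore) Flush() error {
	s.mu.Lock()
	defer s.mu.Unlock()

	err := s.db.Update(func(txn *badger.Txn) error {
		for hash, payload := range s.pending {
			key := hash // copy so the slice below does not alias the loop variable
			if err := txn.Set(key[:], payload); err != nil {
				return err
			}
		}
		return nil
	})
	if err != nil {
		return err
	}
	s.pending = make(map[[32]byte][]byte)
	return nil
}
```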

FIXMEs left

To be done later

Ullaakut commented 2 years ago

Status Report

I am currently working on the following fixes:

In Progress

Forest tests. There is a lot of work to do on the forest side; as I said before, this is the most volatile part of the PR at the moment.
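As a small sketch of the kind of case such forest tests could cover, reusing the illustrative `Forest` type from the earlier sketch (this is not the PR's actual test suite): a trie added under a commitment can be retrieved again, and an unknown commitment is an error.

```go
package forest

import "testing"

// TestForestAddAndTrie covers the basic round trip through the forest.
func TestForestAddAndTrie(t *testing.T) {
	f := New()

	commit := Commit{0x01}
	want := &Trie{}
	f.Add(commit, want)

	got, err := f.Trie(commit)
	if err != nil {
		t.Fatalf("could not retrieve trie: %v", err)
	}
	if got != want {
		t.Fatalf("retrieved trie does not match the inserted one")
	}

	if _, err := f.Trie(Commit{0xFF}); err == nil {
		t.Fatal("expected an error for an unknown commitment")
	}
}
```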

Ullaakut commented 2 years ago

As Faye highlighted, there is currently an issue with this implementation: upon inserting new elements into a trie, the original trie gets mutated even when a new trie is created from it, while it should not be. This might be why we have a mismatch on the mainnet-14 checkpoint, and it's a major difference between the two implementations. It needs to be fixed before the PR can be merged.
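For reference, a minimal sketch of the copy-on-write insertion behaviour the fix would need: copy only the nodes along the insertion path and return a new root, so the original trie is never mutated and unchanged subtrees are shared. The types here are simplified placeholders, not the PR's actual node structure.

```go
package trie

// Node is a simplified binary trie node for this sketch.
type Node struct {
	Left    *Node
	Right   *Node
	Path    []byte
	Payload []byte
}

// Insert returns the root of a new trie containing the payload at the given
// path. It copies only the nodes along the insertion path and shares every
// other subtree with the original, which therefore stays untouched.
func Insert(node *Node, path []byte, depth int, payload []byte) *Node {
	if depth == len(path)*8 {
		// Reached the leaf position: create a fresh leaf instead of mutating.
		return &Node{Path: path, Payload: payload}
	}

	clone := &Node{}
	if node != nil {
		*clone = *node // shallow copy, untouched child pointers are shared
	}
	if bitSet(path, depth) {
		clone.Right = Insert(clone.Right, path, depth+1, payload)
	} else {
		clone.Left = Insert(clone.Left, path, depth+1, payload)
	}
	return clone
}

// bitSet reports whether the bit at the given index of the path is set,
// reading bits from the most significant bit of each byte.
func bitSet(path []byte, index int) bool {
	byteIndex := index / 8
	bitIndex := uint(index % 8)
	return path[byteIndex]&(byte(1)<<(7-bitIndex)) != 0
}
```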