There are two main reasons for the limits:

1. It's possible that network usage could greatly increase the size of the BucketList. Without these limits, the volume of entries being evicted could overwhelm downstream systems.
2. Unbounded scans could increase ledger close time, since core needs to block on random disk IO when scanning for eviction.

While these config limits do open the door to an iterator eviction loop, where the eviction scan gets stuck on a given level, I view this as a less severe problem than increased close time or downstream system strain. When the iterator cannot advance to the next level because the current level is too large, we still evict entries on the "stuck" level. This is not optimal, since we will not evict entries on lower levels, but getting stuck does not mean that eviction stops altogether. Additionally, the cost of not evicting is relatively low: while the BucketList size will not be reduced, the validator otherwise behaves correctly, since from the perspective of a TX there is no difference between the expired and the evicted state.
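To make the iterator behavior concrete, here is a minimal sketch of the movement described above. All names here (`EvictionIterator`, `evictionScanStep`, `maxScanBytesPerLedger`) are illustrative, not the actual stellar-core API: each ledger scans at most a fixed number of bytes from the current position, and when the level spills before the scan finishes, the iterator resets to the start of the new bucket, so the scan keeps evicting on the stuck level but never reaches lower levels.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch only; names and types are not the real stellar-core API.
struct EvictionIterator
{
    std::size_t level = 0;         // BucketList level currently being scanned
    std::uint64_t byteOffset = 0;  // position within that level's bucket
};

// One ledger's worth of eviction scanning. bucketSizes[i] is the size of the
// bucket currently scanned at level i, and `spilled` says whether that level
// spilled this ledger. Evicting the expired entries encountered during the
// scan is elided; only the iterator movement is modeled.
void
evictionScanStep(EvictionIterator& iter,
                 std::vector<std::uint64_t> const& bucketSizes, bool spilled,
                 std::uint64_t maxScanBytesPerLedger)
{
    if (spilled)
    {
        // The level spilled before the scan finished: restart on the new
        // bucket. Entries we already passed were still evicted, but we never
        // reach lower levels while this keeps happening (the "stuck" level).
        iter.byteOffset = 0;
    }

    iter.byteOffset += maxScanBytesPerLedger;

    if (iter.byteOffset >= bucketSizes[iter.level])
    {
        // Finished the whole bucket within one spill period: advance.
        iter.level = std::min(iter.level + 1, bucketSizes.size() - 1);
        iter.byteOffset = 0;
    }
}
```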
The starting network parameters are set such that the BucketList could grow by approximately 3-4x before getting stuck in this loop (this is a rough estimate based on the total BucketList size; one abnormally large Bucket on a lower level could trigger the iterator loop even if the entire BucketList did not grow very much). I've also added an "incomplete-scan" metric that keeps track of any Buckets that are large enough that we get stuck on that level. The intention is that whenever this counter increases, we should vote to increase the scan bounds.
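As a rough sketch of when such a counter would tick (again, made-up names, not the actual metric implementation): a level cannot be fully scanned whenever its bucket is larger than the scan rate times the number of ledgers between that level's spills.

```cpp
#include <cstdint>

// Hypothetical check, evaluated when a level spills: if the bucket could not
// have been fully scanned since the previous spill, the scan was incomplete
// and the counter should be bumped so operators know to vote for larger
// scan bounds.
bool
scanWasIncomplete(std::uint64_t bucketSizeBytes,
                  std::uint64_t maxScanBytesPerLedger,
                  std::uint64_t ledgersBetweenSpills)
{
    return bucketSizeBytes > maxScanBytesPerLedger * ledgersBetweenSpills;
}
```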
I'm not advocating against the limits; those are obviously needed to keep close times sane. The point of the issue is to analyze the estimated size of each bucket level as well as its spill frequency, and determine how much buffer we have before this becomes an issue (if some levels have little or no buffer, that is a problem). Ideally, we should have estimates for each level, given the current BucketList size plus an estimated future size (e.g. 3x).
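Something along these lines could produce the per-level estimates. The spill period of roughly 2 * 4^level ledgers is an assumption standing in for the real BucketList spill schedule, and all byte figures and the scan bound below are placeholders, not measured pubnet values.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Back-of-envelope estimate of how much headroom each scanned level has
// before the eviction scan can no longer finish a bucket between spills.
// All inputs are assumed values for illustration only.
int
main()
{
    std::uint64_t const maxScanBytesPerLedger = 100'000; // assumed scan bound
    double const growthFactor = 3.0; // "estimated future size, e.g. 3x"
    // Assumed current bucket sizes per scanned level, in bytes.
    std::vector<std::uint64_t> const currentBucketSizes = {
        1'000'000, 4'000'000, 16'000'000, 64'000'000, 256'000'000};

    for (std::size_t level = 0; level < currentBucketSizes.size(); ++level)
    {
        // Assumed spill cadence: level i's bucket is replaced roughly every
        // 2 * 4^i ledgers; replace with the actual BucketList schedule.
        std::uint64_t spillPeriodLedgers = 2ULL * (1ULL << (2 * level));
        std::uint64_t scannableBytes = spillPeriodLedgers * maxScanBytesPerLedger;
        double projectedSize = growthFactor * currentBucketSizes[level];
        double headroom = scannableBytes / projectedSize;

        std::printf("level %zu: %llu scannable bytes per spill period, "
                    "projected bucket %.0f bytes, headroom %.2fx%s\n",
                    level, static_cast<unsigned long long>(scannableBytes),
                    projectedSize, headroom,
                    headroom < 1.0 ? "  <-- would get stuck" : "");
    }
    return 0;
}
```

Plugging in the measured per-level sizes and the configured scan bound would show exactly which levels have little or no buffer under a 3x growth scenario.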
> Additionally, the cost of not evicting is relatively low: while the BucketList size will not be reduced, the validator otherwise behaves correctly, since from the perspective of a TX there is no difference between the expired and the evicted state.
I'm not sure I agree. The whole point of eviction scans is to prevent state bloat, and if there's a configuration of the BucketList where we can't evict entries at all, that sounds like a design flaw.
> The intention is that whenever this counter increases, we should vote to increase the scan bounds.
This would only work as long as validators can handle an increased scan bound (due to the same close-time/downstream-system overload reasons). There might be other ways to handle this "stuck loop": for example, maybe we don't need to stay on the same bucket, and can instead move on? Eventually the entries that were stuck on a certain level will get spilled, and we increase the likelihood of eviction by iterating on lower levels, since those spill at lower frequencies (this depends on how likely we are to finish scanning a lower level, though).
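A sketch of that alternative, with the same caveat that the names are made up: instead of resetting to offset 0 on the same level when it spills, the iterator skips ahead to the next level and comes back to the stuck one on a later pass.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative alternative to resetting on the stuck level: when a level
// spills before we finish scanning it, give up on it for now and move to the
// next (lower, less frequently spilled) level rather than rescanning from 0.
struct EvictionIterator
{
    std::size_t level = 0;
    std::uint64_t byteOffset = 0;
};

void
onLevelSpilled(EvictionIterator& iter, std::size_t numScannedLevels)
{
    // Skip ahead instead of restarting; wrap around once we run out of
    // levels so the skipped level gets another chance on a later pass.
    iter.level = (iter.level + 1) % numScannedLevels;
    iter.byteOffset = 0;
}
```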
Again, I don't think this would be an issue on testnet, but I think it's an important question to answer for pubnet.
We discussed having this existing metric (stuck bucket) in Grafana. Having a Prometheus alert will also help.
We should add this metric to Grafana and make a Prometheus alert for it.
Currently, we scan a certain number of bytes every ledger close. In addition, whenever the BucketList spills, we reset the scan iterator back to 0 and begin scanning the new bucket. This means that, depending on the size of the bucket, we might not finish scanning the bucket before it spills. This is problematic: if a bucket consistently spills before the scan completes, the iterator never advances past that level and entries on lower levels are never evicted.
We should analyze the expected bucket sizes, as well as the eviction scan size, and determine whether there's sufficient buffer to finish scanning buckets so we do not get into an iterator reset loop. We probably won't have this issue on testnet, since buckets there tend to be small.