Skydev0h opened 4 years ago
There is actually another thing that is gradually rising in space usage, at the very bottom of the graph. It may be the culprit (I don't see much of a rise in ValidatorSessionDescriptionImpl, it is still 2048+2048, while the bottom one slowly rises). The arena near the top is slowly becoming a colosseum too.
The last profile before the OOM reaper indicates that ValidatorSessionDescriptionImpl still uses the same 4096 MB of memory in total, while the box at the bottom (rocksdb::BlockFetcher::PrepareBufferForBlockFromFile) keeps rising, slowly but steadily. Judging from how closely it matches the pace of the overall memory growth, I think it may be the culprit.
Focusing on that little box, these are the changes over time: And the colosseum: This looks like a typical memory leak, either in the third-party library or in improper usage of its results. The pace of growth of this allocation nearly matches the pace of the overall memory increase. Due to some GC-like behavior it is difficult to judge from the memory graph alone; I will analyze it later.
A memory profile of the validator over night. Despite the sawtooth, its lower edge steadily rises over time, and that slow and steady rise has nearly the same pace as the growth of that last block I found.
Approximately 200 MB over 14500 seconds (~4 hours), that is about 50 MB per hour. At that rate it would take, for example, just 13.6 days to fill 16 GB of RAM, and that is not counting the sudden spikes that are never reclaimed later.
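A quick back-of-the-envelope check of those numbers (this assumes the leak rate stays linear, which the spikes obviously violate):

```cpp
// Rough check of the figures quoted above (assumes a constant leak rate).
constexpr double leak_mb_per_hour  = 200.0 / (14500.0 / 3600.0);  // ~49.7 MB/h
constexpr double hours_to_fill_16gb = 16.0 * 1024.0 / 50.0;       // ~328 h
constexpr double days_to_fill_16gb  = hours_to_fill_16gb / 24.0;  // ~13.6 days
```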
An Excel graph with a grid and a moving average is more representative: BTW, to be more precise, that RSS drop occurs roughly every 236 seconds. At least that is the horizontal grid interval.
I may have found a fix, but I still need several days of testing. Maybe not. But it looks a little more manageable, with some more shelves. Before: After:
@Skydev0h Glad to see someone tackling this. The workaround for now is to use a systemd service to restart the validator engine whenever it crashes due to this leak. It's great for testing failure of the core system :)
You're talking about 13.6 days, but I've had very mixed results. Sometimes it takes an hour, sometimes a week. It does not correlate with processing incoming messages; you can run a 'master node' (with its own zerostate) and just watch it eat itself.
@ton-blockchain I seem to have stabilized, to some extent, at least one possible leak factor. It needs more testing, but the graph looks more stable now (at least it keeps memory usage stable for longer). This may be a first step, but not the last. Please notice how it stopped constantly growing in memory after the cache (the 1 GB NewLRUCache(1 << 30), maybe?) got filled up, and now possibly leaks for some other reason (I think those sudden +1 GB and +2 GB spikes are the arena allocations for ValidatorSessionDescriptionImpl mentioned earlier). I am not even making a PR because it is a one-line change. This may theoretically degrade performance under some circumstances, but otherwise index and filter blocks are stored on the heap, are never evicted, and may consume memory without limit. It may also be reasonable to set cache_index_and_filter_blocks_with_high_priority, but I have not yet observed performance problems or an increased number of SLOW messages in the validator logs.
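For reference, this is the kind of one-line change I mean, shown here as a standalone sketch of the standard RocksDB options rather than the actual place it lives in the TON code base:

```cpp
#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options make_options() {
  rocksdb::Options options;
  rocksdb::BlockBasedTableOptions table_options;
  // The 1 GB LRU block cache mentioned above.
  table_options.block_cache = rocksdb::NewLRUCache(1 << 30);
  // With this set, index and filter blocks live inside the bounded block cache
  // instead of an unbounded, never-evicted heap area.
  table_options.cache_index_and_filter_blocks = true;
  // Optionally keep them from being evicted too aggressively:
  // table_options.cache_index_and_filter_blocks_with_high_priority = true;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```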
@hortonelectric does the master node do validation tasks, or is it just a simple full node? I did not observe memory leaks for a simple full node.
Yes, I am talking about a validator... In my regtest setup I use a single node that does everything, hence "master".
Got the same problem: the memory leaks in waves.
When validating, validator-engine memory usage rises gradually over time until the VE gets reaped by the OOM killer.
I have tried to collect some evidence with jemalloc (kudos to Arseny for the idea). For now I suspect that some ValidatorSessionDescriptionImpl instances are unnecessarily immortal and accumulate over time.
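For anyone who wants to reproduce this kind of profile, a minimal sketch of triggering a heap dump from code, assuming the binary is linked against a jemalloc built with profiling enabled and run with something like `MALLOC_CONF="prof:true"` (the wrapper function name is mine, and the symbol may be prefixed depending on how jemalloc was built):

```cpp
#include <jemalloc/jemalloc.h>

// Writes a heap profile named <prof_prefix>.<pid>.<seq>.heap; successive dumps
// can be diffed with the jeprof tool to see which call sites keep growing.
void dump_heap_profile() {
  mallctl("prof.dump", nullptr, nullptr, nullptr, 0);
}
```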
Another thing that caught my eye is that ValidatorSessionImpl has a start_up method but no tear_down method. It should be noted that half of those Descs are spawned in ValidatorSessionImpl::start_up calling ValidatorSessionState::move_to_persistent, and the other half are created by create_actor.
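For context, a rough sketch of the actor lifecycle pattern in question (ExampleSession is a made-up name, not the actual TON class): the td::actor framework calls start_up() once an actor is created and tear_down() before it is destroyed, so anything allocated in start_up() has a natural place to be released.

```cpp
#include "td/actor/actor.h"

// Illustrative only: shows where session-scoped allocations could be freed.
class ExampleSession : public td::actor::Actor {
 public:
  void start_up() override {
    // acquire resources here (e.g. build a description / persistent state)
  }
  void tear_down() override {
    // release them here; ValidatorSessionImpl currently has no such override
  }
};
```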
Some food for thought (my attempts at building some deltas):