solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

validator uses more memory than expected when metrics are enabled #24319

Open sakridge opened 2 years ago

sakridge commented 2 years ago

Problem

80GB difference between RUST_LOG=solana=info and RUST_LOG=solana=warn for RPC nodes.

Proposed Solution

Relevant question: is the channel that the metrics are being sent over really large? Could there be a large number of outstanding sends, each taking some memory and state?

Debug and find out why. Possibly reduce the occurrence of spammy metrics.
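
As a rough illustration of that channel-backlog suspicion (a std-only sketch; DataPoint, its field layout, and the 50,000-point count are illustrative, not the real solana_metrics types), an unbounded channel holds every outstanding send in memory until the agent drains it:

    // A minimal, std-only sketch: with an unbounded channel, every submitted
    // point stays resident until the agent thread drains it, so a slow
    // consumer shows up directly as memory growth.
    use std::sync::mpsc;
    use std::thread;

    struct DataPoint {
        _name: &'static str,
        _fields: Vec<(String, i64)>,
    }

    fn main() {
        let (sender, receiver) = mpsc::channel::<DataPoint>();

        // Producer: submits points far faster than the agent flushes them.
        let producer = thread::spawn(move || {
            for i in 0..50_000 {
                let point = DataPoint {
                    _name: "spammy_stat",
                    _fields: vec![("value".to_string(), i)],
                };
                // send() on an unbounded channel never blocks and never fails
                // while the receiver is alive, so the backlog just grows.
                sender.send(point).ok();
            }
        });
        producer.join().unwrap();

        // By the time the "agent" gets around to draining, every outstanding
        // send is still sitting in memory.
        let backlog = receiver.try_iter().count();
        println!("outstanding points held in memory: {backlog}");
    }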

HaoranYi commented 2 years ago

Initial investigation showed that the gossip crds stats log pretty heavily.

The following pull request addressed this. We trimmed down stats and changed the log level to trace. https://github.com/solana-labs/solana/pull/24132

Next, I will start measuring the effect of logging on memory usage on mainnet-beta.

HaoranYi commented 2 years ago

@behzadnouri It looks like, after removing the gossip crds stats, the following is the heaviest log. Can we reduce the log level for this?

https://github.com/solana-labs/solana/blob/255ef66a27ad626199230f369d8c3e8afbb45bbf/core/src/window_service.rs#L135

behzadnouri commented 2 years ago

@behzadnouri It looks like, after removing the gossip crds stats, the following is the heaviest log. Can we reduce the log level for this? https://github.com/solana-labs/solana/blob/255ef66a2/core/src/window_service.rs#L135

Those metrics were originally added by @carllin in https://github.com/solana-labs/solana/pull/15694 and then moved from retransmit-stage to window-service in https://github.com/solana-labs/solana/pull/19233. Please also confirm with Carl.

But before that, have you been able to reproduce the 80GB memory increase that the Triton guys are reporting? I have not seen such a change in memory when turning logs off or on, so this might be something very RPC-specific and a result of RPC request handling; in which case we should first look into RPC logs and metrics.

HaoranYi commented 2 years ago

I profiled a normal validator run on mainnet-beta with info and warn log levels. I don't see much difference in memory usage. At the info level, we are submitting about 1000 points every 10s. At the warn level, we barely submit any points. Does anyone know how to start an RPC node?

behzadnouri commented 2 years ago

Does anyone know how to start an RPC node?

Look at the rpc and accounts-index related flags among the validator flags: --full-rpc-api, --no-voting, --enable-rpc-transaction-history, --account-index, ... https://github.com/solana-labs/solana/blob/e3ef0741b/validator/src/main.rs#L461-L1839

But I suspect that without an actual RPC request load, just spinning up an RPC node will not reproduce the issue.

HaoranYi commented 2 years ago

My gut feeling is in line with @sakridge's. From my profiling, I don't think we are leaking any memory for the data points. But the channel itself could be growing if the agent is not able to catch up...

We could address this by limiting the queue size and dropping any submissions when the queue is full.

https://github.com/solana-labs/solana/pull/24341
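
As a rough sketch of that bounded-queue idea (not the PR's actual implementation: std's sync_channel stands in for the real channel type, and MAX_QUEUE_SIZE, DataPoint, and submit are illustrative names), new points are dropped with a warning once the queue is full, which bounds memory at the cost of losing data under load:

    use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

    const MAX_QUEUE_SIZE: usize = 2_000; // illustrative cap

    struct DataPoint {
        name: &'static str,
    }

    fn submit(sender: &SyncSender<DataPoint>, point: DataPoint) {
        match sender.try_send(point) {
            Ok(()) => {}
            Err(TrySendError::Full(dropped)) => {
                // Shedding load here bounds memory; some points are lost
                // whenever the agent cannot keep up.
                eprintln!("metric queue full, point dropped: {}", dropped.name);
            }
            Err(TrySendError::Disconnected(_)) => {
                eprintln!("metrics agent has shut down");
            }
        }
    }

    fn main() {
        let (sender, receiver) = sync_channel::<DataPoint>(MAX_QUEUE_SIZE);
        for _ in 0..MAX_QUEUE_SIZE + 10 {
            submit(&sender, DataPoint { name: "example_stat" });
        }
        // The queue never holds more than MAX_QUEUE_SIZE points.
        println!("queued: {}", receiver.try_iter().count());
    }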

sakridge commented 2 years ago

@behzadnouri It looks like, after removing the gossip crds stats, the following is the heaviest log. Can we reduce the log level for this? https://github.com/solana-labs/solana/blob/255ef66a2/core/src/window_service.rs#L135

Those metrics were originally added by @carllin in #15694 and then moved from retransmit-stage to window-service in #19233. Please also confirm with Carl.

But before that, have you been able to reproduce the 80GB memory increase that the Triton guys are reporting? I have not seen such a change in memory when turning logs off or on, so this might be something very RPC-specific and a result of RPC request handling; in which case we should first look into RPC logs and metrics.

Joe and Axl were able to reproduce it. The results for RPC were much worse than non-RPC, although they said the RPC node was not actually serving RPC.

Usually for RPC, one would pass these flags to enable the secondary account indexes and transaction logging:

--account-indexes program-id
--account-indexes spl-token-mint
--account-indexes spl-token-owner
--enable-rpc-transaction-history
--enable-cpi-and-log-storage
--enable-extended-tx-metadata-storage # maybe not on all RPC?

HaoranYi commented 2 years ago

--account-indexes program-id --account-indexes spl-token-mint --account-indexes spl-token-owner --enable-rpc-transaction-history --enable-cpi-and-log-storage --enable-extended-tx-metadata-storage # maybe not on all RPC?

OK. Let me try it.

HaoranYi commented 2 years ago

@t-nelson found another spammy log from RPC to reduce.

https://github.com/solana-labs/solana/pull/24335

HaoranYi commented 2 years ago

With https://github.com/solana-labs/solana/pull/24341, I am able to get a stable run of an RPC node at the info level, with memory usage stable at around 270GB.

But I do see the metrics queue fill up from time to time (2000 max), as in the case below.

[2022-04-14T17:14:49.036055753Z ERROR solana_metrics::metrics] haha submitting 2561 points
[2022-04-14T17:14:49.040787510Z ERROR solana_metrics::metrics] haha command queue size: 1
[2022-04-14T17:14:49.040845571Z ERROR solana_metrics::metrics] haha command queue size: 2
[2022-04-14T17:14:49.040880957Z ERROR solana_metrics::metrics] haha command queue size: 3
[2022-04-14T17:14:49.040903809Z ERROR solana_metrics::metrics] haha command queue size: 4
[2022-04-14T17:14:49.040934786Z ERROR solana_metrics::metrics] haha command queue size: 5
[2022-04-14T17:14:49.040983423Z ERROR solana_metrics::metrics] haha command queue size: 6
[2022-04-14T17:14:49.041009167Z ERROR solana_metrics::metrics] haha command queue size: 7
[2022-04-14T17:14:49.041039374Z ERROR solana_metrics::metrics] haha command queue size: 8
[2022-04-14T17:14:49.041062290Z ERROR solana_metrics::metrics] haha command queue size: 9
[2022-04-14T17:14:49.041088574Z ERROR solana_metrics::metrics] haha command queue size: 10
[2022-04-14T17:14:49.041111270Z ERROR solana_metrics::metrics] haha command queue size: 11
[2022-04-14T17:14:49.041135302Z ERROR solana_metrics::metrics] haha command queue size: 12
...
[2022-04-14T17:14:49.194224724Z ERROR solana_metrics::metrics] haha command queue size: 1997
[2022-04-14T17:14:49.194255626Z ERROR solana_metrics::metrics] haha command queue size: 1998
[2022-04-14T17:14:49.194304618Z ERROR solana_metrics::metrics] haha command queue size: 1999
[2022-04-14T17:14:49.194336753Z ERROR solana_metrics::metrics] haha command queue size: 2000
[2022-04-14T17:14:49.194395145Z WARN  solana_metrics::metrics] metric queue full, point dropped.
[2022-04-14T17:14:49.194407783Z ERROR solana_metrics::metrics] haha command queue size: 2000
[2022-04-14T17:14:49.194433316Z WARN  solana_metrics::metrics] metric queue full, point dropped.
[2022-04-14T17:14:49.194459347Z ERROR solana_metrics::metrics] haha command queue size: 2000
[2022-04-14T17:14:49.194474065Z WARN  solana_metrics::metrics] metric queue full, point dropped.
[2022-04-14T17:14:49.194487980Z ERROR solana_metrics::metrics] haha command queue size: 2000
[2022-04-14T17:14:49.194509550Z WARN  solana_metrics::metrics] metric queue full, point dropped.
[2022-04-14T17:14:49.194520645Z ERROR solana_metrics::metrics] haha command queue size: 2000
[2022-04-14T17:14:49.194525507Z WARN  solana_metrics::metrics] metric queue full, point dropped.
...

Later on, we recovered.

[2022-04-14T17:14:49.496694764Z INFO  solana_metrics::metrics] datapoint: slot_stats_tracking_complete slot=129722522i last_index=0i num_repaired=1i num_recovered=0i min_turbine_fec_set_count=0i is_full=false is_rooted=false is_dead=false
[2022-04-14T17:14:49.496714853Z INFO  solana_metrics::metrics] datapoint: slot_stats_tracking_complete slot=129722521i last_index=0i num_repaired=1i num_recovered=0i min_turbine_fec_set_count=0i is_full=false is_rooted=false is_dead=false
[2022-04-14T17:14:49.496721335Z INFO  solana_metrics::metrics] datapoint: slot_stats_tracking_complete slot=129722520i last_index=0i num_repaired=1i num_recovered=0i min_turbine_fec_set_count=0i is_full=false is_rooted=false is_dead=false

HaoranYi commented 2 years ago

https://github.com/solana-labs/solana/blob/0e7b0597db3533d304a009d4276164ddb78049df/ledger/src/slot_stats.rs#L51 @jbiseda Can we reduce the log for this?

behzadnouri commented 2 years ago

https://github.com/solana-labs/solana/blob/0e7b0597d/ledger/src/slot_stats.rs#L51

@jbiseda Can we reduce the log for this?

@HaoranYi I don't think rpc nodes emit that metric more often than other nodes, so I don't think it is the issue here.

HaoranYi commented 2 years ago

OK, makes sense. Let me log the actual points that are dropped.

jeffwashington commented 2 years ago

#24348

Is this related here?

HaoranYi commented 2 years ago

#24348

Is this related here?

No, I don't think so. It is more related to metric datapoints.

jbiseda commented 2 years ago

In case this helps, @HaoranYi, here's a sample of the top points generated from a validator on testnet. I wasn't running any particular RPC workload but used the following flags:

    --account-index program-id
    --account-index spl-token-mint
    --account-index spl-token-owner
    --enable-rpc-transaction-history

datapoints and counts:

("receive_window_num_slot_shreds", 8768), ("retransmit-stage-slot-stats", 7461), ("slot_stats_tracking_complete", 4557), ("bank-new_from_parent-heights", 3946), ("bank-timestamp-correction", 3946), ("bank-executor-cache-stats", 3946), ("bank-timestamp", 3946), ("shred_insert_is_full", 3943), ("compute_bank_stats-best_slot", 3921), ("bank_weight", 3921), ("collect_rent_eagerly", 3920), ("accounts_cache_size", 3920), ("replay-slot-stats", 3920), ("slot_repeated_writes", 3920), ("load_to_collect_rent_eagerly_scan_elapsed", 3920), ("per_program_timings", 3920), ("blocks_produced", 3843), ("validator-confirmation", 3836), ("tower-vote", 3786), ("block-commitment-cache", 3786), ("optimistic_slot", 3689), ("bank-forks_set_root", 3600), ("retransmit-first-shred", 3545), ("optimistic_slot_elapsed", 3414), ("poh_slot_timing", 2521)

and counters:

("vote-native", 11763907), ("bank-process_transactions-sigs", 1182664), ("bank-process_transactions-txs", 1181666), ("bank-process_transactions", 1177898), ("bank-process_transactions-error-instruction_error", 253095), ("bank-process_transactions-error_count", 253078), ("shrink_total_preliminary_candidate_stores", 18067), ("shrink_select_top_sparse_storage_entries-ms", 18067), ("shrink_select_top_sparse_storage_entries-seeds", 18067), ("cluster_info-purge-count", 17213), ("cluster_info-get_votes-count", 17163)

Both of these are processed through the same queue, so maybe the counters are causing congestion?

changes: git ls-remote https://github.com/jbiseda/solana.git meta-metrics

CriesofCarrots commented 2 years ago

I see a few every-slot datapoints we could probably quiet

HaoranYi commented 2 years ago

Wow. So many counters. Counters are bad for submission performance. First, they are stored in a hashtable, unlike data points, which are stored in a vector. Second, they have to be converted into data points. We should move away from using counters.
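
As a simplified sketch of that difference (the real solana_metrics agent differs in detail; AgentState, submit_point, and submit_counter are illustrative names), counters are aggregated in a hashtable keyed by name and converted into data points at flush time, while data points are simply appended to a vector:

    use std::collections::HashMap;

    struct DataPoint {
        name: String,
        value: i64,
    }

    #[derive(Default)]
    struct AgentState {
        points: Vec<DataPoint>,          // cheap: push, then drain at flush
        counters: HashMap<String, i64>,  // extra hashing plus a conversion pass
    }

    impl AgentState {
        fn submit_point(&mut self, point: DataPoint) {
            self.points.push(point);
        }

        fn submit_counter(&mut self, name: &str, count: i64) {
            *self.counters.entry(name.to_string()).or_insert(0) += count;
        }

        fn flush(&mut self) -> Vec<DataPoint> {
            let mut out = std::mem::take(&mut self.points);
            // Every counter becomes a data point anyway, so the hashtable is
            // pure overhead compared to submitting points directly.
            out.extend(self.counters.drain().map(|(name, value)| DataPoint { name, value }));
            out
        }
    }

    fn main() {
        let mut agent = AgentState::default();
        agent.submit_counter("vote-native", 42);
        agent.submit_counter("vote-native", 1); // counters re-hash on every bump
        agent.submit_point(DataPoint { name: "bank-timestamp".into(), value: 7 });
        for p in agent.flush() {
            println!("datapoint: {} value={}i", p.name, p.value);
        }
    }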

t-nelson commented 2 years ago

We should move away from using counters.

YES! They're also a pain in the ass to view 'cause each one is its own measurement in influx

HaoranYi commented 2 years ago

I have started an RPC node with the above changes. Let's see if the metrics agent can catch up after the logging reduction.

behzadnouri commented 2 years ago

@HaoranYi I think there are 2 separate issues:

  1. Spammy metrics and counters, some of which are indeed wasteful.
  2. RPC nodes taking 80GB more memory with info logs versus warn or error logs.
    • I personally, at least, have not observed such a change in memory in a non-RPC node when turning logs on or off, which makes me think the culprits are most likely not the metrics/counters, since those are independent of RPC and the accounts index.

Addressing (1) may indeed require removing some of the metrics. But jumping to fix (2) by removing frequent metrics might not be the right thing to do. I believe (2) needs more diagnosis to figure out why an RPC node behaves significantly worse.

Also, @yhchiang-sol has recently added tons of metrics to rocksdb (https://github.com/solana-labs/solana/pulls?q=is%3Apr+author%3Ayhchiang-sol+is%3Aclosed); I would guess those are relevant here as well.

carllin commented 2 years ago

Those metrics were originally added by @carllin in https://github.com/solana-labs/solana/pull/15694 and then moved from retransmit-stage to window-service in https://github.com/solana-labs/solana/pull/19233. Please also confirm with Carl.

@behzadnouri thanks for making me aware of this. I'm a bit surprised because I thought once every 2 seconds would be conservative, but I guess some really old slots could be sloshing around the network.

Yeah, I think it's fine to remove these now, since we no longer rely on this metric to determine how many shreds arrived over turbine for a given slot; we have blockstore metrics for that now in slot_stats_tracking_complete: https://github.com/steviez/solana/blob/2c49c5d7d8c3c81b3d22c1160192ed851c48462f/ledger/src/blockstore.rs#L169.

HaoranYi commented 2 years ago

Yeah, I think it's fine to remove these now, since we no longer rely on this metric to determine how many shreds arrived over turbine for a given slot; we have blockstore metrics for that now in slot_stats_tracking_complete: https://github.com/steviez/solana/blob/2c49c5d7d8c3c81b3d22c1160192ed851c48462f/ledger/src/blockstore.rs#L169.

OK. I will change it to debug logging.

yhchiang-sol commented 2 years ago

  • Spammy metrics and counters, some of which are indeed wasteful.

I also noticed that there are several places in the code where metrics are reported at a 100% sampling rate. Ideally, metrics should be properly sampled so that we get just enough samples and minimize any potential overhead.

  • RPC nodes taking 80GB more memory with info logs versus warn or error logs.

Also, @yhchiang-sol has recently added tons of metrics to rocksdb (https://github.com/solana-labs/solana/pulls?q=is%3Apr+author%3Ayhchiang-sol+is%3Aclosed); I would guess those are relevant here as well.

The default sampling rate has been reduced to 0, so it should be a no-op by default. I am working on some PRs to optimize its reporting code path and sampling rate.
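
As a minimal, std-only sketch of that sampling idea (SampledMetric, SAMPLE_INTERVAL, and the report callback are illustrative names, not the rocksdb metrics code), one cheap approach is to report only every Nth observation rather than all of them:

    use std::sync::atomic::{AtomicU64, Ordering};

    const SAMPLE_INTERVAL: u64 = 100; // report roughly 1% of observations

    struct SampledMetric {
        seen: AtomicU64,
    }

    impl SampledMetric {
        const fn new() -> Self {
            Self { seen: AtomicU64::new(0) }
        }

        fn observe(&self, value: u64, report: impl Fn(u64)) {
            // fetch_add returns the previous count; only every Nth observation
            // actually produces a data point.
            if self.seen.fetch_add(1, Ordering::Relaxed) % SAMPLE_INTERVAL == 0 {
                report(value);
            }
        }
    }

    fn main() {
        static READ_LATENCY_US: SampledMetric = SampledMetric::new();
        for i in 0..1_000 {
            READ_LATENCY_US.observe(i, |v| println!("datapoint: read_latency_us={v}"));
        }
    }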

t-nelson commented 1 year ago

This may be related to how we stuff all of the pending datapoints into a single string before submitting them.

https://github.com/solana-labs/solana/blob/b20024c7051597876604f7a50353c8a9a61bebab/metrics/src/metrics.rs#L97

We can minimize this by streaming the request body instead. body() takes an Into<Body>, which can be anything that implements std::io::Read. It should be pretty simple to build a streamer around that from a Vec<DataPoint>.
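
As a rough sketch of that streaming idea (DataPoint and the line format are simplified stand-ins for the real solana_metrics types), a std::io::Read adapter can serialize one point at a time so the full payload never has to be materialized as a single string:

    use std::io::{self, Read};

    struct DataPoint {
        name: String,
        value: i64,
    }

    struct PointStream {
        points: std::vec::IntoIter<DataPoint>,
        pending: Vec<u8>, // bytes of the current point not yet handed out
        offset: usize,
    }

    impl PointStream {
        fn new(points: Vec<DataPoint>) -> Self {
            Self { points: points.into_iter(), pending: Vec::new(), offset: 0 }
        }
    }

    impl Read for PointStream {
        fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
            // Refill the pending buffer from the next point once it is exhausted.
            if self.offset == self.pending.len() {
                match self.points.next() {
                    Some(p) => {
                        self.pending = format!("{} value={}i\n", p.name, p.value).into_bytes();
                        self.offset = 0;
                    }
                    None => return Ok(0), // end of stream
                }
            }
            let n = (self.pending.len() - self.offset).min(buf.len());
            buf[..n].copy_from_slice(&self.pending[self.offset..self.offset + n]);
            self.offset += n;
            Ok(n)
        }
    }

    fn main() {
        let points = vec![
            DataPoint { name: "bank-timestamp".into(), value: 7 },
            DataPoint { name: "optimistic_slot".into(), value: 129_722_522 },
        ];
        // Any client whose body() accepts an impl Read could pull these bytes
        // incrementally; read_to_string here is just to demonstrate the output.
        let mut stream = PointStream::new(points);
        let mut body = String::new();
        stream.read_to_string(&mut body).unwrap();
        print!("{body}");
    }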

steviez commented 1 year ago

This may be related to how we stuff all of the pending datapoints into a single string before submitting them.

https://github.com/solana-labs/solana/blob/b20024c7051597876604f7a50353c8a9a61bebab/metrics/src/metrics.rs#L97

We can minimize this by streaming the request body instead. body() takes an Into<Body>, which can be anything that implements std::io::Read. It should be pretty simple to build a streamer around that from a Vec<DataPoint>.

I wasn't subscribed to this issue, but happened to have it open in a tab, ha. In any case, I had a non-voting validator node that I was already restarting anyway, so I threw in an info! to log some accumulated lengths. With the tip of master, I do not see any calculated length exceeding 500k in the last 10 hours of runtime. Granted, there are extra DataPoints for leaders/rpc/geyser that my node isn't capturing, but I think this gives us a good ballpark.

And as another note, the metrics tell me that there are typically just under 900 points in each submission (every 10 seconds). This isn't bad, but streaming without pre-calculating the length would save us a second iteration over the Vec<DataPoint>.

@t-nelson - If you still want to leave this for some hypothetical contributor, thoughts on breaking it out into a separate issue? This issue is a year+ old, and breaking it out so there is less info to parse might be friendlier for someone picking this up as their true first issue.

Separate from your idea: it could be interesting to run the experiment again and confirm that we've shrunk the initially quoted 80 GB memory bloat, so it's potentially worth keeping the issue open until someone repeats the experiment.