sakridge opened this issue 2 years ago
Initial investigation showed that gossip crds stats logs pretty heavily.
The following pull request addressed this. We trimmed down stats and changed the log level to trace. https://github.com/solana-labs/solana/pull/24132
Next, I will start measuring the effect of logging on memory performance on mainnet-beta.
@behzadnouri It looks like after removing gossip crds stats, the following is the heaviest log. Can we reduce the log level for this? https://github.com/solana-labs/solana/blob/255ef66a2/core/src/window_service.rs#L135
Those metrics were originally added by @carllin in https://github.com/solana-labs/solana/pull/15694 and then moved from retransmit-stage to window-service in https://github.com/solana-labs/solana/pull/19233. Please also confirm with Carl.
But before that, have you been able to reproduce the 80GB memory increase that the Triton guys are reporting? I have not seen such change in memory when turning logs off or on, so this might be something very RPC specific and a result of RPC requests handling; in which case we should first look into RPC logs and metrics.
I profiled a normal validator run on mainnet-beta with info and warn logging. I don't see much difference in memory usage. At info level, we are submitting about 1000 points every 10s. At warn level, we are barely submitting any points. Does anyone know how to start an RPC node?
Look at the rpc and accounts-index related flags among the validator flags: --full-rpc-api, --no-voting, --enable-rpc-transaction-history, --account-index, ...
https://github.com/solana-labs/solana/blob/e3ef0741b/validator/src/main.rs#L461-L1839
But I suspect that without an actual RPC request load, just spinning up an rpc node will not reproduce the issue.
My gut feeling is in line with @sakridge's. From my profiling, I don't think we are leaking any memory for the data_point. But the actual channel could be growing if the agent is not able to catch up...
We could address this by limiting the queue size and dropping any submissions when the queue is full.
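A minimal sketch of that drop-on-full idea using `std::sync::mpsc::sync_channel` (the capacity and `String` payload here are illustrative stand-ins, not the actual solana_metrics types):

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // Bounded queue: the metrics agent would read from `rx`; submitters use
    // try_send and drop the point instead of blocking when the queue is full.
    let (tx, rx) = sync_channel::<String>(2); // tiny capacity for the demo

    let mut dropped = 0usize;
    for i in 0..5 {
        if let Err(TrySendError::Full(_)) = tx.try_send(format!("point-{i}")) {
            dropped += 1; // mimics "metric queue full, point dropped."
        }
    }
    let queued: Vec<String> = rx.try_iter().collect();
    println!("queued={} dropped={}", queued.len(), dropped);
}
```

With capacity 2, the first two sends succeed and the remaining three are dropped without ever blocking the submitting thread.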
But before that, have you been able to reproduce the 80GB memory increase that the Triton guys are reporting? I have not seen such change in memory when turning logs off or on, so this might be something very RPC specific and a result of RPC requests handling; in which case we should first look into RPC logs and metrics.
Joe and Axl were able to reproduce it. The results for RPC were much worse than non-RPC, although they said the RPC node was not actually serving RPC.
Usually for RPC, one would enable these flags to enable the secondary account indexes and transaction logging:
--account-index program-id
--account-index spl-token-mint
--account-index spl-token-owner
--enable-rpc-transaction-history
--enable-cpi-and-log-storage
--enable-extended-tx-metadata-storage # maybe not on all RPC?
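Putting the flags above together, a hypothetical launch sketch (the identity and ledger paths are placeholders, and flag availability should be verified against your validator version):

```shell
# Sketch of an RPC-configured, non-voting validator invocation.
# Paths are placeholders; verify flags with `solana-validator --help`.
solana-validator \
  --identity validator-keypair.json \
  --ledger /mnt/ledger \
  --no-voting \
  --full-rpc-api \
  --account-index program-id \
  --account-index spl-token-mint \
  --account-index spl-token-owner \
  --enable-rpc-transaction-history \
  --enable-cpi-and-log-storage
```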
OK. Let me try it.
@t-nelson found another log spam from rpc to reduce.
With https://github.com/solana-labs/solana/pull/24341, I am able to get a stable run of an rpc node at info level, with memory usage stable around 270GB.
But I do see the metric queue fill up from time to time (2000 max), as in the case below.
[2022-04-14T17:14:49.036055753Z ERROR solana_metrics::metrics] haha submitting 2561 points
[2022-04-14T17:14:49.040787510Z ERROR solana_metrics::metrics] haha command queue size: 1
[2022-04-14T17:14:49.040845571Z ERROR solana_metrics::metrics] haha command queue size: 2
[2022-04-14T17:14:49.040880957Z ERROR solana_metrics::metrics] haha command queue size: 3
[2022-04-14T17:14:49.040903809Z ERROR solana_metrics::metrics] haha command queue size: 4
[2022-04-14T17:14:49.040934786Z ERROR solana_metrics::metrics] haha command queue size: 5
[2022-04-14T17:14:49.040983423Z ERROR solana_metrics::metrics] haha command queue size: 6
[2022-04-14T17:14:49.041009167Z ERROR solana_metrics::metrics] haha command queue size: 7
[2022-04-14T17:14:49.041039374Z ERROR solana_metrics::metrics] haha command queue size: 8
[2022-04-14T17:14:49.041062290Z ERROR solana_metrics::metrics] haha command queue size: 9
[2022-04-14T17:14:49.041088574Z ERROR solana_metrics::metrics] haha command queue size: 10
[2022-04-14T17:14:49.041111270Z ERROR solana_metrics::metrics] haha command queue size: 11
[2022-04-14T17:14:49.041135302Z ERROR solana_metrics::metrics] haha command queue size: 12
...
[2022-04-14T17:14:49.194224724Z ERROR solana_metrics::metrics] haha command queue size: 1997
[2022-04-14T17:14:49.194255626Z ERROR solana_metrics::metrics] haha command queue size: 1998
[2022-04-14T17:14:49.194304618Z ERROR solana_metrics::metrics] haha command queue size: 1999
[2022-04-14T17:14:49.194336753Z ERROR solana_metrics::metrics] haha command queue size: 2000
[2022-04-14T17:14:49.194395145Z WARN solana_metrics::metrics] metric queue full, point dropped.
[2022-04-14T17:14:49.194407783Z ERROR solana_metrics::metrics] haha command queue size: 2000
[2022-04-14T17:14:49.194433316Z WARN solana_metrics::metrics] metric queue full, point dropped.
[2022-04-14T17:14:49.194459347Z ERROR solana_metrics::metrics] haha command queue size: 2000
[2022-04-14T17:14:49.194474065Z WARN solana_metrics::metrics] metric queue full, point dropped.
[2022-04-14T17:14:49.194487980Z ERROR solana_metrics::metrics] haha command queue size: 2000
[2022-04-14T17:14:49.194509550Z WARN solana_metrics::metrics] metric queue full, point dropped.
[2022-04-14T17:14:49.194520645Z ERROR solana_metrics::metrics] haha command queue size: 2000
[2022-04-14T17:14:49.194525507Z WARN solana_metrics::metrics] metric queue full, point dropped.
...
Later on, we recovered.
[2022-04-14T17:14:49.496694764Z INFO solana_metrics::metrics] datapoint: slot_stats_tracking_complete slot=129722522i last_index=0i num_repaired=1i num_recovered=0i min_turbine_fec_set_count=0i is_full=false is_rooted=false is_dead=false
[2022-04-14T17:14:49.496714853Z INFO solana_metrics::metrics] datapoint: slot_stats_tracking_complete slot=129722521i last_index=0i num_repaired=1i num_recovered=0i min_turbine_fec_set_count=0i is_full=false is_rooted=false is_dead=false
[2022-04-14T17:14:49.496721335Z INFO solana_metrics::metrics] datapoint: slot_stats_tracking_complete slot=129722520i last_index=0i num_repaired=1i num_recovered=0i min_turbine_fec_set_count=0i is_full=false is_rooted=false is_dead=false
https://github.com/solana-labs/solana/blob/0e7b0597db3533d304a009d4276164ddb78049df/ledger/src/slot_stats.rs#L51 @jbiseda Can we reduce the log for this?
@HaoranYi I don't think rpc nodes emit that metric more often than other nodes, so I don't think it is the issue here.
OK. Makes sense. Let me log the actual points that are dropped.
Is this related here? #24348
No, I don't think so. It is more metric datapoint related.
In case this helps @HaoranYi here's a sample of top points generated from a validator on testnet. I wasn't running any particular rpc workload but used the following flags:
--account-index program-id
--account-index spl-token-mint
--account-index spl-token-owner
--enable-rpc-transaction-history
datapoints and counts:
("receive_window_num_slot_shreds", 8768), ("retransmit-stage-slot-stats", 7461), ("slot_stats_tracking_complete", 4557), ("bank-new_from_parent-heights", 3946), ("bank-timestamp-correction", 3946), ("bank-executor-cache-stats", 3946), ("bank-timestamp", 3946), ("shred_insert_is_full", 3943), ("compute_bank_stats-best_slot", 3921), ("bank_weight", 3921), ("collect_rent_eagerly", 3920), ("accounts_cache_size", 3920), ("replay-slot-stats", 3920), ("slot_repeated_writes", 3920), ("load_to_collect_rent_eagerly_scan_elapsed", 3920), ("per_program_timings", 3920), ("blocks_produced", 3843), ("validator-confirmation", 3836), ("tower-vote", 3786), ("block-commitment-cache", 3786), ("optimistic_slot", 3689), ("bank-forks_set_root", 3600), ("retransmit-first-shred", 3545), ("optimistic_slot_elapsed", 3414), ("poh_slot_timing", 2521)
and counters:
("vote-native", 11763907), ("bank-process_transactions-sigs", 1182664), ("bank-process_transactions-txs", 1181666), ("bank-process_transactions", 1177898), ("bank-process_transactions-error-instruction_error", 253095), ("bank-process_transactions-error_count", 253078), ("shrink_total_preliminary_candidate_stores", 18067), ("shrink_select_top_sparse_storage_entries-ms", 18067), ("shrink_select_top_sparse_storage_entries-seeds", 18067), ("cluster_info-purge-count", 17213), ("cluster_info-get_votes-count", 17163)
both of these are processed through the same queue, so maybe counters are causing congestion?
changes: git ls-remote https://github.com/jbiseda/solana.git meta-metrics
I see a few every-slot datapoints we could probably quiet
Wow. So many counters. Counters are bad in terms of submission performance. First, they are stored in a hashtable, unlike datapoints, which are stored in a vector. Second, they have to be converted into datapoints. We should move away from using counters.
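The storage-shape difference described above can be sketched as follows (the tuple types here are hypothetical stand-ins, not the actual solana_metrics structs):

```rust
use std::collections::HashMap;

fn main() {
    // Datapoints: appended to a Vec and submitted as-is.
    let mut points: Vec<(String, i64)> = Vec::new();
    points.push(("queue_size".into(), 3));

    // Counters: accumulated in a HashMap keyed by name (hashing the key on
    // every increment), then converted into datapoints at submission time.
    let mut counters: HashMap<String, i64> = HashMap::new();
    *counters.entry("vote-native".into()).or_insert(0) += 1;
    *counters.entry("vote-native".into()).or_insert(0) += 1;

    // The conversion pass that plain datapoints never need.
    let converted: Vec<(String, i64)> = counters.into_iter().collect();
    println!("points={} converted={}", points.len(), converted.len());
}
```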
We should move away from using counters.
YES! They're also a pain in the ass to view 'cause each one is its own measurement in influx
I have started an rpc node with the above changes. Let's see if the metric agent can catch up after the logging reduction.
@HaoranYi I think there are 2 separate issues:
- spammy metrics and counters, some of which are indeed wasteful.
- rpc nodes taking 80GB more memory with info log versus warn or error log.
Addressing (1) may indeed require removing some of the metrics. But jumping to fix (2) by removing frequent metrics might not be the right thing to do. I believe (2) needs more diagnosis to figure out why an rpc node behaves significantly worse.
Also @yhchiang-sol have recently added tons of metrics to rocksdb: https://github.com/solana-labs/solana/pulls?q=is%3Apr+author%3Ayhchiang-sol+is%3Aclosed I would guess those are also relevant here as well.
Those metrics were originally added by @carllin in https://github.com/solana-labs/solana/pull/15694 and then moved from retransmit-stage to window-service in https://github.com/solana-labs/solana/pull/19233. Please also confirm with Carl.
@behzadnouri thanks for making me aware of this. A bit surprised because I thought once every 2 seconds would be conservative, but I guess some really old slots could be sloshing around the network.
Yeah I think it's fine to remove these now since we no longer rely on this metric to determine how many shreds arrived over turbine for a given slot; we have blockstore metrics for that now in slot_stats_tracking_complete: https://github.com/steviez/solana/blob/2c49c5d7d8c3c81b3d22c1160192ed851c48462f/ledger/src/blockstore.rs#L169.
OK. I will change it into debug logging.
- spammy metrics and counters, some of which are indeed wasteful.
I also noticed that there are several places in the code where metrics are reported at a 100% sampling rate. Ideally, metrics should be properly sampled so that we get just enough samples and minimize any potential overhead.
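One cheap way to sample without a random-number dependency is a counter-based 1-in-N gate; this is a sketch of the idea, not the actual solana-metrics sampling code:

```rust
/// Counter-based 1-in-N sampling: submit only every Nth observation
/// instead of reporting at a 100% rate.
struct Sampled {
    every: u64,
    count: u64,
}

impl Sampled {
    fn new(every: u64) -> Self {
        Self { every, count: 0 }
    }

    /// Returns Some(value) only when this observation should be submitted.
    fn observe(&mut self, value: u64) -> Option<u64> {
        self.count += 1;
        (self.count % self.every == 0).then_some(value)
    }
}

fn main() {
    let mut sampler = Sampled::new(100);
    // Of 1000 observations, only 10 (1-in-100) make it to submission.
    let submitted = (0..1_000).filter_map(|v| sampler.observe(v)).count();
    println!("submitted {submitted} of 1000");
}
```

For rates or counts (rather than gauges), the submitted value would also need to be scaled by the sampling factor to stay unbiased.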
- rpc nodes taking 80GB more memory with info log versus warn or error log.
Also @yhchiang-sol have recently added tons of metrics to rocksdb: https://github.com/solana-labs/solana/pulls?q=is%3Apr+author%3Ayhchiang-sol+is%3Aclosed I would guess those are also relevant here as well.
The default sampling rate has been reduced to 0 so it should be no-op by default. I am working on some PRs to optimize its reporting code path and sampling rate.
this may be related to how we stuff all of the pending datapoints into a single string before submitting them. we can minimize this by streaming the request body instead. body() takes an Into<Body>, which can be anything that implements std::io::Read. should be pretty simple to build a streamer from Vec<Datapoint> around that
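A minimal sketch of such a streamer: a `std::io::Read` implementation that serializes one point at a time, so the full body never exists as one string. The `DataPoint` struct and the line-protocol formatting here are simplified stand-ins for the real solana_metrics types:

```rust
use std::io::Read;

// Hypothetical stand-in for solana_metrics::DataPoint.
struct DataPoint {
    name: String,
    value: i64,
}

/// Streams an influx-style line-protocol encoding of a Vec<DataPoint>,
/// encoding each point lazily as the reader is drained.
struct PointStreamer {
    points: std::vec::IntoIter<DataPoint>,
    buf: Vec<u8>,
    pos: usize,
}

impl PointStreamer {
    fn new(points: Vec<DataPoint>) -> Self {
        Self { points: points.into_iter(), buf: Vec::new(), pos: 0 }
    }
}

impl Read for PointStreamer {
    fn read(&mut self, out: &mut [u8]) -> std::io::Result<usize> {
        // Refill the scratch buffer from the next datapoint once drained.
        if self.pos == self.buf.len() {
            match self.points.next() {
                Some(p) => {
                    self.buf = format!("{} value={}i\n", p.name, p.value).into_bytes();
                    self.pos = 0;
                }
                None => return Ok(0), // end of stream
            }
        }
        let n = (self.buf.len() - self.pos).min(out.len());
        out[..n].copy_from_slice(&self.buf[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

fn main() {
    let points = vec![
        DataPoint { name: "queue_size".into(), value: 3 },
        DataPoint { name: "points_submitted".into(), value: 2561 },
    ];
    let mut body = String::new();
    PointStreamer::new(points).read_to_string(&mut body).unwrap();
    print!("{body}");
}
```

Anything implementing `Read` can then be handed to an HTTP client that accepts a streaming body, avoiding both the single giant string and a second pass over the points to pre-compute the length.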
I wasn't subscribed to this issue, but happened to have it open in a tab ha. In any case, I had a non-voting validator node that I was already restarting anyways so I threw an info! in to log some accumulated lens. With the tip of master, I do not see any calculated len exceeding 500k in the last 10 hours of runtime. Granted there are extra DataPoints for leaders/rpc/geyser that my node isn't capturing, but I think this gives us a good ballpark.
And as another note, metrics is telling me that there are typically just under 900 points for each submission (every 10 seconds). This isn't bad, but streaming without pre-calculating the length would save us a second iteration of the Vec<DataPoint>.
@t-nelson - If you still want to leave this for some hypothetical contributor, thoughts on breaking into a separate issue? This issue is a year+ old, and breaking it out so there is less info to parse might be friendlier for someone picking this up as their true first issue.
Separate to your idea: it could be interesting to run the experiment again and confirm that we've shrunk the initially quoted 80GB memory bloat, so it is potentially worth keeping the issue open until someone repeats the experiment.
Problem
80GB difference between RUST_LOG=solana=info and RUST_LOG=solana=warn for RPC nodes.
Proposed Solution
Debug and find out why. Possibly reduce the occurrence of spammy metrics.
Relevant question: is the channel that the metrics are sent into really large? Could there be some large number of outstanding sends, each taking some memory and state?