Closed mvines closed 4 years ago
Here's a partial stack trace from one of the SLP validators that appears to show evidence of an OOM as well:
slots: {929717, 929718, 929719, 929720, 929721, 929722, 929723, 929724, 929725, 929726, 929727, 929728, 929729, 929730, 929731, 929732, 929733, 929734, 929735, 929736, 929737, 929738, 929739, 929740, 929741, 929742, 929743, 929744, 929745, 929746, 929747}
[2020-01-21T10:32:43.723454696Z INFO solana_core::cluster_info] ReceiveUpdates took: 12 ms len: 2
[2020-01-21T10:32:43.772950966Z INFO solana_core::cluster_info] ReceiveUpdates took: 41 ms len: 1
[2020-01-21T10:32:43.780096885Z ERROR solana_core::cluster_info_repair_listener] Serving repair for slot 929749 to Luna1VCsPBE4hghuHaL9UFgimBB3V6u6johyd7hGXBL. Repairee slots: {929717, 929718, 929719, 929720, 929721, 929722, 929723, 929724, 929725, 929726, 929727, 929728, 929729, 929730, 929731, 929732, 929733, 929734, 929735, 929736, 929737, 929738, 929739, 929740, 929741, 929742, 929743, 929744, 929745, 929746, 929747}
[2020-01-21T10:32:43.784794499Z INFO solana_runtime::bank] bank frozen: 934779 hash: 4x6j7MxQjwDKxtDGytJjLMCdLUAvoAFCnTXxG4E98Bzy accounts_delta: BankHash 6a7b61ce9f59f3967b50d8613d6b01f63aca1914ef6b617c0e550c1c7ff32d6e signature_count: 44 last_blockhash: C64YnnkxUMLEpdc84PmjTdq8Kwv4TiAtQo3AV77R2WV2
[2020-01-21T10:32:43.785188261Z INFO solana_runtime::bank] accounts hash slot: 934779 stats: BankHashStats { num_removed_accounts: 44, num_added_accounts: 0, num_lamports_stored: 10051303410401, total_data_len: 45540, num_executable_accounts: 0 }
[2020-01-21T10:32:43.791000183Z WARN solana_core::replay_stage] 3o43fXxTpndqVsCdMi16WNq6aR9er75P364ZkTHrEQJN slot_weight: 934779 15424587707071582546506452 319932299605495709067931415168 934778
[2020-01-21T10:32:43.795203564Z INFO solana_core::replay_stage] voting: 934779 319932299605495709067931415168
[2020-01-21T10:32:43.807357083Z INFO solana_ledger::bank_forks] setting snapshot root: 934748
[2020-01-21T10:32:43.803665925Z INFO solana_core::cluster_info] ReceiveUpdates took: 8 ms len: 3
[2020-01-21T10:32:43.807162093Z ERROR solana_core::cluster_info_repair_listener] Serving repair for slot 929749 to ArpeD4LKYgza1o6aR5xNTQX3hxeik8URxWNQVpA8wirV. Repairee slots: {929717, 929718, 929719, 929720, 929721, 929722, 929723, 929724, 929725, 929726, 929727, 929728, 929729, 929730, 929731, 929732, 929733, 929734, 929735, 929736, 929737, 929738, 929739, 929740, 929741, 929742, 929743, 929744, 929745, 929746, 929747}
[2020-01-21T10:32:43.834852993Z INFO solana_core::cluster_info] ReceiveUpdates took: 17 ms len: 2
[2020-01-21T10:32:43.867447010Z INFO solana_core::poh_recorder] reset poh from: FVgxSvaJK41JUr1tK237m1NkQ5xteH1QHwmpsUtjD2CR,59825884,934778 to: C64YnnkxUMLEpdc84PmjTdq8Kwv4TiAtQo3AV77R2WV2,934779
[2020-01-21T10:32:43.867579731Z INFO solana_core::replay_stage] 3o43fXxTpndqVsCdMi16WNq6aR9er75P364ZkTHrEQJN reset PoH to tick 59825920 (within slot 934779). I am not in the leader schedule yet
[2020-01-21T10:32:43.867601990Z INFO solana_core::replay_stage] vote bank: Some(934779) reset bank: 934779
[2020-01-21T10:32:43.870812331Z INFO solana_core::replay_stage] new fork:934780 parent:934779 root:934748
[2020-01-21T10:32:43.921748901Z INFO solana_core::cluster_info] ReceiveUpdates took: 12 ms len: 1
[2020-01-21T10:32:43.935683367Z ERROR solana_core::cluster_info_repair_listener] Serving repair for slot 929749 to SPC3m89qwxGbqYdg1GuaoeZtgJD2hYoob6c4aKLG1zu. Repairee slots: {929716, 929717, 929718, 929719, 929720, 929721, 929722, 929723, 929724, 929725, 929726, 929727, 929728, 929729, 929730, 929731, 929732, 929733, 929734, 929735, 929736, 929737, 929738, 929739, 929740, 929741, 929742, 929743, 929744, 929745, 929746, 929747}
[2020-01-21T10:32:43.990848178Z INFO solana_metrics::metrics] submit response: 204 No Content
[2020-01-21T10:32:44.006634927Z INFO solana_metrics::metrics] submitting 125 points
[2020-01-21T10:32:44.032318910Z INFO solana_core::cluster_info] RequestWindowIndex took: 38 ms
thread '<unnamed>' panicked at 'failed to allocate an alternative stack', src/libstd/sys/unix/stack_overflow.rs:134:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
[2020-01-21T10:32:48.784804395Z INFO solana_metrics::metrics] submitting 979 points
thread '<unnamed>' panicked at 'failed to allocate an alternative stack', src/libstd/sys/unix/stack_overflow.rs:134:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
fatal runtime error: failed to initiate panic, error 5
Aborted
and
Out of memory: Kill process 582 (solana-validato) score 958 or sacrifice child
[13101.676767] Killed process 582 (solana-validato) total-vm:77370672kB, anon-rss:30397572kB, file-rss:0kB, shmem-rss:0kB
[13106.238504] oom_reaper: reaped process 582 (solana-validato), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
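If it helps to confirm OOMs on other nodes, scanning the kernel log for oom-killer activity is a quick check. A minimal sketch (the `find_oom` helper name is made up; the pattern just matches lines like the ones above, and on a live node you would pipe `dmesg` into it instead of the sample line):

```shell
# Hypothetical helper for spotting oom-killer activity in kernel log text.
find_oom() { grep -iE 'out of memory|oom_reaper|killed process'; }

# Sample input; on a real node: dmesg | find_oom
printf '%s\n' \
  '[13101.676767] Killed process 582 (solana-validato) total-vm:77370672kB' \
  | find_oom
```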
Might be tough since this is all in a pile of bash right now, and we're not going back to shipping bash
This one is fixed, right?
No, it's not fixed yet. In our dev environment, there are some bash scripts that run in the background on each node and report memory usage into InfluxDB:
I rolled this out on our nodes, https://github.com/solana-labs/cluster/commit/115ef8f92b0abc10565f5180569c1e4e057529bd. I don't see a nice way to deploy this to the entire fleet of nodes, so monitoring our nodes will have to do for now
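A memory reporter of that kind could be sketched as below. This is an illustrative guess, not the actual cluster scripts: the measurement name (`node_memory`), the `host` tag, and the InfluxDB endpoint are all placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a per-node memory reporter for InfluxDB.
HOST="$(hostname)"

# MemTotal/MemAvailable are reported in kB in /proc/meminfo
total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
used=$((total - avail))

# InfluxDB line protocol: measurement,tag=value field=value,...
POINT="node_memory,host=${HOST} total_kb=${total}i,used_kb=${used}i"
echo "$POINT"

# On a live node this would run on a timer (cron/systemd) and POST
# each point to the write endpoint, e.g.:
#   curl -s -XPOST 'http://localhost:8086/write?db=cluster' --data-binary "$POINT"
```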
I've had reports of at least two SLP validators' nodes experiencing OOMs. But the Grafana memory graph isn't working, so it's hard to determine if others are also experiencing unusual memory usage.
Sample dmesg output: