solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

Solana Validator 1.6.7 is failing with CUDA enabled #17304

Closed: mabalaru closed this issue 2 years ago

mabalaru commented 3 years ago

Validator configuration:

solana-validator 1.6.7 (src:ebb5fc12; feat:3458834192)

4x Nvidia 2080 TI
256GB RAM
1TB PMEM drive

Ubuntu 20.04.2 LTS
Nvidia: Driver Version: 465.27       CUDA Version: 11.3

Problem

The validator fails when CUDA is enabled, with the following error:

May 18 11:43:29 m2-solana01 solana-validator[4867]: thread 'solana-replay-stage' panicked at 'cudaHostRegister error: 2 ptr: 0x7f0d491a1000 bytes: 2560', /var/lib/buildkite-agent/builds/bananas-1/solana-labs/solana-secondary/perf/src/cuda_runtime.rs:33:17
May 18 11:43:29 m2-solana01 solana-validator[4867]: stack backtrace:
May 18 11:43:29 m2-solana01 solana-validator[4867]: thread 'solana-receiver' panicked at 'cudaHostRegister error: 2 ptr: 0x7eff6ed7d700 bytes: 167936', perf/src/cuda_runtime.rs:33:17
May 18 11:43:29 m2-solana01 solana-validator[4867]: [2021-05-18T08:43:29.491517951Z INFO  solana_metrics::metrics] datapoint: retransmit-first-shred slot=78930907i
May 18 11:43:29 m2-solana01 solana-validator[4867]:    0: rust_begin_unwind
May 18 11:43:29 m2-solana01 solana-validator[4867]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
May 18 11:43:29 m2-solana01 solana-validator[4867]:    1: std::panicking::begin_panic_fmt
May 18 11:43:29 m2-solana01 solana-validator[4867]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:435:5
May 18 11:43:29 m2-solana01 solana-validator[4867]:    2: solana_perf::cuda_runtime::PinnedVec<T>::resize
May 18 11:43:29 m2-solana01 solana-validator[4867]:    3: <[solana_ledger::entry::Entry] [2021-05-18T08:43:29.502453020Z INFO  solana_metrics::metrics] datapoint: shred_insert_is_full total_time_ms=524i slot=78930906i last_index=328i
May 18 11:43:29 m2-solana01 solana-validator[4867]: as solana_ledger::entry::EntrySlice>::start_verify
May 18 11:43:29 m2-solana01 solana-validator[4867]:    4: solana_ledger::blockstore_processor::confirm_slot
May 18 11:43:29 m2-solana01 solana-validator[4867]:    5: solana_core::replay_stage::ReplayStage::replay_active_banks
May 18 11:43:29 m2-solana01 solana-validator[4867]:    6: solana_core::replay_stage::ReplayStage::new::{{closure}}
May 18 11:43:29 m2-solana01 solana-validator[4867]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
May 18 11:43:29 m2-solana01 solana-validator[4867]: stack backtrace:
May 18 11:43:29 m2-solana01 solana-validator[4867]: [2021-05-18T08:43:29.507497560Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solana-replay-stage" one=1i message="panicked at 'cudaHostRegister error: 2 ptr: 0x7f0d491a1000 bytes: 2560', /var/lib/buildkite-agent/builds/bananas-1/solana-labs/solana-secondary/perf/src/cuda_runtime.rs:33:17" location="/var/lib/buildkite-agent/builds/bananas-1/solana-labs/solana-secondary/perf/src/cuda_runtime.rs:33:17"
May 18 11:43:29 m2-solana01 solana-validator[4867]:    0: rust_begin_unwind
May 18 11:43:29 m2-solana01 solana-validator[4867]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:[2021-05-18T08:43:29.507574086Z INFO  solana_metrics::metrics] submitting 293 points
May 18 11:43:29 m2-solana01 solana-validator[4867]: 493:5
May 18 11:43:29 m2-solana01 solana-validator[4867]:    1: std::panicking::begin_panic_fmt
May 18 11:43:29 m2-solana01 solana-validator[4867]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:435:5
May 18 11:43:29 m2-solana01 solana-validator[4867]:    2: solana_perf::cuda_runtime::pin
May 18 11:43:29 m2-solana01 solana-validator[4867]:    3: solana_perf::cuda_runtime::PinnedVec<T>::reserve_and_pin
May 18 11:43:29 m2-solana01 solana-validator[4867]:    4: solana_perf::packet::Packets::new_with_recycler
May 18 11:43:29 m2-solana01 solana-validator[4867]:    5: solana_streamer::streamer::recv_loop
May 18 11:43:29 m2-solana01 solana-validator[4867]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
May 18 11:43:30 m2-solana01 solana-validator[4867]: [2021-05-18T08:43:30.362620869Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solana-receiver" one=1i message="panicked at 'cudaHostRegister error: 2 ptr: 0x7eff6ed7d700 bytes: 167936', perf/src/cuda_runtime.rs:33:17" location="perf/src/cuda_runtime.rs:33:17"
May 18 11:43:30 m2-solana01 solana-validator[4867]: [2021-05-18T08:43:30.362657960Z INFO  solana_metrics::metrics] submitting 3 points
May 18 11:43:30 m2-solana01 solana-validator[4867]: total_gpus: 4
May 18 11:43:30 m2-solana01 solana-validator[4867]: thread 'solana-receiver' panicked at 'cudaHostRegister error: 4 ptr: 0x7f0901d0c140 bytes: 167936', perf/src/cuda_runtime.rs:33:17
May 18 11:43:30 m2-solana01 solana-validator[4867]: stack backtrace:
May 18 11:43:30 m2-solana01 solana-validator[4867]:    0: rust_begin_unwind
May 18 11:43:30 m2-solana01 solana-validator[4867]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
May 18 11:43:30 m2-solana01 solana-validator[4867]:    1: std::panicking::begin_panic_fmt
May 18 11:43:30 m2-solana01 solana-validator[4867]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:435:5
May 18 11:43:30 m2-solana01 solana-validator[4867]:    2: solana_perf::cuda_runtime::pin
May 18 11:43:30 m2-solana01 solana-validator[4867]:    3: solana_perf::cuda_runtime::PinnedVec<T>::reserve_and_pin
May 18 11:43:30 m2-solana01 solana-validator[4867]:    4: solana_perf::packet::Packets::new_with_recycler
May 18 11:43:30 m2-solana01 solana-validator[4867]:    5: solana_streamer::streamer::recv_loop
May 18 11:43:30 m2-solana01 solana-validator[4867]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
May 18 11:43:30 m2-solana01 solana-validator[4867]: thread 'solana-listen' panicked at 'cudaHostUnregister returned: 4 ptr: 0x7f0d848ca400', /var/lib/buildkite-agent/builds/bananas-1/solana-labs/solana-secondary/perf/src/cuda_runtime.rs:51:17
May 18 11:43:30 m2-solana01 solana-validator[4867]: stack backtrace:
May 18 11:43:30 m2-solana01 solana-validator[4867]:    0: rust_begin_unwind
May 18 11:43:30 m2-solana01 solana-validator[4867]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
May 18 11:43:30 m2-solana01 solana-validator[4867]:    1: std::panicking::begin_panic_fmt
May 18 11:43:30 m2-solana01 solana-validator[4867]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:435:5
May 18 11:43:50 m2-solana01 systemd[1]: solana.service: Main process exited, code=exited, status=1/FAILURE
May 18 11:43:50 m2-solana01 systemd[1]: solana.service: Failed with result 'exit-code'.
May 18 11:43:51 m2-solana01 systemd[1]: solana.service: Scheduled restart job, restart counter is at 10.
May 18 11:43:51 m2-solana01 systemd[1]: Stopped Solana Validator.
May 18 11:43:51 m2-solana01 systemd[1]: Started Solana Validator.

Proposed Solution

The validator should work with CUDA enabled.

sakridge commented 3 years ago

cc @ryoqun @behzadnouri @carllin

The Recycler fixes should be constraining this usage, right? Are any of them missing in 1.6 that might help?

behzadnouri commented 3 years ago

From the error code, this seems like an out of memory error, right?

cudaErrorMemoryAllocation = 2 The API call failed because it was unable to allocate enough memory to perform the requested operation.

The changes I had made are not backported to v1.6. If this is an oom problem, I am also not expecting that they will help.
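For readers decoding the numbers in the panic messages, here is a minimal sketch that names the two status codes seen in the logs. The function name `cuda_error_name` is hypothetical and not part of `solana_perf`. The numbering is taken from the `cudaError` enum in the CUDA 11 runtime headers; that the validator's bundled runtime uses the same numbering is an assumption, although the "driver shutting down" message logged next to code 4 is consistent with it.

```rust
/// Hypothetical helper: names the CUDA runtime status codes that appear in
/// this thread. Assumption: CUDA 11 numbering of the `cudaError` enum;
/// older toolkits number some codes differently.
fn cuda_error_name(code: u32) -> Option<&'static str> {
    match code {
        0 => Some("cudaSuccess"),
        // Reported by cudaHostRegister in the panics quoted above.
        2 => Some("cudaErrorMemoryAllocation"),
        // Reported by cudaHostRegister/cudaHostUnregister while the
        // process was already going down after the first panic.
        4 => Some("cudaErrorCudartUnloading"),
        // Anything else is left unnamed rather than guessed.
        _ => None,
    }
}
```

Under that assumption, only the code 2 panics point at memory exhaustion; the code 4 panics that follow would be a side effect of the runtime shutting down.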

sakridge commented 3 years ago

Yea. It is OOM.

mabalaru commented 3 years ago

I have an update on this situation. I downgraded the server to Solana 1.5.19 and it has been working fine for more than 24 hours. So the upgrade to Solana 1.6.x is causing this issue.

mabalaru commented 3 years ago

Hello, I did some testing with version 1.6.10. This version is more stable: it takes 2-3 days to reach a panic like this:

Jun  2 08:02:03 m2-solana01 solana-validator[3206030]: [2021-06-02T05:02:03.139374444Z INFO  solana_metrics::metrics] datapoint: banking_stage-loop-stats id=0i process_packets_count=876i new_tx_count=0i dropped_batches_count=0i newly_buffered_packets_count=221i current_buffered_packets_count=1065i rebuffered_packets_count=0i consume_buffered_packets_elapsed=0i process_packets_elapsed=3038i handle_retryable_packets_elapsed=0i filter_pending_packets_elapsed=0i packet_duplicate_check_elapsed=614i packet_conversion_elapsed=0i transaction_processing_elapsed=0i
Jun  2 08:02:03 m2-solana01 solana-validator[3206030]: [2021-06-02T05:02:03.396988224Z INFO  solana_metrics::metrics] datapoint: shred_fetch_tvu_forwards index_overrun=0i shred_count=541i slot_bad_deserialize=0i index_bad_deserialize=0i index_out_of_bounds=0i slot_out_of_range=0i duplicate_shred=0i
Jun  2 08:02:03 m2-solana01 solana-validator[3206030]: thread 'solana-receiver' panicked at 'cudaHostRegister error: 2 ptr: 0x7f45e6ab8300 bytes: 167936', perf/src/cuda_runtime.rs:33:17
Jun  2 08:02:03 m2-solana01 solana-validator[3206030]: stack backtrace:
Jun  2 08:02:03 m2-solana01 solana-validator[3206030]:    0: rust_begin_unwind
Jun  2 08:02:03 m2-solana01 solana-validator[3206030]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
Jun  2 08:02:03 m2-solana01 solana-validator[3206030]:    1: std::panicking::begin_panic_fmt
Jun  2 08:02:03 m2-solana01 solana-validator[3206030]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:435:5
Jun  2 08:02:03 m2-solana01 solana-validator[3206030]:    2: solana_perf::cuda_runtime::pin
Jun  2 08:02:03 m2-solana01 solana-validator[3206030]:    3: solana_perf::cuda_runtime::PinnedVec<T>::reserve_and_pin
Jun  2 08:02:03 m2-solana01 solana-validator[3206030]:    4: solana_perf::packet::Packets::new_with_recycler
Jun  2 08:02:03 m2-solana01 solana-validator[3206030]:    5: solana_streamer::streamer::recv_loop
Jun  2 08:02:03 m2-solana01 solana-validator[3206030]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Jun  2 08:02:03 m2-solana01 solana-validator[3206030]: [2021-06-02T05:02:03.458706709Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solana-receiver" one=1i message="panicked at 'cudaHostRegister error: 2 ptr: 0x7f45e6ab8300 bytes: 167936', perf/src/cuda_runtime.rs:33:17" location="perf/src/cuda_runtime.rs:33:17"
Jun  2 08:02:03 m2-solana01 solana-validator[3206030]: [2021-06-02T05:02:03.458778236Z INFO  solana_metrics::metrics] submitting 197 points
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]: [2021-06-02T05:02:04.000809442Z INFO  solana_runtime::accounts_db] finalize_dead_slot_removal: slots [80605618, 80605549, 80605602, 80605608, 80605550, 80605624, 80605537, 80605565, 80605600, 80605588, 80605563, 80605543, 80605529, 80605532, 80605595, 80605591, 80605539, 80605641, 80605561, 80605564, 80605548, 80605653, 80605536, 80605617, 80605551, 80605643, 80605648, 80605633, 80605644, 80605640, 80605642, 80605566, 80605649, 80605586, 80605584, 80605621, 80605560, 80605610, 80605629, 80605627, 80605647, 80605544, 80605612, 80605634, 80605534, 80605525, 80605615, 80605646, 80605542, 80605592, 80605531, 80605597, 80605611, 80605645, 80605596, 80605562, 80605650]
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]: thread 'solana-receiver' panicked at 'cudaHostRegister error: 2 ptr: 0x7f37129d3f80 bytes: 167936', perf/src/cuda_runtime.rs:33:17
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]: stack backtrace:
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:    0: rust_begin_unwind
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:    1: std::panicking::begin_panic_fmt
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:435:5
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:    2: solana_perf::cuda_runtime::pin
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:    3: solana_perf::cuda_runtime::PinnedVec<T>::reserve_and_pin
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:    4: solana_perf::packet::Packets::new_with_recycler
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:    5: solana_streamer::streamer::recv_loop
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]: [2021-06-02T05:02:04.133892003Z INFO  solana_metrics::metrics] datapoint: banking_stage-loop-stats id=3i process_packets_count=0i new_tx_count=0i dropped_batches_count=0i newly_buffered_packets_count=0i current_buffered_packets_count=0i rebuffered_packets_count=0i consume_buffered_packets_elapsed=0i process_packets_elapsed=0i handle_retryable_packets_elapsed=0i filter_pending_packets_elapsed=0i packet_duplicate_check_elapsed=0i packet_conversion_elapsed=0i transaction_processing_elapsed=0i
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]: [2021-06-02T05:02:04.133911343Z INFO  solana_metrics::metrics] datapoint: shred_fetch index_overrun=0i shred_count=17i slot_bad_deserialize=0i index_bad_deserialize=0i index_out_of_bounds=0i slot_out_of_range=0i duplicate_shred=0i
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]: [2021-06-02T05:02:04.133951504Z INFO  solana_metrics::metrics] datapoint: poh-service ticks=176i hashes=2207744i elapsed_us=365594i total_sleep_us=0i total_tick_time_us=272i total_lock_time_us=69i total_hash_time_us=1001647i total_record_time_us=0i
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]: [2021-06-02T05:02:04.133974456Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solana-receiver" one=1i message="panicked at 'cudaHostRegister error: 2 ptr: 0x7f37129d3f80 bytes: 167936', perf/src/cuda_runtime.rs:33:17" location="perf/src/cuda_runtime.rs:33:17"
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]: [2021-06-02T05:02:04.133990708Z INFO  solana_metrics::metrics] submitting 10 points
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]: thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: reqwest::Error { kind: Builder, source: Normal(ErrorStack([])) }', metrics/src/metrics.rs:107:18
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]: stack backtrace:
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:    0: rust_begin_unwind
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:    1: core::panicking::panic_fmt
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/panicking.rs:92:14
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:    2: core::option::expect_none_failed
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/option.rs:1300:5
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:    3: <solana_metrics::metrics::InfluxDbMetricsWriter as solana_metrics::metrics::MetricsWriter>::write
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:    4: solana_metrics::metrics::MetricsAgent::write
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]:    5: solana_metrics::metrics::MetricsAgent::run
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Jun  2 08:02:04 m2-solana01 solana-validator[3206030]: total_gpus: 4
Jun  2 08:02:30 m2-solana01 systemd[1]: solana.service: Main process exited, code=exited, status=1/FAILURE
Jun  2 08:02:30 m2-solana01 systemd[1]: solana.service: Failed with result 'exit-code'.

Before this, it also happened on May 31 at 11:47:38:

May 31 11:47:38 m2-solana01 solana-validator[1504659]: thread 'solana-receiver' panicked at 'cudaHostRegister error: 2 ptr: 0x7f10188654c0 bytes: 167936', perf/src/cuda_runtime.rs:33:17
May 31 11:47:38 m2-solana01 solana-validator[1504659]: stack backtrace:
May 31 11:47:38 m2-solana01 solana-validator[1504659]:    0: rust_begin_unwind
May 31 11:47:38 m2-solana01 solana-validator[1504659]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
May 31 11:47:38 m2-solana01 solana-validator[1504659]:    1: std::panicking::begin_panic_fmt
May 31 11:47:38 m2-solana01 solana-validator[1504659]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:435:5
May 31 11:47:38 m2-solana01 solana-validator[1504659]:    2: solana_perf::cuda_runtime::pin
May 31 11:47:38 m2-solana01 solana-validator[1504659]:    3: solana_perf::cuda_runtime::PinnedVec<T>::reserve_and_pin
May 31 11:47:38 m2-solana01 solana-validator[1504659]:    4: solana_perf::packet::Packets::new_with_recycler
May 31 11:47:38 m2-solana01 solana-validator[1504659]:    5: solana_streamer::streamer::recv_loop
May 31 11:47:38 m2-solana01 solana-validator[1504659]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
May 31 11:47:38 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:38.607163761Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solana-receiver" one=1i message="panicked at 'cudaHostRegister error: 2 ptr: 0x7f10188654c0 bytes: 167936', perf/src/cuda_runtime.rs:33:17" location="perf/src/cuda_runtime.rs:33:17"
May 31 11:47:38 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:38.607209230Z INFO  solana_metrics::metrics] submitting 135 points
May 31 11:47:39 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:39.443855818Z INFO  solana_runtime::accounts_db] finalize_dead_slot_removal: slots [80337641, 80337679, 80337670, 80337592, 80337626, 80337600, 80337665, 80337694, 80337637, 80337677, 80337672, 80337616, 80337615, 80337673, 80337605, 80337606, 80337647, 80337645, 80337625, 80337623, 80337658, 80337644, 80337632, 80337587, 80337586, 80337598, 80337643, 80337652, 80337624, 80337687, 80337666, 80337595, 80337669, 80337654, 80337656, 80337675, 80337667, 80337618, 80337664, 80337638, 80337636, 80337640, 80337599, 80337617, 80337668, 80337607, 80337674, 80337693, 80337646, 80337614, 80337676, 80337622, 80337678, 80337655, 80337695, 80337685, 80337659, 80337684, 80337671, 80337621, 80337593, 80337657, 80337619, 80337653, 80337612, 80337686, 80337692, 80337613]
May 31 11:47:39 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:39.456632353Z INFO  solana_core::repair_service] repair_stats: [(80769831, 11), (80769824, 1120)]
May 31 11:47:39 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:39.554988934Z INFO  solana_metrics::metrics] datapoint: shred_insert_is_full total_time_ms=770i slot=80769824i last_index=1056i
May 31 11:47:39 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:39.555014895Z INFO  solana_metrics::metrics] datapoint: poh-service ticks=173i hashes=2170112i elapsed_us=372228i total_sleep_us=0i total_tick_time_us=196i total_lock_time_us=48i total_hash_time_us=1002557i total_record_time_us=0i
May 31 11:47:39 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:39.555066203Z INFO  solana_metrics::metrics] datapoint: banking_stage-loop-stats id=3i process_packets_count=0i new_tx_count=0i dropped_batches_count=0i newly_buffered_packets_count=0i current_buffered_packets_count=0i rebuffered_packets_count=0i consume_buffered_packets_elapsed=0i process_packets_elapsed=0i handle_retryable_packets_elapsed=0i filter_pending_packets_elapsed=0i packet_duplicate_check_elapsed=0i packet_conversion_elapsed=0i transaction_processing_elapsed=0i
May 31 11:47:39 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:39.555074298Z INFO  solana_metrics::metrics] datapoint: shred_fetch_repair index_overrun=0i shred_count=684i slot_bad_deserialize=0i index_bad_deserialize=0i index_out_of_bounds=0i slot_out_of_range=0i duplicate_shred=304i
May 31 11:47:39 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:39.555091904Z INFO  solana_metrics::metrics] datapoint: banking_stage-loop-stats id=1i process_packets_count=0i new_tx_count=0i dropped_batches_count=0i newly_buffered_packets_count=0i current_buffered_packets_count=0i rebuffered_packets_count=0i consume_buffered_packets_elapsed=0i process_packets_elapsed=0i handle_retryable_packets_elapsed=0i filter_pending_packets_elapsed=0i packet_duplicate_check_elapsed=0i packet_conversion_elapsed=0i transaction_processing_elapsed=0i
May 31 11:47:39 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:39.555097737Z INFO  solana_metrics::metrics] datapoint: banking_stage-loop-stats id=2i process_packets_count=0i new_tx_count=0i dropped_batches_count=0i newly_buffered_packets_count=0i current_buffered_packets_count=0i rebuffered_packets_count=0i consume_buffered_packets_elapsed=0i process_packets_elapsed=0i handle_retryable_packets_elapsed=0i filter_pending_packets_elapsed=0i packet_duplicate_check_elapsed=0i packet_conversion_elapsed=0i transaction_processing_elapsed=0i
May 31 11:47:39 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:39.555149677Z INFO  solana_metrics::metrics] datapoint: serve_repair-repair repair-total=1131i shred-count=1131i highest-shred-count=0i orphan-count=0i repair-highest-slot=0i repair-orphan=0i
May 31 11:47:39 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:39.555165647Z INFO  solana_metrics::metrics] datapoint: serve_repair-repair-timing set-root-elapsed=31i get-votes-elapsed=622i add-votes-elapsed=2003i get-best-orphans-elapsed=1258i get-best-shreds-elapsed=11177i send-repairs-elapsed=12427i
May 31 11:47:39 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:39.555565490Z INFO  solana_metrics::metrics] datapoint: accounts_db_store_timings hash_accounts=48234i store_accounts=18847i update_index=3406i handle_reclaims=12i append_accounts=0i find_storage=0i num_accounts=2174i total_data=173438325i
May 31 11:47:39 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:39.555579348Z INFO  solana_metrics::metrics] datapoint: accounts_db_store_timings2 recycle_store_count=0i current_recycle_store_count=1001i current_recycle_store_bytes=3113361408i create_store_count=0i store_get_slot_store=0i store_find_existing=0i dropped_stores=68i
May 31 11:47:39 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:39.556919286Z INFO  solana_metrics::metrics] datapoint: recv-window-insert-shreds num_shreds=6076i total_elapsed=1228537i insert_lock_elapsed=0i insert_shreds_elapsed=60234i shred_recovery_elapsed=1126713i chaining_elapsed=118i commit_working_sets_elapsed=5253i write_batch_elapsed=29458i num_inserted=2661i num_repair=7i num_recovered=967i num_recovered_inserted=967i num_recovered_failed_sig=0i num_recovered_failed_invalid=0i num_recovered_exists=0i
May 31 11:47:39 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:39.557725041Z INFO  solana_metrics::counter] COUNTER:{"name": "bank-process_transactions-txs", "counts": 167184118, "samples": 147884000,  "now": 1622450859557, "events": 0}
May 31 11:47:39 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:39.557774479Z INFO  solana_metrics::counter] COUNTER:{"name": "bank-process_transactions", "counts": 226163319, "samples": 172811000,  "now": 1622450859557, "events": 1}
May 31 11:47:39 m2-solana01 solana-validator[1504659]: [2021-05-31T08:47:39.557801232Z INFO  solana_metrics::counter] COUNTER:{"name": "bank-process_transactions-sigs", "counts": 203576439, "samples": 147884000,  "now": 1622450859557, "events": 1}
May 31 11:47:39 m2-solana01 solana-validator[1504659]: ERR: driver shutting down cuda-ecc-ed25519/gpu_ctx.cu 68
May 31 11:47:39 m2-solana01 solana-validator[1504659]: solana-validator: common/gpu_common.h:22: void cuda_assert(cudaError_t, const char*, int): Assertion `0' failed.
May 31 11:47:39 m2-solana01 solana-validator[1504659]: thread 'solana-listen' panicked at 'cudaHostUnregister returned: 4 ptr: 0x7f1031dbcb40', /var/lib/buildkite-agent/builds/froome-1/solana-labs/solana-secondary/perf/src/cuda_runtime.rs:51:17
May 31 11:47:39 m2-solana01 solana-validator[1504659]: stack backtrace:
May 31 11:47:39 m2-solana01 solana-validator[1504659]:    0: rust_begin_unwind
May 31 11:47:39 m2-solana01 solana-validator[1504659]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
May 31 11:47:39 m2-solana01 solana-validator[1504659]:    1: std::panicking::begin_panic_fmt
May 31 11:47:39 m2-solana01 solana-validator[1504659]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:435:5
May 31 11:48:06 m2-solana01 solana-validator[3206030]: [2021-05-31T08:48:06.151106572Z INFO  solana_validator] solana-validator 1.6.10 (src:5d4654d2; feat:3533521759)

Thank you in advance for any help.
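To compare the panics across the runs quoted above, the fields of each `cudaHostRegister` message can be pulled out of the journal lines. The following is a rough sketch, not code from the validator: the struct and function names are hypothetical, and the message layout (`error: <code> ptr: <hex> bytes: <n>`) is assumed from the log lines in this thread only.

```rust
/// Hypothetical record for one `cudaHostRegister` failure found in a log line.
#[derive(Debug, PartialEq)]
struct HostRegisterFailure {
    code: u32,
    ptr: u64,
    bytes: u64,
}

/// Extracts code, pointer and size from a line containing
/// "cudaHostRegister error: <code> ptr: 0x<hex> bytes: <n>'".
/// Returns None for lines that do not match this assumed layout.
fn parse_host_register_failure(line: &str) -> Option<HostRegisterFailure> {
    let marker = "cudaHostRegister error: ";
    let start = line.find(marker)? + marker.len();
    // The fields end at the closing quote of the panic message.
    let fields = line[start..].split('\'').next()?;
    let mut parts = fields.split_whitespace();

    let code = parts.next()?.parse().ok()?;
    if parts.next()? != "ptr:" {
        return None;
    }
    let ptr = u64::from_str_radix(parts.next()?.strip_prefix("0x")?, 16).ok()?;
    if parts.next()? != "bytes:" {
        return None;
    }
    let bytes = parts.next()?.parse().ok()?;

    Some(HostRegisterFailure { code, ptr, bytes })
}
```

Each panic shows up twice in the journal, once as the panic line and once as the `datapoint: panic` metrics line, so a tally built on this should count only one of the two.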

mabalaru commented 3 years ago

I ran Solana 1.6.10 with CUDA 10.1, as suggested in the documentation. The error is still present after 1 hour of running:

Jun  3 01:54:41 m2-solana01 solana-validator[3875]: thread 'solana-receiver' panicked at 'cudaHostRegister error: 2 ptr: 0x7f9b78a1fd80 bytes: 167936', perf/src/cuda_runtime.rs:33:17
Jun  3 01:54:41 m2-solana01 solana-validator[3875]: stack backtrace:
Jun  3 01:54:41 m2-solana01 solana-validator[3875]: [2021-06-02T22:54:41.016598222Z INFO  solana_core::repair_service] repair_stats: [(81144180, 127), (81144181, 447)]
Jun  3 01:54:41 m2-solana01 solana-validator[3875]: [2021-06-02T22:54:41.016625217Z INFO  solana_metrics::metrics] datapoint: serve_repair-repair repair-total=574i shred-count=574i highest-shred-count=0i orphan-count=0i repair-highest-slot=0i repair-orphan=0i
Jun  3 01:54:41 m2-solana01 solana-validator[3875]: [2021-06-02T22:54:41.016637150Z INFO  solana_metrics::metrics] datapoint: serve_repair-repair-timing set-root-elapsed=41i get-votes-elapsed=1127i add-votes-elapsed=2353i get-best-orphans-elapsed=534i get-best-shreds-elapsed=7023i send-repairs-elapsed=7099i
Jun  3 01:54:41 m2-solana01 solana-validator[3875]: [2021-06-02T22:54:41.017157813Z INFO  solana_metrics::metrics] datapoint: retransmit-first-shred slot=81144182i
Jun  3 01:54:41 m2-solana01 solana-validator[3875]: [2021-06-02T22:54:41.045406109Z INFO  solana_metrics::metrics] datapoint: shred_insert_is_full total_time_ms=1212i slot=81144181i last_index=630i
Jun  3 01:54:41 m2-solana01 solana-validator[3875]:    0: rust_begin_unwind
Jun  3 01:54:41 m2-solana01 solana-validator[3875]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
Jun  3 01:54:41 m2-solana01 solana-validator[3875]:    1: std::panicking::begin_panic_fmt
Jun  3 01:54:41 m2-solana01 solana-validator[3875]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:435:5
Jun  3 01:54:41 m2-solana01 solana-validator[3875]:    2: solana_perf::cuda_runtime::pin
Jun  3 01:54:41 m2-solana01 solana-validator[3875]:    3: solana_perf::cuda_runtime::PinnedVec<T>::reserve_and_pin
Jun  3 01:54:41 m2-solana01 solana-validator[3875]:    4: solana_perf::packet::Packets::new_with_recycler
Jun  3 01:54:41 m2-solana01 solana-validator[3875]:    5: solana_streamer::streamer::recv_loop
Jun  3 01:54:41 m2-solana01 solana-validator[3875]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Jun  3 01:54:41 m2-solana01 solana-validator[3875]: [2021-06-02T22:54:41.067728440Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solana-receiver" one=1i message="panicked at 'cudaHostRegister error: 2 ptr: 0x7f9b78a1fd80 bytes: 167936', perf/src/cuda_runtime.rs:33:17" location="perf/src/cuda_runtime.rs:33:17"
Jun  3 01:54:41 m2-solana01 solana-validator[3875]: [2021-06-02T22:54:41.067759175Z INFO  solana_metrics::metrics] submitting 83 points
Jun  3 01:54:41 m2-solana01 solana-validator[3875]: thread 'solana-receiver' panicked at 'cudaHostRegister error: 2 ptr: 0x7fa955875ec0 bytes: 167936', perf/src/cuda_runtime.rs:33:17
Jun  3 01:54:41 m2-solana01 solana-validator[3875]: stack backtrace:
Jun  3 01:54:41 m2-solana01 solana-validator[3875]:    0: rust_begin_unwind
Jun  3 01:54:41 m2-solana01 solana-validator[3875]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
Jun  3 01:54:41 m2-solana01 solana-validator[3875]:    1: std::panicking::begin_panic_fmt
Jun  3 01:54:41 m2-solana01 solana-validator[3875]:              at ./rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:435:5
Jun  3 01:54:41 m2-solana01 solana-validator[3875]:    2: solana_perf::cuda_runtime::pin
Jun  3 01:54:41 m2-solana01 solana-validator[3875]:    3: solana_perf::cuda_runtime::PinnedVec<T>::reserve_and_pin
Jun  3 01:54:41 m2-solana01 solana-validator[3875]:    4: solana_perf::packet::Packets::new_with_recycler
Jun  3 01:54:41 m2-solana01 solana-validator[3875]:    5: solana_streamer::streamer::recv_loop
Jun  3 01:54:41 m2-solana01 solana-validator[3875]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

sakridge commented 2 years ago

Try with newer versions and re-open if still present.

github-actions[bot] commented 2 years ago

This issue has been automatically locked since there has not been any activity in past 7 days after it was closed. Please open a new issue for related bugs.