near / stakewars-iv


Node down with an error - Insufficient resources: could not allocate code memory: Cannot allocate memory (os error 12) #61

Closed SNSMLN closed 4 months ago

SNSMLN commented 4 months ago

My node went down for no apparent reason. The neard log contains only this:

Mar 11 16:57:50 node01new neard[882540]: 2024-03-11T15:57:50.306463Z  INFO stats: #114493881 DnowrgBqgcK4Wo5CGufbhQbX4BpNeT4hGiw1q6a1FkbH Validator | 20 validators 43 peers ⬇ 19.3 MB/s ⬆ 48.3 MB/s 1.00 bps 36.9 Tgas/s CPU: 300%, Mem: 3.79 GB
Mar 11 16:57:54 node01new neard[882540]: 2024-03-11T15:57:54.167114Z  INFO near_network::peer_manager::connection: peer ed25519:2SVMYgTYcEgnxYGyr9XxHUZFHf7XfS43DaHNAHnFwgpw disconnected, while sending SyncAccountsData
Mar 11 16:57:54 node01new neard[882540]: 2024-03-11T15:57:54.167128Z  INFO near_network::peer_manager::connection: peer ed25519:2SVMYgTYcEgnxYGyr9XxHUZFHf7XfS43DaHNAHnFwgpw disconnected, while sending SyncAccountsData
Mar 11 16:57:55 node01new neard[882540]: thread '<unnamed>' panicked at runtime/runtime/src/actions.rs:172:13:
Mar 11 16:57:55 node01new neard[882540]: Contract runtime failed to load a contrct: Insufficient resources: could not allocate code memory: Cannot allocate memory (os error 12)
Mar 11 16:57:55 node01new neard[882540]: stack backtrace:
Mar 11 16:57:55 node01new neard[882540]: thread '<unnamed>' panicked at runtime/runtime/src/actions.rs:172:13:
Mar 11 16:57:55 node01new neard[882540]: Contract runtime failed to load a contrct: Insufficient resources: could not allocate code memory: Cannot allocate memory (os error 12)
Mar 11 16:57:55 node01new neard[882540]:    0: rust_begin_unwind                                                                                        
Mar 11 16:57:55 node01new neard[882540]:              at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5             
Mar 11 16:57:55 node01new neard[882540]:    1: core::panicking::panic_fmt                                                                               
Mar 11 16:57:55 node01new neard[882540]:              at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14            
Mar 11 16:57:55 node01new neard[882540]:    2: node_runtime::actions::execute_function_call                                                             
Mar 11 16:57:55 node01new neard[882540]:    3: node_runtime::Runtime::apply_action                                                                      
Mar 11 16:57:55 node01new neard[882540]:    4: node_runtime::Runtime::apply_action_receipt                                                              
Mar 11 16:57:55 node01new neard[882540]:    5: node_runtime::Runtime::apply::{{closure}}                                                                
Mar 11 16:57:55 node01new neard[882540]:    6: node_runtime::Runtime::apply                                                                             
Mar 11 16:57:55 node01new neard[882540]:    7: <nearcore::runtime::NightshadeRuntime as near_chain::types::RuntimeAdapter>::apply_chunk                 
Mar 11 16:57:55 node01new neard[882540]:    8: near_chain::update_shard::apply_new_chunk                                                                
Mar 11 16:57:55 node01new neard[882540]:    9: <rayon_core::job::HeapJob<BODY> as rayon_core::job::Job>::execute                                        
Mar 11 16:57:55 node01new neard[882540]:   10: rayon_core::registry::WorkerThread::wait_until_cold                                                      
Mar 11 16:57:55 node01new neard[882540]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.                        
Mar 11 16:57:55 node01new neard[882540]: stack backtrace:                                                                                               
Mar 11 16:57:55 node01new neard[882540]:    0: rust_begin_unwind                                                                                        
Mar 11 16:57:56 node01new neard[882540]:              at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5

In syslog:

Mar 11 16:57:56 node01new systemd[1]: neard.service: Main process exited, code=killed, status=6/ABRT                                                    
Mar 11 16:57:56 node01new systemd[1]: neard.service: Failed with result 'signal'.                                                                       
Mar 11 16:58:26 node01new systemd[1]: neard.service: Scheduled restart job, restart counter is at 1.                                                    
Mar 11 16:58:26 node01new systemd[1]: Stopped Near node.                                                                                                

Grafana is installed on the server, and nothing there stands out either.

Screenshots: Screenshot_2024-03-11_19-36-08, Screenshot_2024-03-11_19-35-47, Screenshot_2024-03-11_19-37-04, Screenshot_2024-03-11_19-37-19, Screenshot_2024-03-11_19-37-35, Screenshot_2024-03-11_19-37-51, Screenshot_2024-03-11_19-38-08

The only thing that stands out is an increased number of memory page faults: Screenshot_2024-03-11_20-00-26

walnut-the-cat commented 4 months ago

cc. @alexauroradev @marcelo-gonzalez

pugachAG commented 4 months ago

This looks like an issue caused by insufficient memory, but the "Memory Basic" dashboard seems to indicate that the node still had enough memory. @nagisa this originated in the runtime; maybe you know what the source could be?
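For readers decoding the two numeric codes in the logs above, here is a minimal Python sketch (not part of the original thread) that maps `os error 12` and `status=6/ABRT` to their symbolic names using only the standard library. The helper names `describe_errno` and `describe_signal` are made up for illustration. On Linux, errno 12 is `ENOMEM`; note that `mmap` can return `ENOMEM` for reasons other than exhausted RAM (for example, a per-process mapping limit), so free memory on a dashboard does not by itself rule the error out. This does not establish what the cause was in this case.

```python
import errno
import os
import signal


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an OS error number (hypothetical helper)."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


def describe_signal(number: int) -> str:
    """Return the symbolic name of a signal number (hypothetical helper)."""
    return signal.Signals(number).name


if __name__ == "__main__":
    # "os error 12" from the panic message
    print(describe_errno(12))
    # "status=6/ABRT" from the systemd line
    print(describe_signal(6))
```

The abort signal is consistent with the panic: the process terminated itself after the runtime failed to allocate code memory, rather than being killed by the kernel OOM killer.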

pugachAG commented 4 months ago

The issue should now be fixed by https://github.com/near/nearcore/pull/10733 and https://github.com/near/nearcore/pull/10736, so this should no longer happen once we release a new neard version for statelessnet.

DDeAlmeida commented 4 months ago
 Mem: 4.41 GB
2024-03-11T18:45:52.651473Z  INFO stats: #114503377 Bp2yKPEKsQREgTbijWLohffYip38gn2X412ipaBv7Ro3 Validator | 20 validators 32 peers ⬇ 14.3 MB/s ⬆ 1.49 MB/s 1.00 bps 0 gas/s CPU: 245%, Mem: 4.43 GB
2024-03-11T18:46:02.652847Z  INFO stats: #114503386 J52mEV3xWyeZtuj2KwS3stz1kFc8fTJa58Hg96935frv Validator | 20 validators 32 peers ⬇ 14.9 MB/s ⬆ 1.76 MB/s 0.90 bps 0 gas/s CPU: 249%, Mem: 4.42 GB
2024-03-11T18:46:12.654089Z  INFO stats: #114503396 C3uD39kbbvXhpFBYp5f85whmCpd32y5ExepeEzEVrzfj Validator | 20 validators 33 peers ⬇ 15.4 MB/s ⬆ 1.87 MB/s 1.00 bps 0 gas/s CPU: 239%, Mem: 4.39 GB
2024-03-11T18:46:20.461053Z  INFO near_network::peer_manager::connection: peer ed25519:AJJ7CC1GKyJAUKVd9xRt9YED99i176b9rm4NZRCYCM1s disconnected, while sending SyncAccountsData
2024-03-11T18:46:20.461069Z  INFO near_network::peer_manager::connection: peer ed25519:AJJ7CC1GKyJAUKVd9xRt9YED99i176b9rm4NZRCYCM1s disconnected, while sending SyncAccountsData
2024-03-11T18:46:20.461074Z  INFO near_network::peer_manager::connection: peer ed25519:AJJ7CC1GKyJAUKVd9xRt9YED99i176b9rm4NZRCYCM1s disconnected, while sending SyncAccountsData
2024-03-11T18:46:20.461078Z  INFO near_network::peer_manager::connection: peer ed25519:AJJ7CC1GKyJAUKVd9xRt9YED99i176b9rm4NZRCYCM1s disconnected, while sending SyncAccountsData
2024-03-11T18:46:22.655366Z  INFO stats: #114503406 4GZMm1cNMviUt1S5MymkJBUxmrEjjbH8E6tbaRX9MZ51 Validator | 20 validators 32 peers ⬇ 15.8 MB/s ⬆ 1.85 MB/s 1.00 bps 0 gas/s CPU: 259%, Mem: 4.38 GB
2024-03-11T18:46:32.657147Z  INFO stats: #114503416 F5vSuq2qWAWYMMbJSNDDAmmKfCResjMAyYgxPjaPdie8 Validator | 20 validators 32 peers ⬇ 15.7 MB/s ⬆ 1.86 MB/s 1.00 bps 0 gas/s CPU: 176%, Mem: 4.34 GB
2024-03-11T18:46:42.658467Z  INFO stats: #114503423 3zSq8mDZvSvSAHNHuyanFnMQBdQofaVdDJy4NyC6yudg Validator | 20 validators 32 peers ⬇ 14.3 MB/s ⬆ 1.75 MB/s 0.70 bps 0 gas/s CPU: 124%, Mem: 4.37 GB
2024-03-11T18:46:52.658933Z  INFO stats: #114503434 AjAA6m2Bow9jeHctRLwvcLDkdxDhv1wt82oEJ9ZoisvR Validator | 20 validators 33 peers ⬇ 13.9 MB/s ⬆ 1.82 MB/s 1.00 bps 0 gas/s CPU: 202%, Mem: 4.41 GB
thread '<unnamed>' panicked at runtime/runtime/src/actions.rs:172:13:
Contract runtime failed to load a contrct: Insufficient resources: could not allocate code memory: Cannot allocate memory (os error 12)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
   2: node_runtime::actions::execute_function_call
   3: node_runtime::Runtime::apply_action
   4: node_runtime::Runtime::apply_action_receipt
   5: node_runtime::Runtime::apply::{{closure}}
   6: node_runtime::Runtime::apply
   7: <nearcore::runtime::NightshadeRuntime as near_chain::types::RuntimeAdapter>::apply_chunk
   8: near_chain::update_shard::apply_new_chunk
   9: <rayon_core::job::HeapJob<BODY> as rayon_core::job::Job>::execute
  10: rayon_core::registry::WorkerThread::wait_until_cold
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Aborted

Same here
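To compare reports like the two above, a small Python sketch (not from the thread) can pull the block height, CPU, and memory figures out of the `INFO stats` lines and flag whether the "could not allocate code memory" panic appears. The regular expression is inferred only from the log lines shown in this issue; the `#033[...m` stripping assumes journald-style escaped color codes as in the first report, and the KB/MB units are an assumption since only GB appears here. `summarize` is a hypothetical helper name.

```python
import re

# Journald renders ANSI color escapes as "#033[...m"; strip them before matching.
ANSI_RE = re.compile(r"#033\[[0-9;]*m")
# Pattern inferred from the stats lines in this thread (an assumption, not a spec).
STATS_RE = re.compile(
    r"INFO stats: #(?P<height>\d+) .*?"
    r"CPU: (?P<cpu>\d+)%, Mem: (?P<mem>[\d.]+) (?P<unit>[KMG]B)"
)
PANIC_MARKER = "could not allocate code memory"
TO_GB = {"KB": 1e-6, "MB": 1e-3, "GB": 1.0}


def summarize(lines):
    """Return (samples, panicked); samples are (height, cpu_percent, mem_gb)."""
    samples, panicked = [], False
    for raw in lines:
        line = ANSI_RE.sub("", raw)
        m = STATS_RE.search(line)
        if m:
            mem_gb = float(m["mem"]) * TO_GB[m["unit"]]
            samples.append((int(m["height"]), int(m["cpu"]), mem_gb))
        elif PANIC_MARKER in line:
            panicked = True
    return samples, panicked


if __name__ == "__main__":
    log = [
        "2024-03-11T18:45:52.651473Z  INFO stats: #114503377 "
        "Bp2yKPEKsQREgTbijWLohffYip38gn2X412ipaBv7Ro3 Validator | 20 validators "
        "32 peers ⬇ 14.3 MB/s ⬆ 1.49 MB/s 1.00 bps 0 gas/s CPU: 245%, Mem: 4.43 GB",
        "Contract runtime failed to load a contrct: Insufficient resources: "
        "could not allocate code memory: Cannot allocate memory (os error 12)",
    ]
    print(summarize(log))
```

In both reports the process's reported memory stays around 4 GB right up to the panic, which matches the observation that the dashboards showed no obvious memory pressure.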

walnut-the-cat commented 4 months ago

@SNSMLN, @DDeAlmeida, could you confirm whether the issue is gone now?

SNSMLN commented 4 months ago

> @SNSMLN, @DDeAlmeida, could you confirm whether the issue is gone now?

After updating to build 1.36.1-298-g984f6ad71, the issue has not reappeared.

DDeAlmeida commented 4 months ago

Same here