nervosnetwork / ckb

The Nervos CKB is a public permissionless blockchain, and the layer 1 of Nervos network.
https://www.nervos.org
MIT License

High-Intensity JSON RPC BatchRequest Causes Server Crash and Unpredictable Downtime #4520

Closed Cupnfish closed 3 weeks ago

Cupnfish commented 1 month ago

Bug Report

Current Behavior

When subjected to high-intensity JSON RPC batch requests, the CKB node causes the entire server to crash, rendering it completely inaccessible. The crash occurs once a certain batch size threshold is exceeded, and even after the requests stop, the server still crashes at an unpredictable later time.

Expected Behavior

I expect the node to send an error message to the client when it cannot handle high-intensity batch requests, rather than attempting to process them and crashing the server. Ideally, only the node itself should shut down, so that the rest of the server is not affected.

Environment

Additional context/Screenshots

The issue was first observed when my indexer node hit the bug and stopped working. I had sent large batch requests at high frequency for a few seconds before stopping. Approximately two hours later, the server suddenly became unresponsive.

eval-exec commented 1 month ago

Hello, what are the total memory, available memory, and CPU configuration of this server?

> high-intensity JSON RPC BatchRequest

Which RPC causes the entire server to crash?

> the CKB node causes the entire server to crash

Can you check the logs from the last crash and find out why it crashed (most likely OOM)? You can use sudo journalctl -b -1 to check.

chenyukang commented 1 month ago

I think this PR may help to resolve this issue: https://github.com/nervosnetwork/ckb/pull/4459

It is included in the v0.117 release.

chenyukang commented 1 month ago

We also have a work-in-progress PR that tries to limit the resources spent on heavy RPCs: https://github.com/nervosnetwork/ckb/pull/4469

Cupnfish commented 1 month ago

@eval-exec

The server's configuration is not critical. Previously, the server with lower specifications would crash when the batch size reached 1000. With the current server's improved specifications, it crashes when the batch size reaches 2000. The interface causing the crash is get_block_by_number.

Cupnfish commented 1 month ago

Thank you for your efforts, @chenyukang ! We're eagerly looking forward to the launch of this feature to optimize resource usage. Currently, I've already determined the batch size that our server can tolerate, and to avoid any further crashes, I won't be conducting related tests for now. However, I believe this improvement will have a positive impact on our system, and I'm looking forward to the resolution of this issue.

quake commented 1 month ago

RPC batch mode executes all the requests concurrently, waits for them to complete, and returns the results together. Note that you are calling the get_block_by_number RPC: each result is a full block in JSON format, so the response is very large, and a single batch request can easily use several gigabytes of memory. Multiple concurrent batch requests will then result in OOM.

I suggest limiting the batch size to a smaller number and using verbosity=0 to return the block data in molecule format, which is much smaller than JSON. Ref: https://github.com/nervosnetwork/ckb/tree/develop/rpc#method-get_block_by_number
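
As a rough client-side sketch of this suggestion, the snippet below splits a range of block numbers into small JSON-RPC batches and requests each block with verbosity 0. The helper names and the chunk size of 100 are illustrative assumptions, not values recommended in this thread. The parameter encoding (hex strings for both the block number and the verbosity) is assumed from the linked RPC README and should be checked against it.

```python
import json
from typing import Dict, Iterator, List


def build_batch(block_numbers: List[int], verbosity: int = 0) -> List[Dict]:
    """Build one JSON-RPC 2.0 batch of get_block_by_number calls."""
    return [
        {
            "jsonrpc": "2.0",
            "id": number,  # reuse the block number as the request id
            "method": "get_block_by_number",
            # Assumption: both parameters are hex-encoded strings.
            "params": [hex(number), hex(verbosity)],
        }
        for number in block_numbers
    ]


def chunked_batches(start: int, end: int, chunk_size: int = 100) -> Iterator[List[Dict]]:
    """Yield batches covering blocks [start, end), at most chunk_size calls each."""
    for first in range(start, end, chunk_size):
        last = min(first + chunk_size, end)
        yield build_batch(list(range(first, last)))


if __name__ == "__main__":
    for batch in chunked_batches(0, 250, chunk_size=100):
        body = json.dumps(batch)
        # Send `body` to the node's RPC endpoint one batch at a time,
        # waiting for each response before sending the next.
        print(len(batch), "requests,", len(body), "bytes")
```

Sending the batches sequentially, rather than all at once, keeps at most one batch of results in the node's memory at a time.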

eval-exec commented 1 month ago

@Cupnfish Could you provide the output of free -h?

Cupnfish commented 1 month ago

@eval-exec

free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi       7.2Gi       2.0Gi        40Mi       117Gi       118Gi
Swap:             0B          0B          0B

chenyukang commented 3 weeks ago

@Cupnfish In a future release of CKB, you can also add this RPC configuration option to limit the batch size of RPC requests:

https://github.com/nervosnetwork/ckb/pull/4529/files#diff-d6c5e396f46525d03037cb71857d668d04802896dee4faf52caf8a1619c22b41R137
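
To illustrate the general idea behind such a limit (this is not the code from the linked PR), a server can reject an oversized batch before executing any of its calls and return a single JSON-RPC error instead. In the sketch below, the function name, the limit value, and the error message are hypothetical. -32600 is the standard JSON-RPC 2.0 "Invalid Request" code, and using it here is an assumption.

```python
from typing import Any, Dict, Optional

INVALID_REQUEST = -32600  # standard JSON-RPC 2.0 "Invalid Request" code


def reject_oversized_batch(payload: Any, batch_limit: int) -> Optional[Dict]:
    """Return a JSON-RPC error object if payload is a batch larger than batch_limit.

    Returns None when the payload is a single request or an acceptable batch.
    """
    if isinstance(payload, list) and len(payload) > batch_limit:
        return {
            "jsonrpc": "2.0",
            "id": None,  # no single request id applies to the whole batch
            "error": {
                "code": INVALID_REQUEST,
                "message": f"batch size {len(payload)} exceeds limit {batch_limit}",
            },
        }
    return None
```

Checking the size up front means the client gets an immediate error and the node never allocates memory for the rejected calls.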