hirosystems / ordhook

Build indexers, standards and protocols on top of Ordinals and Inscriptions (BRC20, etc).
Apache License 2.0

ordhook frequently running out of memory in a kubernetes container #245

Closed: 4ker-dep closed this issue 3 months ago

4ker-dep commented 8 months ago

Hi! I'm currently running ordhook as a container inside a k8s cluster on a node with the following characteristics:

OS: linux (amd64)
OS Image: Amazon Linux 2
Kernel version: 5.10.199-190.747.amzn2.x86_64
Container runtime: containerd://1.7.2
Kubelet version: v1.24.17-eks-e71965b

The latest logs can be seen below:

{"msg":"Completed ordinal number retrieval for Satpoint 0x05bbb59c01d9f3f44490c551856549d417993e426fcb26b06c5d79cbcaad13d6:0:0 (block: #89596:3943833505, transfers: 817, progress: 357/6103, priority queue: true, thread: 156)","level":"INFO","ts":"2024-01-08T15:55:08.046252351Z"}
{"msg":"Completed ordinal number retrieval for Satpoint 0x1a72b28b5cc48b1764e547d19036184d9ad1cb01f3b4af5a6b66812a8e0c76d7:0:0 (block: #792260:503784574, transfers: 108, progress: 358/6103, priority queue: true, thread: 157)","level":"INFO","ts":"2024-01-08T15:55:08.046280667Z"}
{"msg":"Completed ordinal number retrieval for Satpoint 0x4a9cd716190a772699c24406d9e0555cf1453295a661a02d0ea597670505a4d8:0:0 (block: #688454:299151663, transfers: 424, progress: 359/6103, priority queue: true, thread: 158)","level":"INFO","ts":"2024-01-08T15:55:08.046308098Z"}
{"msg":"Completed ordinal number retrieval for Satpoint 0x0706338d6d7e13d17fd67af60309c8e351c5a0245b03c14e5cf6fe4bdb917ed9:0:0 (block: #89596:3943820770, transfers: 817, progress: 360/6103, priority queue: true, thread: 159)","level":"INFO","ts":"2024-01-08T15:55:08.046334941Z"}
{"msg":"Starting service...","level":"INFO","ts":"2024-01-08T15:55:28.536621703Z"}
{"msg":"Checking database integrity up to block #797798","level":"INFO","ts":"2024-01-08T15:55:28.871883967Z"}

My current setup is described in a StatefulSet manifest on a node with 30 GiB of total memory. It seems to work until it reaches the configured memory limit, then restarts with Reason: OOMKilled (exit code 137). I'm not sure if this is a memory leak or some k8s-specific issue. With different values set for the memory limit, usage stays near the allowed maximum but eventually crosses it regardless of the value in place.

Screenshot 2024-01-08 at 17 09 44

I've dug through the whole source code for configuration options and there is very little to adjust. Needless to say, I've tried modifying max_caching_memory_size_mb with no apparent effect on the issue described above. Is there a way to better manage resources via ordhook? I think an application-level memory limit would be really beneficial to avoid frequent restarts and time-consuming integrity checks.
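
For reference, a sketch of the kind of memory stanza involved (container name and values are illustrative, not the exact manifest). Setting requests equal to limits gives the pod Guaranteed QoS, and the limit should keep headroom above max_caching_memory_size_mb, since that cap does not bound all of the process's allocations:

```yaml
# Illustrative StatefulSet excerpt, not the exact manifest in use.
spec:
  template:
    spec:
      containers:
        - name: ordhook            # hypothetical container name
          resources:
            requests:
              memory: "25Gi"       # requests == limits -> Guaranteed QoS
            limits:
              memory: "25Gi"       # exceeded -> exit code 137 (OOMKilled)
```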

lgalabru commented 8 months ago

@4ker-dep thank you for opening this issue. Could you tell me more about the frequency of these incidents? Which version are you running? We're also experiencing this sometimes; this issue is #1 on my hit list now that Jubilee is behind us. I agree that some application-level settings would be helpful. That was the initial intention of max_caching_memory_size_mb, but I've been iterating a few times on how this key is used, and the current approach might not be the best one.

lgalabru commented 8 months ago

Also, could you tell me more about your disk configuration and the performance you're seeing with your setup?

4ker-dep commented 8 months ago

> @4ker-dep thank you for opening this issue. Could you tell me more about the frequency of these incidents? Which version are you running? We're also experiencing this sometimes; this issue is #1 on my hit list now that Jubilee is behind us. I agree that some application-level settings would be helpful. That was the initial intention of max_caching_memory_size_mb, but I've been iterating a few times on how this key is used, and the current approach might not be the best one.

Hi @lgalabru. With the current configuration:

[storage]
working_dir = "/opt/ordhook"

# The Http Api allows you to register / deregister
# dynamically predicates.
# Disable by default.
#
[http_api]
http_port = 20456
database_uri = "xxx"

[network]
mode = "mainnet"
bitcoind_rpc_url = "xxx"
bitcoind_rpc_username = "xxx"
bitcoind_rpc_password = "xxx"
# Bitcoin block events can be received by Chainhook
# either through a Bitcoin node's ZeroMQ interface,
# or through the Stacks node. Zmq is being
# used by default:
bitcoind_zmq_url = "xxx"
# but stacks can also be used:
# stacks_node_rpc_url = "xxx"

[limits]
max_number_of_bitcoin_predicates = 100
max_number_of_concurrent_bitcoin_scans = 100
max_number_of_processing_threads = 200
bitcoin_concurrent_http_requests_max = 16
max_caching_memory_size_mb = 20000

# Disable the following section if the state
# must be built locally
[bootstrap]
download_url = "https://archive.hiro.so/mainnet/ordhook/mainnet-ordhook-sqlite-latest"

[logs]
ordinals_internals = true
chainhook_internals = true

and a memory limit of 25 GiB, the restarts occurred at the following timestamps:

Started at: 2024-01-09T07:43:26+01:00
Finished at: 2024-01-09T09:39:16+01:00
Started at: 2024-01-09T09:39:17+01:00
Finished at: 2024-01-09T10:14:01+01:00
Started at: 2024-01-09T10:14:02+01:00
Finished at: 2024-01-09T11:36:55+01:00

Those are 3 out of 23 that happened in the last 22h.

> Also, could you tell me more about your disk configuration and the performance you're seeing with your setup?

Screenshot 2024-01-09 at 11 40 43

The attached disk is a gp3 volume with IOPS: 3000 and throughput: 125 MiB/s.

4ker-dep commented 8 months ago

@lgalabru, could you please provide tips on running ordhook with Redis? Would it be utilized at this stage of data retrieval? I've set the database_uri in the configuration file but I see no access logs on the Redis side.

4ker-dep commented 8 months ago

I forgot to mention: the version used is the fix/database-optims branch.

lgalabru commented 8 months ago

@4ker-dep that branch is definitely not recommended: it's outdated, and the improvements were not what I was looking for, so that PR was closed a moment ago. I'm currently doing some experimentation in this branch: https://github.com/hirosystems/ordhook/pull/250. Will report back here.

ss22219 commented 7 months ago

A 32G server still runs out of memory when using pr-250 (screenshot from 2024-01-31 attached).

version: '3'
services:
  ordhook-pr250:
    container_name: ordhook-250
    user: 1000:1000
    image: hirosystems/ordhook:pr-250
    volumes:
      - ./:/data/.ordhook
    command: service start --config-path=/data/.ordhook/Ordhook.toml
    network_mode: "host"
    restart: on-failure

Ordhook.toml:

[storage]
working_dir = "/data/.ordhook"

# The Http Api allows you to register / deregister
# dynamically predicates.
# Disable by default.
#
[http_api]
http_port = 20456
# database_uri = "redis://localhost:6379/"
[network]
mode = "mainnet"
bitcoind_rpc_url = "http://127.0.0.1:8332"
bitcoind_rpc_username = "devnet"
bitcoind_rpc_password = "devnet"
# Bitcoin block events can be received by Chainhook
# either through a Bitcoin node's ZeroMQ interface,
# or through the Stacks node. Zmq is being
# used by default:
bitcoind_zmq_url = "tcp://127.0.0.1:18543"
# but stacks can also be used:
# stacks_node_rpc_url = "http://0.0.0.0:20443"

[limits]
max_number_of_bitcoin_predicates = 8
max_number_of_concurrent_bitcoin_scans = 8
max_number_of_processing_threads = 8
bitcoin_concurrent_http_requests_max = 8
max_caching_memory_size_mb = 3200

# Disable the following section if the state
# must be built locally
[bootstrap]
# download_url = "https://archive.hiro.so/mainnet/ordhook/mainnet-ordhook-sqlite-pr-235-latest"

[logs]
ordinals_internals = true
chainhook_internals = true

[resources]
ulimit = 2048
cpu_core_available = 8
memory_available = 3200
bitcoind_rpc_threads = 8
bitcoind_rpc_timeout = 30
expected_observers_count = 1
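
One thing the compose file above does not do is cap the container's memory, so nothing stops the process short of exhausting the host. A hedged sketch of the missing key (the value is illustrative; mem_limit is honored by the Docker Compose v2 CLI, while under Swarm the equivalent is deploy.resources.limits.memory):

```yaml
# Illustrative addition to the compose file above: cap the container so a
# runaway process is OOM-killed at a known ceiling instead of taking the host.
services:
  ordhook-pr250:
    mem_limit: 8g   # illustrative; keep it above [resources] memory_available
```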

ss22219 commented 7 months ago

Memory usage goes up to 60G (WeChat screenshots from 2024-01-31 attached).

4ker-dep commented 7 months ago

With the release 2.1.0 we had 0 OOMKilled events. Awesome work, thanks @lgalabru! Unfortunately we suffered a data setback and had to start all over again due to:

{"msg":"unable to insert inscription in location in hord.sqlite: table locations has no column named ordinal_number","level":"ERRO","ts":"2024-02-13T11:34:53.48810074Z"}
{"msg":"unable to query hord.sqlite: table locations has no column named ordinal_number","level":"WARN","ts":"2024-02-13T11:34:53.488139256Z"}
{"msg":"unable to query hord.sqlite: table locations has no column named ordinal_number","level":"WARN","ts":"2024-02-13T11:34:54.488249366Z"}
{"msg":"unable to query hord.sqlite: table locations has no column named ordinal_number","level":"WARN","ts":"2024-02-13T11:34:55.488375249Z"}

Can we get some migration logic to prevent this from happening? Otherwise, before we finish collecting all of the historical data, a new release will come out and we'll have to reprocess the ordinals data again.
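
A schema guard could be as simple as checking PRAGMA table_info before starting the service. A minimal sketch (the table and column names are taken from the error messages above; the real hord.sqlite schema may differ, and a proper migration really belongs inside ordhook itself):

```shell
#!/bin/sh
# Demo against a throwaway db that mimics an older schema; in practice,
# point db at the real hord.sqlite. Requires the sqlite3 CLI.
db=/tmp/hord_demo.sqlite
rm -f "$db"
sqlite3 "$db" "CREATE TABLE locations (inscription_id TEXT, block_height INTEGER);"

# Add the missing column only if it is absent (idempotent):
if ! sqlite3 "$db" "PRAGMA table_info(locations);" | grep -q ordinal_number; then
  sqlite3 "$db" "ALTER TABLE locations ADD COLUMN ordinal_number INTEGER;"
fi
sqlite3 "$db" "PRAGMA table_info(locations);" | grep ordinal_number
```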