Closed: MrFreezeDZ closed this issue 5 days ago
Maybe the same thing as https://github.com/ledgerwatch/erigon/issues/10734. Try:
integration state_stages --unwind=10
integration stage_headers --unwind=10
then start Erigon.
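On this particular setup the integration tool also needs the node's chain, datadir, and Heimdall flags; a sketch of the full invocations, using the values that appear in the logs below (the node should normally be stopped first so the tool can open the database exclusively):

```shell
# Unwind the state stages by 10 blocks, then the headers stage,
# pointing at the node's datadir and Heimdall REST endpoint.
integration --chain bor-mainnet --datadir /data --bor.heimdall http://heimdallrest state_stages --unwind=10
integration --chain bor-mainnet --datadir /data --bor.heimdall http://heimdallrest stage_headers --unwind=10
```

`print_stages` with the same flags can be run before and after to confirm the stage progress actually moved back.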
Thank you very much for the quick response. We ran the commands and it seems to work: Erigon is running and catching up again, and the memory consumption graph also looks normal again, so I think the problem is solved. There was an ERROR message in the integration logs at 09:46:25.956, but reading the messages I think everything worked correctly. Out of curiosity we executed a "print_stages" before, after, and between the two given integration commands. Perhaps the integration logs will help to find the cause?
integration --chain bor-mainnet --datadir /data --bor.heimdall http://heimdallrest print_stages
INFO[06-24|09:41:55.746] logging to file system log dir=/data/logs file prefix=integration log level=info json=false
INFO[06-24|09:41:55.770] [db] open lable=chaindata sizeLimit=12TB pageSize=16384
Note: prune_at doesn't mean 'all data before were deleted' - it just mean stage.Prune function were run to this block. Because 1 stage may prune multiple data types to different prune distance.
stage_at prune_at
Snapshots 58494359 0
Headers 58494419 0
BorHeimdall 58494419 0
BlockHashes 58494419 0
Bodies 58494419 0
Senders 58494419 0
Execution 58494419 58494419
Translation 0 0
HashState 58494359 0
IntermediateHashes 58494359 58494359
AccountHistoryIndex 58494359 0
StorageHistoryIndex 58494359 0
LogIndex 58494359 0
CallTraces 58494359 58494359
TxLookup 58494359 0
Finish 58494359 0
--
prune distance:
blocks.v2: false, blocks=0, segments=0, indices=0
blocks.bor.v2: segments=0, indices=0
history.v3: false, idx steps: 0.00, lastBlockInSnap=0, TxNums_Index(0,1)
sequence: EthTx=4274791252, NonCanonicalTx=1534519
in db: first header 1, last header 58543547, first body 1, last body 58543547
--
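The before/after stage tables can be compared mechanically. A minimal sketch of a parser for the three-column `stage_at prune_at` layout shown above (a hypothetical helper, not part of Erigon's tooling):

```python
# Parse the stage table printed by `integration print_stages` into a dict,
# so two snapshots can be diffed to confirm which stages were unwound.

SAMPLE = """\
Snapshots 58494359 0
Headers 58494419 0
Execution 58494419 58494419
"""

def parse_stages(text):
    """Return {stage_name: (stage_at, prune_at)} from print_stages output."""
    stages = {}
    for line in text.splitlines():
        parts = line.split()
        # Stage rows have exactly three columns: name, stage_at, prune_at.
        # Header lines and the "prune distance" footer do not match this shape.
        if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
            stages[parts[0]] = (int(parts[1]), int(parts[2]))
    return stages

before = parse_stages(SAMPLE)
print(before["Headers"])  # stage_at and prune_at of the Headers stage
```

Diffing the tables before and after `state_stages --unwind=10` shows, for example, Execution moving from 58494419 back to 58494389.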
/ $ integration --chain bor-mainnet --datadir /data --bor.heimdall http://heimdallrest state_stages --unwind=10
INFO[06-24|09:42:36.758] logging to file system log dir=/data/logs file prefix=integration log level=info json=false
INFO[06-24|09:42:36.766] [db] open lable=chaindata sizeLimit=12TB pageSize=16384
INFO[06-24|09:42:36.865] Opening Database label=bor path=/data/bor
INFO[06-24|09:42:47.859] [8/15 HashState] Promoting plain state from=58494359 to=58494419
INFO[06-24|09:42:47.859] [8/15 HashState] Incremental promotion from=58494359 to=58494419 codes=true csbucket=AccountChangeSet
INFO[06-24|09:43:07.530] [8/15 HashState] Incremental promotion from=58494359 to=58494419 codes=false csbucket=AccountChangeSet
INFO[06-24|09:43:23.295] [8/15 HashState] Incremental promotion from=58494359 to=58494419 codes=false csbucket=StorageChangeSet
INFO[06-24|09:44:08.293] [8/15 HashState] ETL [2/2] Loading into=HashedStorage current_prefix=e9dae3d7
INFO[06-24|09:44:11.642] [8/15 HashState] DONE in=1m23.783196667s
INFO[06-24|09:44:11.643] [9/15 IntermediateHashes] Generating intermediate hashes from=58494359 to=58494419
INFO[06-24|09:44:41.772] [9/15 IntermediateHashes] Calculating Merkle root current key=499a35fd
INFO[06-24|09:45:11.769] [9/15 IntermediateHashes] Calculating Merkle root current key=94788105
INFO[06-24|09:45:41.872] [9/15 IntermediateHashes] Calculating Merkle root current key=e0967326
INFO[06-24|09:46:16.373] [9/15 IntermediateHashes] Calculating Merkle root current key=e9dae3d7
EROR[06-24|09:46:25.956] [9/15 IntermediateHashes] Wrong trie root of block 58494419: f5b101a3788642d0199d7b1fe0493dd82340a590316967f8553b72f948900407, expected (from header): d5e06ab51fcddfa0887c3db8305aecf1f29928bc2726a2f59611a8a5674e166b. Block hash: 42e6674c65b0bf044670324585ee11a33a3142b6c5e10843a16adb6441e94f7d
WARN[06-24|09:46:25.957] Unwinding due to incorrect root hash to=58494389
INFO[06-24|09:46:25.957] [9/15 IntermediateHashes] DONE in=2m14.314088045s
INFO[06-24|09:46:25.957] [8/15 HashState] Unwinding started from=58494419 to=58494389 storage=false codes=true
INFO[06-24|09:46:25.977] [8/15 HashState] Unwinding started from=58494419 to=58494389 storage=false codes=false
INFO[06-24|09:46:26.019] [8/15 HashState] Unwinding started from=58494419 to=58494389 storage=true codes=false
INFO[06-24|09:46:26.169] [7/15 Execution] Unwind Execution from=58494419 to=58494389
/ $ integration --chain bor-mainnet --datadir /data --bor.heimdall http://heimdallrest print_stages
INFO[06-24|09:46:45.884] logging to file system log dir=/data/logs file prefix=integration log level=info json=false
INFO[06-24|09:46:45.891] [db] open lable=chaindata sizeLimit=12TB pageSize=16384
Note: prune_at doesn't mean 'all data before were deleted' - it just mean stage.Prune function were run to this block. Because 1 stage may prune multiple data types to different prune distance.
stage_at prune_at
Snapshots 58494359 0
Headers 58494419 0
BorHeimdall 58494389 0
BlockHashes 58494419 0
Bodies 58494419 0
Senders 58494419 0
Execution 58494389 58494419
Translation 0 0
HashState 58494389 0
IntermediateHashes 58494359 58494359
AccountHistoryIndex 58494359 0
StorageHistoryIndex 58494359 0
LogIndex 58494359 0
CallTraces 58494359 58494359
TxLookup 58494359 0
Finish 58494359 0
--
prune distance:
blocks.v2: false, blocks=0, segments=0, indices=0
blocks.bor.v2: segments=0, indices=0
history.v3: false, idx steps: 0.00, lastBlockInSnap=0, TxNums_Index(0,1)
sequence: EthTx=4274791252, NonCanonicalTx=1534519
in db: first header 1, last header 58543547, first body 1, last body 58543547
--
/ $ integration --chain bor-mainnet --datadir /data --bor.heimdall http://heimdallrest stage_headers --unwind=10
INFO[06-24|09:47:11.968] logging to file system log dir=/data/logs file prefix=integration log level=info json=false
INFO[06-24|09:47:11.974] [db] open lable=chaindata sizeLimit=12TB pageSize=16384
INFO[06-24|09:47:12.031] Opening Database label=bor path=/data/bor
INFO[06-24|09:47:32.770] TruncateBlocks block=58498042
INFO[06-24|09:47:52.769] TruncateBlocks block=58501711
INFO[06-24|09:48:12.785] TruncateBlocks block=58505520
INFO[06-24|09:48:32.773] TruncateBlocks block=58509740
INFO[06-24|09:48:52.771] TruncateBlocks block=58513827
INFO[06-24|09:49:12.772] TruncateBlocks block=58517661
INFO[06-24|09:49:32.778] TruncateBlocks block=58522270
INFO[06-24|09:49:52.769] TruncateBlocks block=58526813
INFO[06-24|09:50:12.774] TruncateBlocks block=58530811
INFO[06-24|09:50:32.770] TruncateBlocks block=58534131
INFO[06-24|09:50:52.770] TruncateBlocks block=58537614
INFO[06-24|09:51:12.772] TruncateBlocks block=58540927
INFO[06-24|09:51:30.842] Progress headers=58494409
/ $ integration --chain bor-mainnet --datadir /data --bor.heimdall http://heimdallrest print_stages
INFO[06-24|09:51:42.270] logging to file system log dir=/data/logs file prefix=integration log level=info json=false
INFO[06-24|09:51:42.276] [db] open lable=chaindata sizeLimit=12TB pageSize=16384
Note: prune_at doesn't mean 'all data before were deleted' - it just mean stage.Prune function were run to this block. Because 1 stage may prune multiple data types to different prune distance.
stage_at prune_at
Snapshots 58494359 0
Headers 58494409 0
BorHeimdall 58494389 0
BlockHashes 58494419 0
Bodies 58494409 0
Senders 58494419 0
Execution 58494389 58494419
Translation 0 0
HashState 58494389 0
IntermediateHashes 58494359 58494359
AccountHistoryIndex 58494359 0
StorageHistoryIndex 58494359 0
LogIndex 58494359 0
CallTraces 58494359 58494359
TxLookup 58494359 0
Finish 58494359 0
--
prune distance:
blocks.v2: false, blocks=0, segments=0, indices=0
blocks.bor.v2: segments=0, indices=0
history.v3: false, idx steps: 0.00, lastBlockInSnap=0, TxNums_Index(0,1)
sequence: EthTx=4274791252, NonCanonicalTx=1534519
in db: first header 1, last header 58494409, first body 1, last body 58494409
System information
Erigon version:
./erigon --version
2.59.3
OS & Version: Kubernetes, image from here: https://hub.docker.com/r/thorax/erigon. The Erigon container has resources of 16 CPUs and 112Gi of memory.
Commit hash: 088fd8ef69389a72da6faa0fc7903a4ba5726911
Erigon Command (with flags/config): erigon --chain=bor-mainnet --datadir=/data/ --log.json=true --http.addr=0.0.0.0 --http.vhosts= --http --http.api=eth,admin,debug,net,trace,web3,erigon,txpool --ws --bor.heimdall=http://heimdallrest --authrpc.vhosts= --authrpc.jwtsecret=/secret/jwt.hex --authrpc.addr=0.0.0.0 --db.size.limit=12TB --db.pagesize=16k --metrics --metrics.addr=0.0.0.0 --maxpeers=500 --torrent.download.rate=300mb --bootnodes=enode://bdcd4786a616a853b8a041f53496d853c68d99d54ff305615cd91c03cd56895e0a7f6e9f35dbf89131044e2114a9a782b792b5661e3aff07faf125a98606a071@43.200.206.40:30303,enode://209aaf7ed549cf4a5700fd833da25413f80a1248bd3aa7fe2a87203e3f7b236dd729579e5c8df61c97bf508281bae4969d6de76a7393bcbd04a0af70270333b3@54.216.248.9:30303
Consensus Layer: Heimdall 1.0.7
Consensus Layer Command (with flags/config): /usr/bin/heimdalld start --home=/heimdall-home
Chain/Network: bor-mainnet
Expected behaviour
Erigon should keep working; a bad header should not make Erigon log it every few milliseconds until the process gets OOMKilled by Kubernetes.
Actual behaviour
Erigon repeatedly logs the "Rejected header marked as bad" message (see Backtrace below).
Steps to reproduce the behaviour
I do not know how to actively reproduce this behaviour. Erigon was updated to 2.59.3 last Friday.
Backtrace
These logs are copy-pasted from the Google Cloud UI, so the timestamps before each line come from Google's Logs Explorer. At 2024-06-23 07:18:40.271 the logs start to fill with the "Rejected header marked as bad" message. In the memory usage graph we can see consumption grow steadily until Kubernetes kills the pod.
Please help us, as this is our production Erigon instance.