paradigmxyz / reth

Modular, contributor-friendly and blazing-fast implementation of the Ethereum protocol, in Rust
https://reth.rs/
Apache License 2.0
3.97k stars 1.2k forks source link

Database read transaction has been open for too long #10149

Closed winksaville closed 3 months ago

winksaville commented 3 months ago

Describe the bug

My reth/lighthouse node stopped and on reth I see the following in the systemd logs (see reth.service.logs in the tar.gz file)

Aug 05 11:53:43 ethel reth[1528]: 2024-08-05T18:53:43.834969Z  WARN The database read transaction has been open for too long. Backtrace:
Aug 05 11:53:43 ethel reth[1528]:    0: <unknown>
Aug 05 11:53:43 ethel reth[1528]:    1: <unknown>
Aug 05 11:53:43 ethel reth[1528]:    2: <unknown>
Aug 05 11:53:43 ethel reth[1528]:    3: <unknown>
Aug 05 11:53:43 ethel reth[1528]:    4: <unknown>
Aug 05 11:53:43 ethel reth[1528]:    5: <unknown>
Aug 05 11:53:43 ethel reth[1528]:    6: <unknown>
Aug 05 11:53:43 ethel reth[1528]:    7: <unknown>
Aug 05 11:53:43 ethel reth[1528]:    8: <unknown>
Aug 05 11:53:43 ethel reth[1528]:    9: <unknown>
Aug 05 11:53:43 ethel reth[1528]:   10: <unknown>
Aug 05 11:53:43 ethel reth[1528]:   11: <unknown>
Aug 05 11:53:43 ethel reth[1528]:   12: <unknown>
Aug 05 11:53:43 ethel reth[1528]:   13: <unknown>
Aug 05 11:53:43 ethel reth[1528]:   14: <unknown>
Aug 05 11:53:43 ethel reth[1528]:   15: <unknown>
Aug 05 11:53:43 ethel reth[1528]:   16: <unknown>
Aug 05 11:53:43 ethel reth[1528]:   17: <unknown>
Aug 05 11:53:43 ethel reth[1528]:   18: <unknown>
Aug 05 11:53:43 ethel reth[1528]:   19: <unknown>
Aug 05 11:53:43 ethel reth[1528]:   20: start_thread
Aug 05 11:53:43 ethel reth[1528]:              at ./nptl/pthread_create.c:442:8
Aug 05 11:53:43 ethel reth[1528]:   21: __GI___clone3
Aug 05 11:53:43 ethel reth[1528]:              at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Aug 05 11:53:43 ethel reth[1528]:  open_duration=137.798765174s self.txn_id=2603142
Aug 05 11:53:54 ethel reth[1528]: 2024-08-05T18:53:54.788004Z  INFO Status connected_peers=16 freelist=778508 latest_block=2074149
Aug 05 11:54:08 ethel reth[1528]: 2024-08-05T18:54:08.817555Z  WARN Beacon client online, but no consensus updates received for a while. This may be because of a reth error, or an error in the beacon client! Please investigate reth and beacon client logs! period=173.983707479s
Aug 05 11:54:19 ethel reth[1528]: 2024-08-05T18:54:19.788325Z  INFO Status connected_peers=16 freelist=778508 latest_block=2074149
Aug 05 11:54:44 ethel reth[1528]: 2024-08-05T18:54:44.788334Z  INFO Status connected_peers=16 freelist=778508 latest_block=2074149
Aug 05 11:55:09 ethel reth[1528]: 2024-08-05T18:55:09.788111Z  INFO Status connected_peers=16 freelist=778508 latest_block=2074149
Aug 05 11:55:34 ethel reth[1528]: 2024-08-05T18:55:34.788278Z  INFO Status connected_peers=16 freelist=778508 latest_block=2074149
Aug 05 11:55:59 ethel reth[1528]: 2024-08-05T18:55:59.787909Z  INFO Status connected_peers=16 freelist=778508 latest_block=2074149
Aug 05 11:56:01 ethel systemd[1]: reth.service: Main process exited, code=dumped, status=7/BUS
Aug 05 11:56:01 ethel systemd[1]: reth.service: Failed with result 'core-dump'.
Aug 05 11:56:01 ethel systemd[1]: reth.service: Consumed 37min 14.763s CPU time.
-- Boot 60d1df9338a14f0782ad9c5e5d14238f --
Aug 05 12:30:12 ethel systemd[1]: Starting reth eth2 service...
Aug 05 12:30:14 ethel systemd[1]: Started reth eth2 service.
Aug 05 12:30:14 ethel reth[1568]: 2024-08-05T19:30:14.480414Z  INFO Initialized tracing, debug log directory: /home/kendall/eth2-data/reth/holesky
Aug 05 12:30:14 ethel reth[1568]: 2024-08-05T19:30:14.484785Z  INFO Starting reth version="1.0.1-dev (d5993937)"

I received a message that there was missing attestations from beaconch.in about 1/2 hour later and I decided to shutdown and restart the machine as you can see at the end of the snippet.

After the restart it's running fine.

Steps to reproduce

No idea how to reproduce the issue.

Here is my service file:

kendall@ethel 24-08-06T17:08:50.926Z:~
$ cat /etc/systemd/system/reth.service
# The ethereum service (part of systemd)
# file: /etc/systemd/system/eth1.service

[Unit]
Description     = reth eth2 service
Requires        = network-online.target
After           = network-online.target

[Service]
User            = kendall
ExecStartPre    = /bin/sleep 2
ExecStart       = \
  /home/kendall/bin/reth node \
  --full \
  --chain holesky \
  --instance 1 \
  --max-outbound-peers 16 \
  --max-inbound-peers 16 \
  --http \
  --http.addr 0.0.0.0 \
  --ws \
  --http.api admin,eth,web3,net \
  --datadir /home/kendall/eth2-data/reth/holesky \
  --authrpc.addr 127.0.0.1 \
  --authrpc.port 8551 \
  --authrpc.jwtsecret /home/kendall/eth2-data/secrets/jwtsecret \
  --log.file.format terminal \
  --log.file.filter debug \
  --log.file.directory /home/kendall/eth2-data/reth \
  --log.file.max-size 200 \
  --log.file.max-files 5 \
  --log.journald \
  --log.journald.filter error \
  --color auto \
  --metrics 0.0.0.0:9001 \

Restart         = no
RestartSec      = 3
TimeoutSec      = 300

[Install]
WantedBy    = multi-user.target
kendall@ethel 24-08-06T17:08:59.255Z:~

Here is other info about my system:

kendall@ethel 24-08-06T17:15:24.096Z:~
$ print-info.sh 

~/bin/reth --version:
reth Version: 1.0.1-dev
Commit SHA: d599393771f9d7d137ea4abf271e1bd118184c73
Build Timestamp: 2024-07-10T21:23:32.330216026Z
Build Features: jemalloc
Build Profile: release

~/bin/lighthouse --version:
Lighthouse v5.2.1-9e12c21
BLS library: blst-portable
SHA256 hardware acceleration: false
Allocator: jemalloc
Profile: maxperf
Specs: mainnet (true), minimal (false), gnosis (true)

inxi:
CPU: quad core Intel Core i5-8259U (-MT MCP-) speed/min/max: 900/400/3800 MHz
Kernel: 6.5.0-45-generic x86_64 Up: 21h 47m Mem: 6427.8/31953.5 MiB (20.1%)
Storage: 2.73 TiB (25.7% used) Procs: 249 Shell: Bash inxi: 3.3.13

lsb_release -a:
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:    22.04
Codename:   jammy

df -h:
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.2G  1.9M  3.2G   1% /run
/dev/nvme0n1p2  916G  140G  730G  17% /
tmpfs            16G     0   16G   0% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
efivarfs        192K   59K  129K  32% /sys/firmware/efi/efivars
/dev/nvme0n1p1  511M  6.1M  505M   2% /boot/efi
/dev/sda1       1.8T  580G  1.2T  34% /home/kendall/eth2-data
tmpfs           3.2G  108K  3.2G   1% /run/user/125
tmpfs           3.2G   96K  3.2G   1% /run/user/1000

apt list --upgradeable:
Listing... Done
chrome-remote-desktop/stable 128.0.6613.9 amd64 [upgradable from: 127.0.6533.8]
curl/jammy-updates,jammy-security 7.81.0-1ubuntu1.17 amd64 [upgradable from: 7.81.0-1ubuntu1.16]
libcurl3-gnutls/jammy-updates,jammy-security 7.81.0-1ubuntu1.17 amd64 [upgradable from: 7.81.0-1ubuntu1.16]
libcurl4/jammy-updates,jammy-security 7.81.0-1ubuntu1.17 amd64 [upgradable from: 7.81.0-1ubuntu1.16]
python3-update-manager/jammy-updates,jammy-updates 1:22.04.20 all [upgradable from: 1:22.04.19]
update-manager-core/jammy-updates,jammy-updates 1:22.04.20 all [upgradable from: 1:22.04.19]
update-manager/jammy-updates,jammy-updates 1:22.04.20 all [upgradable from: 1:22.04.19]

uptime:
 10:17:19 up 21:47,  2 users,  load average: 1.33, 2.10, 2.25

free:
               total        used        free      shared  buff/cache   available
Mem:            31Gi       5.5Gi       2.3Gi       257Mi        23Gi        24Gi
Swap:          2.0Gi       0.0Ki       2.0Gi

sensors:
coretemp-isa-0000
Package id 0:  +53.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:        +53.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +52.0°C  (high = +100.0°C, crit = +100.0°C)
Core 2:        +52.0°C  (high = +100.0°C, crit = +100.0°C)
Core 3:        +52.0°C  (high = +100.0°C, crit = +100.0°C)
temp1:        +57.0°C  
temp1:       -263.2°C  
temp2:        +27.8°C  (crit = +119.0°C)
temp1:        +49.0°C  

smartctl:
Temperature:                        44 Celsius
Temperature Sensor 1:               44 Celsius
Temperature Sensor 2:               45 Celsius
kendall@ethel 24-08-06T17:17:19.357Z:~

Node logs

The file also includes lighthouse beacon-chain and validator logs as well as the logs from systemd. The file is named 202407051930-ethel-db-read-open-too-long.tar.gz and can be downloaded at https://drive.google.com/file/d/1q-93CzeJ8u4n15YnJ2Ny2CHE4BGVAC-t/view?usp=sharing

Platform(s)

Linux (x86)

What version/commit are you on?

kendall@ethel 24-08-06T17:01:07.246Z:~
$ reth --version
reth Version: 1.0.1-dev
Commit SHA: d599393771f9d7d137ea4abf271e1bd118184c73
Build Timestamp: 2024-07-10T21:23:32.330216026Z
Build Features: jemalloc
Build Profile: release
kendall@ethel 24-08-06T17:01:15.416Z:~

What database version are you on?

kendall@ethel 24-08-06T17:10:51.102Z:~
$ reth db --datadir /home/kendall/eth2-data/reth/holesky version
2024-08-06T17:11:27.767372Z  INFO Initialized tracing, debug log directory: /home/kendall/.cache/reth/logs/mainnet
Current database version: 2
Local database version: 2
kendall@ethel 24-08-06T17:11:27.782Z:~

Which chain / network are you on?

holesky

kendall@ethel 24-08-06T17:04:34.924Z:~
$ sudo systemctl --no-pager status reth.service 
● reth.service - reth eth2 service
     Loaded: loaded (/etc/systemd/system/reth.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2024-08-05 12:30:14 PDT; 21h ago
    Process: 1197 ExecStartPre=/bin/sleep 2 (code=exited, status=0/SUCCESS)
   Main PID: 1568 (reth)
      Tasks: 29 (limit: 38253)
     Memory: 5.6G
        CPU: 38min 13.523s
     CGroup: /system.slice/reth.service
             └─1568 /home/kendall/bin/reth node --full --chain holesky --instance 1 --max-outbound-peers 16 --max-inbound-peers 16 --http --http.addr 0.0.0.0 --ws --http.api admin,eth,…

Aug 06 10:03:37 ethel reth[1568]: 2024-08-06T17:03:37.742460Z  INFO Pruner finished tip_block_number=2080335 elapsed=62.933431ms stats=[(AccountHistory, 201, Finished), …559, Finished)]
Aug 06 10:03:52 ethel reth[1568]: 2024-08-06T17:03:52.642969Z  INFO Block added to canonical chain number=2080336 hash=0x47ad5c199403350cef97dfebb02b120fa21e8e00c371249a…sed=73.639077ms
Aug 06 10:03:52 ethel reth[1568]: 2024-08-06T17:03:52.946610Z  INFO Canonical chain committed number=2080336 hash=0x47ad5c199403350cef97dfebb02b120fa21e8e00c371249a951ec…sed=95.252944ms
Aug 06 10:04:01 ethel reth[1568]: 2024-08-06T17:04:01.488632Z  INFO Block added to fork chain number=2080336 hash=0xa55227ab479c398b29ff31caa6964ebca59ac63936a33771f9cc65720e70888b
Aug 06 10:04:03 ethel reth[1568]: 2024-08-06T17:04:03.121912Z  INFO REORG: revert canonical from database by unwinding chain blocks 2080336..=2080336
Aug 06 10:04:03 ethel reth[1568]: 2024-08-06T17:04:03.321823Z  INFO Canonical chain committed number=2080336 hash=0xa55227ab479c398b29ff31caa6964ebca59ac63936a33771f9cc6…ed=199.959083ms
Aug 06 10:04:13 ethel reth[1568]: 2024-08-06T17:04:13.552324Z  INFO Block added to canonical chain number=2080337 hash=0x7688d9e6e9696785932dacaaea2be9963f6d6cd78ec9add8…sed=53.587306ms
Aug 06 10:04:13 ethel reth[1568]: 2024-08-06T17:04:13.810532Z  INFO Canonical chain committed number=2080337 hash=0x7688d9e6e9696785932dacaaea2be9963f6d6cd78ec9add855afc…sed=54.270963ms
Aug 06 10:04:37 ethel reth[1568]: 2024-08-06T17:04:37.835783Z  INFO Block added to canonical chain number=2080338 hash=0xfc00bae7f3eb129c43af1bd02adfaeeb3af604c43f2f91bd…sed=155.22288ms
Aug 06 10:04:38 ethel reth[1568]: 2024-08-06T17:04:38.198931Z  INFO Canonical chain committed number=2080338 hash=0xfc00bae7f3eb129c43af1bd02adfaeeb3af604c43f2f91bd48d60…ed=154.227906ms
Hint: Some lines were ellipsized, use -l to show in full.
kendall@ethel 24-08-06T17:04:44.135Z:~

What type of node are you running?

Full via --full flag

What prune config do you use, if any?

AFAIK I'm not using a reth.toml

If you've built Reth from source, provide the full command you used

argo install --locked --path bin/reth --bin reth

Code of Conduct

mattsse commented 3 months ago

@shekhirin @joshieDo this looks like some kind of db thread crash

winksaville commented 3 months ago

Another thought was that there might be a locking/race problem, pure speculation on my part.

mattsse commented 3 months ago

also possible,

@joshieDo just fixed a deadlock in #10034 and #10069 which is out now in v1.0.4

winksaville commented 3 months ago

also possible,

@joshieDo just fixed a deadlock in #10034 and #10069 which is out now in v1.0.4

I'm building v1.0.4 now any reason NOT to upgrade, for instance you'd want to do an experiment with this version?

winksaville commented 3 months ago

also possible, @joshieDo just fixed a deadlock in #10034 and #10069 which is out now in v1.0.4

I'm building v1.0.4 now any reason NOT to upgrade, for instance you'd want to do an experiment with this version?

I went ahead and updated reth:

Aug 06 14:24:55 ethel reth[2211]: 2024-08-06T21:24:55.143625Z  INFO Starting reth version="1.0.4-dev (e24e4c77)"
joshieDo commented 3 months ago

Did you face this issue again?

winksaville commented 3 months ago

Did you face this issue again?

No