ledgerwatch / erigon


Websocket hang after error while notifying subscription #5608

Open · wojski opened this issue 1 year ago

wojski commented 1 year ago

System information

Erigon version: erigon/2022.09.3/linux-amd64/go1.19.1

OS & Version: Linux

Commit hash: https://hub.docker.com/layers/thorax/erigon/v2022.09.03/images/sha256-a8d14345cf3b8f089e38b660508a11d9d7c7dd659d6906372aab187fbc8ee083?context=explore

Expected behaviour

A WebSocket connection, after sending the subscription request:

{"jsonrpc":"2.0","id": 1, "method": "eth_subscribe", "params": ["newHeads"]}

should keep delivering a notification for each new block.
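For illustration, here is a minimal client sketch of that expected behaviour. It assumes the rpcdaemon's WebSocket endpoint is reachable at ws://127.0.0.1:8545 and uses the third-party github.com/gorilla/websocket client; both the address and the library choice are assumptions, not part of the report.

package main

import (
    "fmt"
    "log"

    "github.com/gorilla/websocket"
)

func main() {
    // Connect to the rpcdaemon WebSocket endpoint (address is an assumption).
    conn, _, err := websocket.DefaultDialer.Dial("ws://127.0.0.1:8545", nil)
    if err != nil {
        log.Fatalf("dial: %v", err)
    }
    defer conn.Close()

    // Send the same eth_subscribe request as in the report.
    req := `{"jsonrpc":"2.0","id":1,"method":"eth_subscribe","params":["newHeads"]}`
    if err := conn.WriteMessage(websocket.TextMessage, []byte(req)); err != nil {
        log.Fatalf("subscribe: %v", err)
    }

    // Expected: first the subscription id, then one notification per new block.
    for {
        _, msg, err := conn.ReadMessage()
        if err != nil {
            log.Fatalf("read: %v", err)
        }
        fmt.Println(string(msg))
    }
}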

Actual behaviour

After some time, no new messages arrive over the WebSocket.

In the rpcdaemon log I can find: [WARN] [10-03|21:31:53.506] error while notifying subscription err="write tcp 172.22.0.10:8545->xxxxxxxx:55508: write: connection timed out"

The HTTP RPC endpoint keeps working correctly, but the WebSocket is no longer responsive (I can connect to it, but no messages are received). It works again after restarting the container.

RPCNode:

  erigon:
    image: thorax/erigon:v2022.09.03
    container_name: erigon_eth
    command: erigon --metrics  --metrics.addr=0.0.0.0 --private.api.addr=0.0.0.0:9090 --metrics.port=6060 --pprof --pprof.addr=0.0.0.0 --pprof.port=6061
             --authrpc.jwtsecret=/home/lighthouse/jwtsecret.hex --authrpc.port=8551 --authrpc.addr="0.0.0.0" --authrpc.vhosts="*"
             --override.terminaltotaldifficulty=0xc70d808a128d7380000 --db.read.concurrency=2000 --rpc.batch.concurrency=10000
    volumes:
      - /srv/crypto/erigon/eth/data:/home/erigon/.local/share/erigon
      - /srv/crypto/erigon/eth/lighthouse:/home/lighthouse
    ports:
      - "30303:30303/tcp"
      - "30303:30303/udp"
      - "30304:30304/tcp"
      - "30304:30304/udp"
      - "9091:9090"
    restart: unless-stopped 

RPCDaemon:

  rpcdaemon:
    image: thorax/erigon:v2022.09.03
    container_name: erigon_eth_rpcdaemon_1
    command: rpcdaemon --datadir=/home/erigon/.local/share/erigon --private.api.addr=erigon:9090  --http.addr=0.0.0.0 --http.vhosts=* --http.corsdomain=* --verbosity=2 --rpc.batch.concurrency=8
             --http.api=eth,debug,net,web3,trace --ws --db.read.concurrency=2000 --rpc.batch.concurrency=60000 --ws.compression=true
    pid: service:erigon # Use erigon's PID namespace. It's required to open Erigon's DB from another process (RPCDaemon local-mode)
    volumes:
      - /srv/crypto/erigon/eth/data:/home/erigon/.local/share/erigon
    ports:
      - "8550:8545"
    restart: unless-stopped

Steps to reproduce the behaviour

Use the WebSocket subscription for some time with:

{"jsonrpc":"2.0","id": 1, "method": "eth_subscribe", "params": ["newHeads"]}

Backtrace

[WARN] [10-03|21:31:53.506] error while notifying subscription       err="write tcp 172.22.0.10:8545->xxxxxxxx:55508: write: connection timed out"

(The same issue occurs on 09.01.)

0x-3FFF commented 1 year ago

The problem should be solved by PR #5734.

sdecker747 commented 1 year ago

I believe I am still experiencing a similar issue on 2.29.0-stable-011a54a2.

After a little while, I am no longer able to get new blocks or new pending blocks via WebSocket subscription, and the message "error while notifying subscription" appears in my rpcdaemon log.

I've noticed it seems to happen specifically when I am requesting txpool_content.

adrianchiforwm commented 1 year ago

We're still seeing this in 2.30.0

Hex-Capital commented 1 year ago

I am still seeing this error as well

BlinkyStitt commented 1 year ago

Weird. I've been subscribing to new heads without issue since https://github.com/ledgerwatch/erigon/pull/5734 (for the problem reported in https://github.com/ledgerwatch/erigon/issues/5577) was merged into v2.28.1.

(I'm currently running v2.32.0)

elee1766 commented 1 year ago

It would be useful to get a pprof goroutine trace, if possible, when this happens to your node.

Run with --pprof --pprof.port=6060 --pprof.addr=0.0.0.0, then go to http://node-ip:6060/debug/pprof and download the "full goroutine stack dump".

Also, if possible, run the rpcdaemon separately so there is less noise in the file.
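For reference, a small sketch that grabs the same dump non-interactively; the "full goroutine stack dump" link on the pprof index corresponds to the goroutine profile with debug=2, and the node address and output filename below are placeholders.

package main

import (
    "io"
    "log"
    "net/http"
    "os"
)

func main() {
    // "full goroutine stack dump" on the pprof index page is the goroutine
    // profile with debug=2; "node-ip" is a placeholder for your node's address.
    resp, err := http.Get("http://node-ip:6060/debug/pprof/goroutine?debug=2")
    if err != nil {
        log.Fatalf("fetch: %v", err)
    }
    defer resp.Body.Close()

    out, err := os.Create("goroutines.txt")
    if err != nil {
        log.Fatalf("create: %v", err)
    }
    defer out.Close()

    if _, err := io.Copy(out, resp.Body); err != nil {
        log.Fatalf("write: %v", err)
    }
    log.Println("wrote goroutines.txt")
}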

koray's node seems to have locked up here: https://github.com/ledgerwatch/erigon/blob/devel/turbo/rpchelper/filters.go#L526

goroutine 82 [chan send, 29 minutes]:
github.com/ledgerwatch/erigon/turbo/rpchelper.(*Filters).OnNewEvent(0xc08c45c680, 0xc00007e120?)
    github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go:526 +0x32b
github.com/ledgerwatch/erigon/cmd/rpcdaemon/rpcservices.(*RemoteBackend).Subscribe(0xc08c443310, {0x1e22b68, 0xc0013c5a00}, 0xc025a56c00)
    github.com/ledgerwatch/erigon/cmd/rpcdaemon/rpcservices/eth_backend.go:141 +0x1d5
github.com/ledgerwatch/erigon/turbo/rpchelper.New.func1()
    github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go:84 +0xe5
created by github.com/ledgerwatch/erigon/turbo/rpchelper.New
    github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go:73 +0x33f

since OnNewEvent seems to hold a read lock on the mutex while it is blocked on the channel send, stopping new subscribers from being added.

It could be multiple issues that produce the same symptom (the filters mutex going into deadlock); perhaps it is partially fixed, but there are still other cases where it is not?

https://github.com/ledgerwatch/erigon/blob/devel/turbo/rpchelper/filters.go#L396

It also seems to be stuck here, so this trace is from a node with the patch applied:

goroutine 29199 [semacquire, 18 minutes]:
sync.runtime_SemacquireMutex(0x42a8a5?, 0x60?, 0x0?)
    runtime/sema.go:71 +0x25
sync.(*RWMutex).Lock(0x1557540?)
    sync/rwmutex.go:144 +0x71
github.com/ledgerwatch/erigon/turbo/rpchelper.(*Filters).UnsubscribePendingTxs(0xc08c45c680, {0xc084710ec0, 0x40})
    github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go:396 +0xe6
github.com/ledgerwatch/erigon/cmd/rpcdaemon/commands.(*APIImpl).NewPendingTransactions.func1()
    github.com/ledgerwatch/erigon/cmd/rpcdaemon/commands/eth_filters.go:188 +0x378
created by github.com/ledgerwatch/erigon/cmd/rpcdaemon/commands.(*APIImpl).NewPendingTransactions
    github.com/ledgerwatch/erigon/cmd/rpcdaemon/commands/eth_filters.go:174 +0xdc

There must be another deadlock condition that is not being caught.
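To make the pattern in these traces concrete, here is a deliberately simplified model, not Erigon's actual Filters code: a notifier holds the read lock for the whole fan-out while doing an unbuffered channel send, so a subscriber that stops reading blocks the notifier, and anything that needs the write lock waits behind it forever.

package main

import (
    "sync"
    "time"
)

// filters is a toy stand-in for the real type: an RWMutex guarding a map of
// unbuffered per-subscriber channels.
type filters struct {
    mu   sync.RWMutex
    subs map[int]chan int
}

// OnNewEvent mirrors the shape of the blocked goroutine above: it holds the
// read lock for the whole fan-out, and an unbuffered send blocks until the
// subscriber reads.
func (f *filters) OnNewEvent(ev int) {
    f.mu.RLock()
    defer f.mu.RUnlock()
    for _, ch := range f.subs {
        ch <- ev // blocks under RLock if the subscriber never reads
    }
}

// Subscribe needs the write lock, so it waits behind the stuck OnNewEvent.
func (f *filters) Subscribe(id int) chan int {
    f.mu.Lock()
    defer f.mu.Unlock()
    ch := make(chan int)
    f.subs[id] = ch
    return ch
}

func main() {
    f := &filters{subs: map[int]chan int{0: make(chan int)}}
    go f.OnNewEvent(1)                 // subscriber 0 never reads, so this send blocks under RLock
    time.Sleep(100 * time.Millisecond) // let OnNewEvent grab the read lock first
    f.Subscribe(1)                     // blocks on Lock(); the runtime reports "all goroutines are asleep - deadlock!"
}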

elee1766 commented 1 year ago

Update:

goroutine 44 [chan send, 97 minutes]:
github.com/ledgerwatch/erigon/turbo/rpchelper.(*Filters).OnNewTx(0xc063ebba00, 0xc07d3cb000)
    github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go:580 +0x171
github.com/ledgerwatch/erigon/turbo/rpchelper.(*Filters).subscribeToPendingTransactions(0xc000df7a00?, {0x1e22b68, 0xc000df7a00}, {0x1e2ba80, 0xc05e435df0})
    github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go:216 +0x125
github.com/ledgerwatch/erigon/turbo/rpchelper.New.func3()
    github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go:132 +0x95
created by github.com/ledgerwatch/erigon/turbo/rpchelper.New
    github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go:125 +0x48a

Another one, the same issue, this time on the other filter. It seems something is wrong with the logic for sending events to all the subscribers.
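For illustration only (this is not necessarily how Erigon addresses it), one common way to keep a stalled consumer from blocking the whole fan-out is a buffered channel per subscriber combined with a non-blocking send that drops events for that subscriber.

package main

import (
    "fmt"
    "sync"
)

// broadcaster mirrors the simplified model above, but uses buffered channels
// and a non-blocking send so one stalled subscriber cannot hold the read lock.
type broadcaster struct {
    mu   sync.RWMutex
    subs map[int]chan int
}

func (b *broadcaster) subscribe(id int) chan int {
    b.mu.Lock()
    defer b.mu.Unlock()
    ch := make(chan int, 8) // buffered, so brief stalls do not block the sender
    b.subs[id] = ch
    return ch
}

func (b *broadcaster) broadcast(ev int) {
    b.mu.RLock()
    defer b.mu.RUnlock()
    for _, ch := range b.subs {
        select {
        case ch <- ev:
        default:
            // Buffer full: drop the event for this subscriber instead of
            // blocking everyone else while holding the read lock.
        }
    }
}

func main() {
    b := &broadcaster{subs: map[int]chan int{}}
    ch := b.subscribe(1)
    for i := 0; i < 20; i++ {
        b.broadcast(i) // never blocks, even though nobody is reading ch yet
    }
    fmt.Println("events buffered for the slow subscriber:", len(ch))
}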

0x-3FFF commented 1 year ago

Thanks for the advice from @elee1766, I will keep tracking this issue.

koraykoska commented 1 year ago

Isn't this already fixed? I am not experiencing it any more.

BlinkyStitt commented 1 year ago

One of the causes is fixed, but there appears to be at least one additional cause. I experienced the deadlock on some Polygon nodes. They are running the maticnetwork fork, but that fork has this PR in it. Is there some command I should run if I see the deadlock happen again?

koraykoska commented 1 year ago

Why is this issue still open? That's worrying

elee1766 commented 1 year ago

> Why is this issue still open? That's worrying

So I fixed one cause by redoing the subscription engine, but it seems there is still something else?

@WyseNynja is it still happening? If so, yes. If you run your node with pprof enabled, the debug endpoint /debug/pprof/goroutine?debug=1 is super useful.