hyperledger / besu

An enterprise-grade Java-based, Apache 2.0 licensed Ethereum client https://wiki.hyperledger.org/display/besu
https://www.hyperledger.org/projects/besu
Apache License 2.0

P2P DNS/TCP Burst Issue on Mainnet and Goerli with Docker #4375

Open siladu opened 2 years ago

siladu commented 2 years ago

Report from Discord user KuDeTa: https://discord.com/channels/905194001349627914/1014169639258964081/1016969914805927987

Neither Goerli nor mainnet will sync reliably for me. Geth, erigon, prysm and teku are all running perfectly well in the same stack.

If I delete the whole database, it does OK for a little while, then drops. I see these DNS resolver messages sometimes too:

2022-09-07 07:24:36.608+00:00 | Timer-0 | WARN  | DNSResolver | I/O exception contacting remote DNS server when resolving OSBNAM4I3ZWCOLC4QLPNVYK4C4.all.mainnet.ethdisco.net
java.io.IOException: Timed out while trying to resolve OSBNAM4I3ZWCOLC4QLPNVYK4C4.all.mainnet.ethdisco.net./TXT, id=4930
at org.xbill.DNS.Resolver.send(Resolver.java:170)
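
To put a number on how often these warnings occur, a small script could count `DNSResolver` warnings per minute in a saved Besu log. This is only a sketch: the regular expression is inferred from the single sample line above, and the function name `dns_timeouts_per_minute` is made up for illustration.

```python
import re
from collections import Counter

# Layout inferred from the one DNSResolver warning quoted above (assumption):
# "<timestamp> | <thread> | WARN | DNSResolver | ... when resolving <name>"
DNS_WARN = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d{3}\S*\s*\|"
    r".*\|\s*WARN\s*\|\s*DNSResolver\s*\|.*when resolving (?P<name>\S+)"
)


def dns_timeouts_per_minute(lines):
    """Count DNSResolver warning lines, grouped by minute of the timestamp."""
    counts = Counter()
    for line in lines:
        match = DNS_WARN.match(line)
        if match:
            counts[match.group("minute")] += 1
    return counts


sample = [
    "2022-09-07 07:24:36.608+00:00 | Timer-0 | WARN  | DNSResolver | "
    "I/O exception contacting remote DNS server when resolving "
    "OSBNAM4I3ZWCOLC4QLPNVYK4C4.all.mainnet.ethdisco.net",
    # Stack-trace lines do not start with a timestamp and are skipped.
    "java.io.IOException: Timed out while trying to resolve "
    "OSBNAM4I3ZWCOLC4QLPNVYK4C4.all.mainnet.ethdisco.net./TXT, id=4930",
]

for minute, count in sorted(dns_timeouts_per_minute(sample).items()):
    print(minute, count)
```

Note that this counts only the lookups that timed out, not the total number of DNS queries Besu sent, so it gives a lower bound on the query rate.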

I saw someone else post that their Pi-hole (DNS server) was seeing thousands of DNS requests a second; mine is too. I was kind of wondering if there is an issue with docker and the volume of DNS traffic. I haven't checked exactly what is going on with my router (Unifi gear), but this looks like it is actually DDoSing my network stack. All I can say right now is that every time I start this, it seems to break my DNS server (Pi-hole) and bring my router to its knees.

Screenshot_2022-09-07_at_08 26 08 Screenshot_2022-09-07_at_08 35 37

...and it's only besu that has DNS bursts like this?

It's certainly only besu that seems to burst so hard and then complain it can't find peers. But my Pi-hole is set to allow 2000/min and even that is being hit. And my router is a UXG-PRO, prosumer (commercial) grade. But clearly this isn't a widespread issue, so could docker be the real problem somewhere here? I will try that Xdns stuff and try bypassing the local DNS server to see if either improves it.

Neither Xdns nor getting rid of the local DNS server helped. The router continues to buckle under the weight of traffic as soon as the service is started. It needs some proper investigation.

Sep  7 11:32:14 UXGPro user.info ubios-udapi-server: wan-failover-interfaces: wf-interface-ppp0 (my ip) is down [_DD___](no dns)

It looks like that TCP_tw traffic is the issue. I'm at the limit of my understanding right there, but no other node software causes that kind of burst.

Screenshot_2022-09-07_at_13 22 16
Total: 355
TCP:   1264 (estab 6, closed 1221, orphaned 18, timewait 599)

Transport Total     IP        IPv6
RAW      1         0         1
UDP      35        28        7
TCP      43        31        12
INET      79        59        20
FRAG      0         0         0

(on starting besu)
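
The socket summary above looks like the output of `ss -s`. To follow the `timewait` figure over time, for example before and after starting besu, the `TCP:` line could be parsed with something like the sketch below. The function name `parse_tcp_summary` is hypothetical, and the line format is assumed to match the sample above.

```python
import re


def parse_tcp_summary(line):
    """Parse a 'TCP:' summary line into a dict of integer counters."""
    match = re.match(r"TCP:\s+(\d+)\s+\((.*)\)", line.strip())
    if match is None:
        raise ValueError(f"not a TCP summary line: {line!r}")
    stats = {"total": int(match.group(1))}
    for field in match.group(2).split(","):
        name, value = field.split()
        # Some versions may print values such as "599/0" (assumption);
        # keep only the first number in that case.
        stats[name] = int(value.split("/")[0])
    return stats


# The line reported above, captured when besu was started.
stats = parse_tcp_summary(
    "TCP:   1264 (estab 6, closed 1221, orphaned 18, timewait 599)"
)
share = stats["timewait"] / stats["total"]
print(f"timewait={stats['timewait']} of {stats['total']} ({share:.0%})")
```

Running this in a loop against fresh `ss -s` output once a second would show how quickly the time-wait sockets pile up after startup.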

Some debug logs attached

mainnet-debug-logs.txt

There are a lot of these log entries, which may be related:

2022-09-08 16:17:29.203+00:00 | nioEventLoopGroup-3-1 | DEBUG | AbstractPeerConnection | Terminating connection 1376930571, reason 0x01 TCP_SUBSYSTEM_ERROR
2022-09-08 16:17:29.203+00:00 | nioEventLoopGroup-3-1 | DEBUG | AbstractPeerConnection | Terminating connection 1376930571, reason 0x10 SUBPROTOCOL_TRIGGERED

and

2022-09-08 16:19:50.730+00:00 | nioEventLoopGroup-3-3 | DEBUG | AbstractHandshakeHandler | Handshake error:
java.io.IOException: Connection reset by peer
at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276)
at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:233)
at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:223)
at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:356)
at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:258)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:357)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
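
To see which disconnect reasons dominate in the attached debug logs, the `Terminating connection` lines above could be tallied by reason code. A rough sketch, assuming every such line follows the format of the two samples quoted above (the helper name `disconnect_reasons` is made up):

```python
import re
from collections import Counter

# Format taken from the two AbstractPeerConnection lines quoted above.
REASON = re.compile(
    r"Terminating connection (?P<conn>-?\d+), "
    r"reason (?P<code>0x[0-9a-fA-F]+) (?P<name>\w+)"
)


def disconnect_reasons(lines):
    """Count 'Terminating connection' log lines per (code, name) reason."""
    counts = Counter()
    for line in lines:
        match = REASON.search(line)
        if match:
            counts[(match.group("code"), match.group("name"))] += 1
    return counts


sample = [
    "2022-09-08 16:17:29.203+00:00 | nioEventLoopGroup-3-1 | DEBUG | "
    "AbstractPeerConnection | Terminating connection 1376930571, "
    "reason 0x01 TCP_SUBSYSTEM_ERROR",
    "2022-09-08 16:17:29.203+00:00 | nioEventLoopGroup-3-1 | DEBUG | "
    "AbstractPeerConnection | Terminating connection 1376930571, "
    "reason 0x10 SUBPROTOCOL_TRIGGERED",
]

for (code, name), count in disconnect_reasons(sample).most_common():
    print(code, name, count)
```

To run it over a full log, replace `sample` with an open file handle, for example `open("mainnet-debug-logs.txt")`.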

UPGRADED TO 22.7.2

OK, testing this identical config on goerli, using a completely separate box, but also using docker within the same networking stack: results are identical to mainnet: Massive TCP spike. Router keels over, no peers, etc.

besugoerli.txt (besu mainnet log file size too big for github, see: https://discord.com/channels/905194001349627914/1014169639258964081/1017548263852888195)

And if you want some flavour of what I go through every time I start besu, here is a good screenshot :S

Screenshot_2022-09-08_at_22 36 30

Is your router acting as your DNS provider? Maybe you could change that to something else if the router is struggling?

I've tried turning off my Pi-hole; it made no difference. Whatever is going on here, it seems like it should be escalated as much as you can. I've never come across a piece of software that can bork an entire network. I don't think I can blame the UXG-PRO (https://store.ui.com/products/unifi-next-generation-gateway-professional), which should be capable of handling hundreds of concurrent users and many servers, but I'm happy to try and help isolate the cause of this. I guess docker is the prime suspect atm.

Did you try lowering max-peers?

Yeah, I went back down to the defaults at various points.

Hardware

hexacore (12 thread) NUC (i7-FNH):

Screenshot_2022-09-08_at_22 00 02

No noticeable blips from node exporter at any of the times I've started besu.

Docker configuration

mainnet

services:
  el-besu:
    stdin_open: true
    tty: true
    container_name: el-besu
    image: hyperledger/besu:latest
    volumes:
      - /archive/el-besu:/var/lib/besu
      - /home/ethereum/secrets:/secrets
    restart: unless-stopped
    ports: # add 8545:8545 for RPC
      - "50303:50303"
      - "50303:50303/udp"
    networks:
      - ethereum
    command: >
      --data-path=/var/lib/besu
      --rpc-http-enabled
      --rpc-http-api="WEB3,ETH,NET"
      --rpc-http-host="0.0.0.0"
      --rpc-http-port=8545
      --rpc-http-cors-origins=*
      --rpc-http-max-active-connections=65536
      --rpc-ws-enabled
      --rpc-ws-api="WEB3,ETH,NET,ADMIN"
      --rpc-ws-host="0.0.0.0"
      --rpc-ws-port=8546
      --p2p-port=50303
      --max-peers=40
      --fast-sync-min-peers=5
      --host-allowlist=*
      --engine-host-allowlist=*
      --engine-jwt-secret=/secrets/jwt
      --engine-rpc-port=8551
      --data-storage-format=BONSAI
      --sync-mode=X_SNAP
      --nat-method=DOCKER
      --p2p-host=<myip>
      --p2p-interface=0.0.0.0
    stop_grace_period: 10m

testnet

version: "3.5"

services:
  el-besu:
    stdin_open: true
    tty: true
    container_name: el-besu
    image: hyperledger/besu:latest
    volumes:
      - /home/el-besu:/var/lib/besu
      - /home/ethereum/secrets:/secrets
    restart: unless-stopped
    ports: # add 8545:8545 for RPC
      - "50304:50304"
      - "50304:50304/udp"
    networks:
      - ethereum
    command: >
      --network=goerli
      --data-path=/var/lib/besu
      --rpc-http-enabled
      --rpc-http-api="WEB3,ETH,NET,ADMIN"
      --rpc-http-host="0.0.0.0"
      --rpc-http-port=8545
      --rpc-http-cors-origins=*
      --rpc-ws-enabled
      --rpc-ws-api="WEB3,ETH,NET,ADMIN"
      --rpc-ws-host="0.0.0.0"
      --rpc-ws-port=8546
      --p2p-port=50304
      --max-peers=40
      --fast-sync-min-peers=5
      --host-allowlist=*
      --engine-host-allowlist=*
      --engine-jwt-secret=/secrets/jwt
      --engine-rpc-port=8551
      --data-storage-format=BONSAI
      --sync-mode=X_SNAP
      --logging=DEBUG
      --nat-method=DOCKER
      --p2p-host=<ip>
      --p2p-interface=0.0.0.0
      --Xdns-enabled=true
    stop_grace_period: 10m

networks:
  ethereum:
    name: ethereum
    driver: bridge

non-fungible-nelson commented 1 year ago

Any status?