tronprotocol / tips

TRON Improvement Proposals
229 stars 203 forks source link

Issue: disconnect with some peer because of not sync for a long time in p2p #427

Closed 317787106 closed 1 year ago

317787106 commented 2 years ago

Simple Summary

We introduce a solution to reduce the log error "Peer xxx not sync for a long time" when fullnode syncs with peer.

Abstract

When one fullnode synchronizes with other peer, there is a blockLock in TronNetDelegate. There are three threads to apply for this lock. sometimes, when one peer canot get this blockLock in ChainInventoryMsgHandler, this peer may be disconnected by thread PeerStatusCheck.check because it finds that this peer's blockBothHaveUpdateTime is blockUpdateTimeout seconds ago than system time. We Introduce a solution to decrease the time cost of acquire blockLock in fullnode that synchronizes with other peer.

Motivation

Sometimes we see the error in tron.log:

00:03:22.898 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /206.221.176.113 not sync for a long time.
00:12:14.924 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /104.194.8.221 not sync for a long time.
00:12:42.926 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /5.9.101.190 not sync for a long time.
00:16:40.980 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /109.238.14.182 not sync for a long time.
00:16:40.981 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /5.9.101.190 not sync for a long time.
00:55:55.277 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /47.104.153.187 not sync for a long time.
00:56:01.277 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /3.142.169.23 not sync for a long time.
01:07:29.314 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /3.142.169.23 not sync for a long time.
01:09:49.333 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /13.229.124.42 not sync for a long time.
01:17:39.358 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /103.85.253.22 not sync for a long time.
01:17:39.358 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /183.131.184.130 not sync for a long time.
02:05:57.891 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /47.252.13.200 not sync for a long time.
02:06:09.892 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /18.224.252.203 not sync for a long time.
02:06:09.892 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /51.81.109.193 not sync for a long time.
02:06:53.895 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /47.252.6.86 not sync for a long time.
02:06:53.895 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /10.60.100.237 not sync for a long time.
02:06:53.895 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /10.60.100.33 not sync for a long time.
02:06:53.895 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /209.126.87.106 not sync for a long time.
02:07:27.897 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /18.232.41.41 not sync for a long time.
02:11:29.941 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /206.221.176.113 not sync for a long time.
02:11:29.941 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /13.214.179.230 not sync for a long time.
03:22:54.472 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /51.81.184.18 not sync for a long time.
03:22:58.472 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /13.38.128.252 not sync for a long time.
03:23:00.473 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /51.81.67.97 not sync for a long time.
13:26:09.850 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /47.90.250.85 not sync for a long time.
13:26:13.851 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /10.60.100.83 not sync for a long time.
13:26:15.851 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /10.60.100.33 not sync for a long time.

these peers are disconnected because unexpect reason because "not sync for a long time" by thread PeerStatusCheck.check.

public void statusCheck() {

  long now = System.currentTimeMillis();

  tronNetDelegate.getActivePeer().forEach(peer -> {

    boolean isDisconnected = false;

    if (peer.isNeedSyncFromPeer()
        && peer.getBlockBothHaveUpdateTime() < now - blockUpdateTimeout) {
      logger.warn("Peer {} not sync for a long time.", peer.getInetAddress());
      isDisconnected = true;
    }
  ...
  }
}

So why this peer has not been sync for a long time? It's the same question as why PeerConnection.blockBothHaveUpdateTime has not been updated for long time. blockBothHaveUpdateTime is update in ChainInventoryMsgHandler.processMessage

synchronized (tronNetDelegate.getBlockLock()) {
  while (!peer.getSyncBlockToFetch().isEmpty() && tronNetDelegate
      .containBlock(peer.getSyncBlockToFetch().peek())) {
    try {
      BlockId blockId = peer.getSyncBlockToFetch().pop();
      peer.setBlockBothHave(blockId);
      logger.info("Block {} from {} is processed",
              blockId.getString(), peer.getNode().getHost());
    } catch (NoSuchElementException e) {
      logger.warn("Process ChainInventoryMessage failed, peer {}, isDisconnect:{}",
              peer.getNode().getHost(), peer.isDisconnect());
      return;
    }
  }

we can see if this peer cannot get the blockLock, blockBothHaveUpdateTime cannot be update. blockLock is also used in SyncService.handleSyncBlock

synchronized (tronNetDelegate.getBlockLock()) {
  blockWaitToProcess.forEach((msg, peerConnection) -> {
    ...
    final boolean[] isFound = {false};
    tronNetDelegate.getActivePeer().stream()
        .filter(peer -> msg.getBlockId().equals(peer.getSyncBlockToFetch().peek()))
        .forEach(peer -> {
          peer.getSyncBlockToFetch().pop();
          peer.getSyncBlockInProcess().add(msg.getBlockId());
          isFound[0] = true;
        });
    if (isFound[0]) {
      blockWaitToProcess.remove(msg);
      isProcessed[0] = true;
      processSyncBlock(msg.getBlockCapsule());
    }
  });
}

Suppose we get many blocks from one peer, as processSyncBlock need to write disk, this for cycle may cost a long time.

Implementation

Our solution is to swap the order of getBlockLock and the for cycle:

blockWaitToProcess.forEach((msg, peerConnection) -> {
  synchronized (tronNetDelegate.getBlockLock()) {
    ...
    final boolean[] isFound = {false};
    tronNetDelegate.getActivePeer().stream()
        .filter(peer -> msg.getBlockId().equals(peer.getSyncBlockToFetch().peek()))
        .forEach(peer -> {
          peer.getSyncBlockToFetch().pop();
          peer.getSyncBlockInProcess().add(msg.getBlockId());
          isFound[0] = true;
        });
    if (isFound[0]) {
      blockWaitToProcess.remove(msg);
      isProcessed[0] = true;
      processSyncBlock(msg.getBlockCapsule());
    }
  }
});

This can reduce the total time of lock blockLock in SyncService.handleSyncBlock, so other thread such as ChainInventoryMsgHandler.processMessage can get the lock more esaily.