planetarium / libplanet

Blockchain in C#/.NET for on-chain, decentralized gaming
https://docs.libplanet.io/
GNU Lesser General Public License v2.1

Votes fail for more than three blocks in a row #2840

Closed OnedgeLee closed 1 year ago

OnedgeLee commented 1 year ago

During an internal PBFT test, a single node was absent from block indices 6016301 to 6016305, even though it was online. This seems abnormal, so a case study is needed. Internal document

OnedgeLee commented 1 year ago

The timestamp of this node seems to lag a bit behind the other nodes, so we need to be careful about timestamps. The timestamps of those block indices are as below.

"index": 6016301, "timestamp": "2023-02-19T16:15:19.850669+00:00"

"index": 6016302, "timestamp": "2023-02-19T16:15:33.112193+00:00"

"index": 6016303, "timestamp": "2023-02-19T16:15:46.165738+00:00"

"index": 6016304, "timestamp": "2023-02-19T16:15:58.661742+00:00"

"index": 6016305, "timestamp": "2023-02-19T16:16:11.199414+00:00"

OnedgeLee commented 1 year ago

The proposer of block height 6016301 was absent, so the block was committed in round 1. This node successfully participated in round 0.

OnedgeLee commented 1 year ago

The first message added to this validator was a PreVoteMsg, and it was added 2s after the round started. And... when did this validator receive the ProposalMsg?

OnedgeLee commented 1 year ago

BlockDemandAsync seems to be called repeatedly for the same block. It seems weird, so I'm investigating it.

OnedgeLee commented 1 year ago

It seems like this issue has been resolved by https://github.com/planetarium/libplanet/pull/2814. We applied this to our internal validators, but not to the external validators. I've requested the external validators to update, and will check how things go.

OnedgeLee commented 1 year ago

From block heights 6039782 ~ 6039787, consecutive absences have happened again.

OnedgeLee commented 1 year ago

As @limebell pointed out, it seems to hang on the line below. https://github.com/planetarium/libplanet/blob/7a7faba7354f74728fa8a54562e1b8035c66d2bd/Libplanet.Net/Transports/NetMQTransport.cs#L772 There is some evidence of this in the logs we've collected:

{
    "@t": "2023-02-26T09:20:06.0006592Z",
    "@mt": "Trying to connect {RequestId}.",
    "@l": "Debug",
    "RequestId": "87a7d002-d4e7-431c-9585-c7ecd82091d4",
    "Source": "NetMQTransport",
    "SourceContext": "Libplanet.Net.Transports.NetMQTransport"
}
{
    "@t": "2023-02-26T09:20:11.0019098Z",
    "@mt": "{SocketCount} sockets open for processing request {Message} {RequestId}.",
    "@l": "Debug",
    "SocketCount": 8,
    "Message": "HaveMessage",
    "RequestId": "87a7d002-d4e7-431c-9585-c7ecd82091d4",
    "Subtag": "SocketCount",
    "Tag": "Metric",
    "Source": "NetMQTransport",
    "SourceContext": "Libplanet.Net.Transports.NetMQTransport"
}
{
    "@t": "2023-02-26T09:21:53.1838345Z",
    "@mt": "Trying to connect {RequestId}.",
    "@l": "Debug",
    "RequestId": "0f6c64ed-20b7-4df2-9c8f-f2d188ffaae4",
    "Source": "NetMQTransport",
    "SourceContext": "Libplanet.Net.Transports.NetMQTransport"
}
{
    "@t": "2023-02-26T09:21:58.1873931Z",
    "@mt": "{SocketCount} sockets open for processing request {Message} {RequestId}.",
    "@l": "Debug",
    "SocketCount": 7,
    "Message": "WantMessage",
    "RequestId": "0f6c64ed-20b7-4df2-9c8f-f2d188ffaae4",
    "Subtag": "SocketCount",
    "Tag": "Metric",
    "Source": "NetMQTransport",
    "SourceContext": "Libplanet.Net.Transports.NetMQTransport"
}

When the validator seemed to be hanging, this kind of log was frequently observed. The "Trying to connect" and "sockets open for processing request" log entries should be written back to back, but there are time gaps of several seconds between them.
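
For the record, these gaps can be spotted mechanically by pairing each "Trying to connect" event with the matching "sockets open for processing request" event by RequestId and printing the elapsed time between them. A rough stand-alone sketch (not part of Libplanet; it assumes the events are available one JSON object per line in a file named netmq-debug.log, which is a hypothetical name, since the entries above are pretty-printed for readability):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Stand-alone sketch: pair "Trying to connect" events with the matching
// "sockets open for processing request" events by RequestId and print the
// elapsed time between them.  Assumes one JSON event per line in the
// (hypothetical) file "netmq-debug.log".
class RequestGapReport
{
    static void Main()
    {
        var connectStarted = new Dictionary<string, DateTimeOffset>();

        foreach (string line in File.ReadLines("netmq-debug.log"))
        {
            using JsonDocument doc = JsonDocument.Parse(line);
            JsonElement root = doc.RootElement;
            if (!root.TryGetProperty("@mt", out JsonElement template) ||
                !root.TryGetProperty("RequestId", out JsonElement requestId))
            {
                continue;
            }

            string messageTemplate = template.GetString() ?? string.Empty;
            string id = requestId.GetString() ?? string.Empty;
            DateTimeOffset timestamp = root.GetProperty("@t").GetDateTimeOffset();

            if (messageTemplate.StartsWith("Trying to connect"))
            {
                connectStarted[id] = timestamp;
            }
            else if (messageTemplate.Contains("sockets open") &&
                     connectStarted.TryGetValue(id, out DateTimeOffset startedAt))
            {
                double seconds = (timestamp - startedAt).TotalSeconds;
                Console.WriteLine($"{id}: {seconds:F1}s from \"Trying to connect\" to \"sockets open\"");
            }
        }
    }
}

On the two request IDs shown above, this reports roughly 5 seconds each.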

This may block the iteration below, and the ProcessRuntime() task can eventually get stuck. https://github.com/planetarium/libplanet/blob/7a7faba7354f74728fa8a54562e1b8035c66d2bd/Libplanet.Net/Transports/NetMQTransport.cs#L720-L736
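
To make the suspected failure mode concrete, here is a minimal stand-alone sketch (hypothetical names, not the actual ProcessRuntime() code) showing how a single blocking call inside a sequential request loop stalls every request queued behind it:

using System;
using System.Collections.Concurrent;
using System.Threading;

// Stand-alone illustration: when queued requests are handled inline by a single
// loop, one handler that silently blocks delays every request queued behind it.
class SequentialLoopStall
{
    static void HandleInline(int requestId)
    {
        if (requestId == 1)
        {
            // Stand-in for a connect call that blocks for several seconds.
            Thread.Sleep(TimeSpan.FromSeconds(5));
        }

        Console.WriteLine($"{DateTime.UtcNow:HH:mm:ss.fff} handled request {requestId}");
    }

    static void Main()
    {
        var requests = new BlockingCollection<int>();
        for (int i = 0; i < 4; i++)
        {
            requests.Add(i);
        }
        requests.CompleteAdding();

        // Request 1 blocks inside the loop, so requests 2 and 3 are not even
        // started until it finishes; with enough slow requests the whole loop
        // appears hung.
        foreach (int requestId in requests.GetConsumingEnumerable())
        {
            HandleInline(requestId);
        }
    }
}

Offloading the blocking part to a separate task, as in the change below, keeps the loop free to dequeue the remaining requests.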

So, I changed those lines to use Connect as below,

// Resolve the peer address and call Connect on a separate task so that the
// request-processing loop is not held up while the connection is established.
_logger.Debug("Trying to connect {RequestId}.", req.Id);
await Task.Factory.StartNew(
    async () => dealer.Connect(await req.Peer.ResolveNetMQAddressAsync()), cancellationToken);

and requested the node that was suffering from this issue to apply the modified image.

If the problem seems to be resolved, a PR will be opened.

OnedgeLee commented 1 year ago

Have to investigate the logs between 6079734 ~ 6079755.

During the above period, the node seems to have been shut down.

OnedgeLee commented 1 year ago

Another case: 6096220 ~ 6096222 https://www.notion.so/planetarium/6052928-6052934-3c2f73f1c8d94930a6b7e886f6e66a45?pvs=4