Closed by OnedgeLee
This node's timestamps seem to lag a bit behind those of other nodes, so timestamps need to be treated with care. The timestamps of the block indices in question are as follows:
"index": 6016301, "timestamp": "2023-02-19T16:15:19.850669+00:00"
"index": 6016302, "timestamp": "2023-02-19T16:15:33.112193+00:00"
"index": 6016303, "timestamp": "2023-02-19T16:15:46.165738+00:00"
"index": 6016304, "timestamp": "2023-02-19T16:15:58.661742+00:00"
"index": 6016305, "timestamp": "2023-02-19T16:16:11.199414+00:00"
The proposer of block height 6016301 was absent, so the block was committed on round 1. This node participated in round 0 successfully. The first message added to this validator was a PreVoteMsg, and it was added about 2 seconds after the round started. So... when did this validator receive the ProposalMsg?
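One way to answer that is to scan this validator's structured logs (Serilog compact JSON, as in the samples further down) for entries mentioning ProposalMsg and print their timestamps. A minimal sketch under that assumption; the file path and the keyword matching are illustrative, not libplanet APIs:
// Standalone sketch: print timestamps of log events mentioning a keyword.
// Assumes one compact-JSON log event per line with an "@t" timestamp field.
using System;
using System.IO;
using System.Text.Json;

string path = "validator-logs.clef.json"; // hypothetical log file path
string keyword = "ProposalMsg";

foreach (string line in File.ReadLines(path))
{
    if (!line.Contains(keyword, StringComparison.Ordinal))
    {
        continue;
    }

    using JsonDocument doc = JsonDocument.Parse(line);
    if (doc.RootElement.TryGetProperty("@t", out JsonElement t))
    {
        Console.WriteLine($"{t.GetString()}  {line}");
    }
}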
BlockDemandAsync seems to be called repeatedly for the same block. This looks odd, so it is being investigated.
It seems this issue was resolved by https://github.com/planetarium/libplanet/pull/2814. We applied the fix to our internal validators, but not to the external validators. The external validators have been asked to update, and we will see how things go.
From block heights 6039782 ~ 6039787, consecutive absences have happened again.
As @limebell pointed out, it seems to hang on the line below. https://github.com/planetarium/libplanet/blob/7a7faba7354f74728fa8a54562e1b8035c66d2bd/Libplanet.Net/Transports/NetMQTransport.cs#L772 There was some evidence of this in the logs we collected.
{
"@t": "2023-02-26T09:20:06.0006592Z",
"@mt": "Trying to connect {RequestId}.",
"@l": "Debug",
"RequestId": "87a7d002-d4e7-431c-9585-c7ecd82091d4",
"Source": "NetMQTransport",
"SourceContext": "Libplanet.Net.Transports.NetMQTransport"
}
{
"@t": "2023-02-26T09:20:11.0019098Z",
"@mt": "{SocketCount} sockets open for processing request {Message} {RequestId}.",
"@l": "Debug",
"SocketCount": 8,
"Message": "HaveMessage",
"RequestId": "87a7d002-d4e7-431c-9585-c7ecd82091d4",
"Subtag": "SocketCount",
"Tag": "Metric",
"Source": "NetMQTransport",
"SourceContext": "Libplanet.Net.Transports.NetMQTransport"
}
{
"@t": "2023-02-26T09:21:53.1838345Z",
"@mt": "Trying to connect {RequestId}.",
"@l": "Debug",
"RequestId": "0f6c64ed-20b7-4df2-9c8f-f2d188ffaae4",
"Source": "NetMQTransport",
"SourceContext": "Libplanet.Net.Transports.NetMQTransport"
}
{
"@t": "2023-02-26T09:21:58.1873931Z",
"@mt": "{SocketCount} sockets open for processing request {Message} {RequestId}.",
"@l": "Debug",
"SocketCount": 7,
"Message": "WantMessage",
"RequestId": "0f6c64ed-20b7-4df2-9c8f-f2d188ffaae4",
"Subtag": "SocketCount",
"Tag": "Metric",
"Source": "NetMQTransport",
"SourceContext": "Libplanet.Net.Transports.NetMQTransport"
}
When the validator appeared to hang, these kinds of logs were observed frequently. The Trying to connect and sockets open for processing request logs should be written back to back, but there are gaps of several seconds between them.
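To quantify those gaps, the two log lines can be paired by RequestId and the elapsed time between them computed per request. A minimal sketch under the same compact-JSON assumptions as above (the file path is hypothetical); for the samples shown here the gap comes out to about 5 seconds.
// Standalone sketch: measure the delay between "Trying to connect" and
// "sockets open for processing request" events sharing a RequestId.
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text.Json;

var connectStarted = new Dictionary<string, DateTimeOffset>();

foreach (string line in File.ReadLines("validator-logs.clef.json")) // hypothetical path
{
    using JsonDocument doc = JsonDocument.Parse(line);
    JsonElement root = doc.RootElement;
    if (!root.TryGetProperty("@mt", out JsonElement mt) ||
        !root.TryGetProperty("RequestId", out JsonElement reqId) ||
        !root.TryGetProperty("@t", out JsonElement t))
    {
        continue;
    }

    string template = mt.GetString() ?? string.Empty;
    string id = reqId.GetString() ?? string.Empty;
    var time = DateTimeOffset.Parse(t.GetString()!, CultureInfo.InvariantCulture);

    if (template.StartsWith("Trying to connect", StringComparison.Ordinal))
    {
        connectStarted[id] = time;
    }
    else if (template.Contains("sockets open for processing request") &&
             connectStarted.TryGetValue(id, out DateTimeOffset started))
    {
        Console.WriteLine($"{id}: {(time - started).TotalSeconds:F1}s");
        connectStarted.Remove(id);
    }
}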
Such a stall can block the iteration below, and the ProcessRuntime() task can eventually get stuck.
https://github.com/planetarium/libplanet/blob/7a7faba7354f74728fa8a54562e1b8035c66d2bd/Libplanet.Net/Transports/NetMQTransport.cs#L720-L736
So those lines were changed to wrap Connect as below,
_logger.Debug("Trying to connect {RequestId}.", req.Id);
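// Offload address resolution and the Connect() call onto a separate task so that a
// hang there does not block the request-processing loop.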
await Task.Factory.StartNew(
async () => dealer.Connect(await req.Peer.ResolveNetMQAddressAsync()), cancellationToken);
and the node that was suffering from this issue was asked to apply the modified image. If the problem appears to be resolved, a PR will be opened.
The logs between block heights 6079734 ~ 6079755 have to be investigated. During that period, the node seems to have been shut down.
Another case: 6096220 ~ 6096222
https://www.notion.so/planetarium/6052928-6052934-3c2f73f1c8d94930a6b7e886f6e66a45?pvs=4
During an internal PBFT test, a single node was absent for block indices 6016301 to 6016305 even though it was online. This seems abnormal, so a case study is needed. Internal document