planetarium / libplanet

Blockchain in C#/.NET for on-chain, decentralized gaming
https://docs.libplanet.io/
GNU Lesser General Public License v2.1
506 stars 139 forks source link

`BlockCandidateProcess` fails during preloading #3253

Closed OnedgeLee closed 11 months ago

OnedgeLee commented 1 year ago

During preloading, BlockCandidateProcess() fails to append blocks. Log is attached on the below :

[2023-06-29T04:59:05.6602355Z Error] [Swarm] Unexpected exception occured during PullBlocksAsync() 
System.Collections.Generic.KeyNotFoundException: The given hash[64b24ad495996ea71a26e8a632ff40fda62cd58384f8a9b7496d09898d15678b] was not found in this chain.
   at Libplanet.Blockchain.BlockChain.get_Item(BlockHash& blockHash) in /app/Lib9c/.Libplanet/Libplanet/Blockchain/BlockChain.cs:line 327
   at Libplanet.Blockchain.BlockChain.GetBlockCommit(BlockHash blockHash) in /app/Lib9c/.Libplanet/Libplanet/Blockchain/BlockChain.cs:line 953
   at Libplanet.Net.Swarm.PullBlocksAsync(List`1 peersWithBlockExcerpt, Int32 chunkSize, IProgress`1 progress, CancellationToken cancellationToken) in /app/Lib9c/.Libplanet/Libplanet.Net/Swarm.BlockSync.cs:line 210
[2023-06-29T04:59:05.6614931Z Debug] [Swarm] "PullBlocksAsync"() has finished successfully 
[2023-06-29T04:59:05.6615820Z ] [Swarm] Consuming branch with root #7097033 "ba427cabafe21e2bab69510768013ebe656f36d149ad93e2d227c848f68b690f" and tip #7098025 "0f2effd84cea5aa6a3c6c34de3926488594a44f9e117f82d2eaf578d9c434afb" 
[2023-06-29T04:59:05.6616992Z Debug] [Swarm] "BlockCandidateProcess"() starts to append; current tip is #7097027 "ba24655bb107259c2a1bf4fb71a2dd467f756ef806206d293fca7c6fa212f91c" 
[2023-06-29T04:59:05.6918979Z Error] [Swarm] "BlockCandidateProcess"() has failed to append blocks 
System.Collections.Generic.KeyNotFoundException: The given hash[64b24ad495996ea71a26e8a632ff40fda62cd58384f8a9b7496d09898d15678b] was not found in this chain.
   at Libplanet.Blockchain.BlockChain.get_Item(BlockHash& blockHash) in /app/Lib9c/.Libplanet/Libplanet/Blockchain/BlockChain.cs:line 327
   at Libplanet.Net.Swarm.FindBranchpoint(Block oldTip, Block newTip, List`1 newBlocks) in /app/Lib9c/.Libplanet/Libplanet.Net/Swarm.BlockCandidate.cs:line 272
   at Libplanet.Net.Swarm.AppendPreviousBlocks(BlockChain blockChain, Branch candidate, Boolean render, IProgress`1 progress) in /app/Lib9c/.Libplanet/Libplanet.Net/Swarm.BlockCandidate.cs:line 138
   at Libplanet.Net.Swarm.BlockCandidateProcess(Branch candidate, Boolean render, IProgress`1 progress, CancellationToken cancellationToken) in /app/Lib9c/.Libplanet/Libplanet.Net/Swarm.BlockCandidate.cs:line 123

Error has been thrown on FindBranchpoint(), code link

private Block FindBranchpoint(Block oldTip, Block newTip, List<Block> newBlocks)
{
    while (oldTip is Block && oldTip.Index > newTip.Index &&
           oldTip.PreviousHash is { } aPrev)
    {
        oldTip = BlockChain[aPrev];
    }

    while (newTip is Block && newTip.Index > oldTip.Index &&
           newTip.PreviousHash is { } bPrev)
    {
        try
        {
            newTip = newBlocks.Single(x => x.Hash.Equals(bPrev)); <--- Fails
        }
        catch (ArgumentNullException)
        {
            newTip = BlockChain[bPrev];
        }
        catch (InvalidOperationException)
        {
            newTip = BlockChain[bPrev]; <--- Fails, Thrown
        }
    }
...

Parameter newBlocks have to contain blocks between oldTip and newTip but it seems not. It means, that Branch candidate of BlockCandidateProcess() does not contain full blocks starts from current tip. This may caused by failure on PullBlocksAsync().

On investigation, below line on PullBlocksAsync() seems to cause NullReferenceError. code link

BlockHash? previousHash = blocks.First().Item1.PreviousHash;
OnedgeLee commented 1 year ago

Since IEnumerable.First will cause ArgumentException on null source, InvalidOperationException on empty source, so seems okay. Every block has nullable PreviousHash, so seems okay for this, too. The most probable case is that the element of blocks is not a type of Tuple<Block, ..., ...>.

OnedgeLee commented 1 year ago

Seems like this problem has been caused from generating empty Branch after full iteration has been ended. When I see the log, before above problem occurs, GetBlocksAsync() tries to fetch from 0 hashes, and right before it, it fetches smaller chunk than chunk size we set.

[2023-06-29T04:59:20.5288099Z ] [Swarm] Trying to download 93 block(s) from "0xa6e2052C389dd76E3B72356a454A2c40E91A9BC9.Unspecified/nld-nc-1.ninodes.com:31234."... 
[2023-06-29T04:59:21.9151275Z Debug] [Swarm] Received replies from "0xa6e2052C389dd76E3B72356a454A2c40E91A9BC9.Unspecified/nld-nc-1.ninodes.com:31234." 
[2023-06-29T04:59:21.9151682Z ] [Swarm] Received 186 blocks from "0xa6e2052C389dd76E3B72356a454A2c40E91A9BC9.Unspecified/nld-nc-1.ninodes.com:31234." 
[2023-06-29T04:59:22.3407983Z ] [Swarm] Downloaded 93 block(s) from "0xa6e2052C389dd76E3B72356a454A2c40E91A9BC9.Unspecified/nld-nc-1.ninodes.com:31234." 
[2023-06-29T04:59:22.4370240Z ] [Swarm] Trying to download 0 block(s) from "0x117297F1CF9376Cd2472F685392220ED960D97Db.Unspecified/backup-nc-1.ninodes.com:31234."... 
[2023-06-29T04:59:22.4372711Z Error] [Swarm] Unexpected exception occured during PullBlocksAsync() 
System.NullReferenceException: Object reference not set to an instance of an object.
   at Libplanet.Net.Swarm.PullBlocksAsync(List`1 peersWithBlockExcerpt, Int32 chunkSize, IProgress`1 progress, CancellationToken cancellationToken) in /app/Lib9c/.Libplanet/Libplanet.Net/Swarm.BlockSync.cs:line 210

But this won't cause above error, since this empty chunk won't be queued since iteration will be exited immediately.

OnedgeLee commented 1 year ago

Previous assumption was wrong. Exception happens on GetBlockCommit(). link This happens because store have branchpoint block, but tip of blockchain is not the branchpoint. Swarm.PullBlocksAsync() get branchpoint by store, so it won't make any error. but when it tries to get BlockCommit, it fetch block from blockchain, and fails, since chain does not have that block.

OnedgeLee commented 1 year ago

When I query with branchpoint that failed to fetch, failed node returns correct block. But when I query tip, it's previous block of branchpoint. The former get the block from store, and the latter get the block from blockchain. So, the problem is that the block is on the store, but not on the blockchain.

{
  "data": {
    "blockQuery": {
      "block": {
        "index": 7146015,
        "previousBlock": {
          "hash": "73662156867acc4d4c04c0cc095e8e59dc79e4aab6b70049eaf3a169ee7c4a53"
        }
      }
    }
  }
}
{
  "data": {
    "blockQuery": {
      "blocks": [
        {
          "hash": "73662156867acc4d4c04c0cc095e8e59dc79e4aab6b70049eaf3a169ee7c4a53",
          "index": 7146014
        }
      ]
    }
  }
}
[04:09:44 DBG] [Swarm] 1101040927/562539105: Sending a GetBlockHashesMsg message with locator [73662156867acc4d4c04c0cc095e8e59dc79e4aab6b70049eaf3a169ee7c4a53, ...] (stop: null)...
[04:09:44 DBG] [Swarm] 1101040927/562539105: Received a BlockHashesMsg message with an offset index 7146014 (total 500 hashes)
[04:09:44 DBG] [Swarm] 486290445/1195604464: Sending a GetBlockHashesMsg message with locator [00be11ff9ca41cc3793953579c7cd5ad3ec7dd037a27803ff09fe70d1d69ff02, ...] (stop: null)...
[04:09:44 DBG] [Swarm] 486290445/1195604464: Received a BlockHashesMsg message with an offset index 7146513 (total 339 hashes)
[04:09:44 DBG] 8bcdf19947231be4337c211045e26be094364b4b3c11d515769084d27c78c650,4d4aa5112f606036c5cab0c14b5efad9a75ee5df02ce5505e6f8d92be5ad38ee,3380b0c20e294775aff44fc33f32ffd81ccb412a03ee7d805fc8c33f81290269
[04:09:44 DBG] Downloaded block #7146016 8bcdf19947231be4337c211045e26be094364b4b3c11d515769084d27c78c650 from 0x7655122899Ccd2cC87B38A35D7158c9516504119.Unspecified/9c-internal-validator-2.nine-chronicles.com:31234.
[04:09:44 DBG] [Swarm] Got #7146016 8bcdf19947231be4337c211045e26be094364b4b3c11d515769084d27c78c650 from 0x7655122899Ccd2cC87B38A35D7158c9516504119.Unspecified/9c-internal-validator-2.nine-chronicles.com:31234.
[04:09:44 DBG] Completed block #7146016 8bcdf19947231be4337c211045e26be094364b4b3c11d515769084d27c78c650 from {"block_hash":"8bcdf19947231be4337c211045e26be094364b4b3c11d515769084d27c78c650","height":7146016,"round":0,"votes":[{"validator_public_key":"021ebc027706c9b7fdb03bc212657241197c4f7d4f122cbb66f347bee3bfd39551","flag":2,"timestamp":"07/03/2023 01:56:30 \u002B00:00"},{"validator_public_key":"0260972c353ba9b1e630d7b488ef0e9250a86286fbc8541e1bcca82f1a50cf8012","flag":2,"timestamp":"07/03/2023 01:56:30 \u002B00:00"},{"validator_public_key":"029a46496641787a06db32d83dfea3b50ebb681fae9bbf4e60b41ee91d4024965b","flag":2,"timestamp":"07/03/2023 01:56:30 \u002B00:00"},{"validator_public_key":"03310066ad080de4bea6042163cade4ab777a1ccb45abfada0973352b34ca0b497","flag":0,"timestamp":"07/03/2023 01:56:37 \u002B00:00"}]}

We have to download from #7146015, but for now, #7146016

#7146015 was not demanded by below link

if (blockCompletion.Demand(hash))
{
    totalBlocksToDownload++;
}

Seems like this block hash has been omitted on below link

var demandBlockHashes = GetDemandBlockHashes(
    BlockChain,
    peersWithBlockExcerpt,
    chunkSize,
    progress,
    cancellationToken
).WithCancellation(cancellationToken);

But according to its log, it's already downloaded. So I suspect it's downloaded, but discarded by some reason.

OnedgeLee commented 11 months ago

I've received a tip.

  1. Block height was 7385495.
  2. Applied truncation to 7385394.
  3. Preloading tries to append block from 7385495.
    [10:10:59 ERR] [Swarm] Unexpected exception occured during PullBlocksAsync()
    System.Collections.Generic.KeyNotFoundException: The given hash[ead9480e2ac1ec194e3650c6333b94259f325d35df25ca76e0e55cb84ed3045a] was not found in this chain.
    at Libplanet.Blockchain.BlockChain.get_Item(BlockHash& blockHash) in /Users/yang/projects/headless/Lib9c/.Libplanet/Libplanet/Blockchain/BlockChain.cs:line 313
    at Libplanet.Blockchain.BlockChain.GetBlockCommit(BlockHash blockHash) in /Users/yang/projects/headless/Lib9c/.Libplanet/Libplanet/Blockchain/BlockChain.cs:line 953
    at Libplanet.Net.Swarm.PullBlocksAsync(List`1 peersWithBlockExcerpt, Int32 chunkSize, IProgress`1 progress, CancellationToken cancellationToken) in /Users/yang/projects/headless/Lib9c/.Libplanet/Libplanet.Net/Swarm.BlockSync.cs:line 210
    [10:10:59 INF] [BlockCandidateTable] There are 1 branches in BlockCandidateTable
    [10:10:59 DBG] [Swarm] PullBlocksAsync() has finished successfully
    [10:10:59 INF] [Swarm] Consuming branch with root #7385496 182e57e9e5bdffe40cfc3ea58ed2b6c701f200ecd6ba0833e14611284640abc3 and tip #7385643 fd79f29d9546bbd2fbde8c1982932bbd6ad3317e7da55a155dcfa3ca9deb9a68
    [10:10:59 DBG] [Swarm] BlockCandidateProcess() starts to append; current tip is #7385394 1cfcddac89d76eb45bdf0c1f140f2a55be009740831c1a1e47b029306f2f3674
    [10:10:59 ERR] [Swarm] BlockCandidateProcess() has failed to append blocks
    System.Collections.Generic.KeyNotFoundException: The given hash[ead9480e2ac1ec194e3650c6333b94259f325d35df25ca76e0e55cb84ed3045a] was not found in this chain.

query { nodeStatus { tip { index } } } indicates tip as 7385394

from here, tipHeader is 7385394 and root is 7385496.

What is clear, it's not the problem with +-1 indexing, it's much likely to be related with gap between Store and BlockChain.

moreal commented 11 months ago

Reason

I also looked into this issue. As @OnedgeLee had researched, it is the gap between Store and BlockChain. In detail, I think it is the problem of using different block storage sources than the gap between Store and BlockChain.

Let's do a root cause analysis.

Its symptom is throwing KeyNotFoundException:

[10:10:59 ERR] [Swarm] BlockCandidateProcess() has failed to append blocks
System.Collections.Generic.KeyNotFoundException: The given hash[ead9480e2ac1ec194e3650c6333b94259f325d35df25ca76e0e55cb84ed3045a] was not found in this chain.

The source thrown the exception is here:

https://github.com/planetarium/libplanet/blob/3aad8790cb03d031538787ab4b59a3fcdce820f9/Libplanet.Net/Swarm.BlockCandidate.cs#L270-L273

It tries to get the block from BlockChain when it doesn't exist in newBlocks. Then what are newBlocks? They are blocks fetched by BlockCompletion, PullBlockAsync across the network.

https://github.com/planetarium/libplanet/blob/3aad8790cb03d031538787ab4b59a3fcdce820f9/Libplanet.Net/Swarm.BlockCandidate.cs#L138-L141

And I checked that Swarm receives incorrect block hashes (i.e., not including the block hash) as GetBlockHashesMsg's response. But it works well when debugging.

But it dropped when demanding the block hash. Because it calls BlockChain.Store.ContainsBlock through Satifies(), BlockCompletion._completionPredicate.

https://github.com/planetarium/libplanet/blob/3aad8790cb03d031538787ab4b59a3fcdce820f9/Libplanet.Net/Swarm.BlockSync.cs#L133-L136

https://github.com/planetarium/libplanet/blob/3aad8790cb03d031538787ab4b59a3fcdce820f9/Libplanet.Net/BlockCompletion.cs#L114-L118

https://github.com/planetarium/libplanet/blob/3aad8790cb03d031538787ab4b59a3fcdce820f9/Libplanet.Net/BlockCompletion.cs#L28-L31

https://github.com/planetarium/libplanet/blob/3aad8790cb03d031538787ab4b59a3fcdce820f9/Libplanet.Net/Swarm.BlockSync.cs#L91C1-L93

So if the block already exists in your store, it will occur forever 🙀

When it can occur. (guess)

How to bypass this situation

Just delete the block from your store.

IStore store = storeType.CreateStore(storePath);
var bh = BlockHash.FromString(blockHash);
store.DeleteBlock(bh);

How to resolve this situation

I think it has to get a block from the store because it checks whether the block exists in the store, not the blockchain. If it is not intended.

https://github.com/planetarium/libplanet/blob/3aad8790cb03d031538787ab4b59a3fcdce820f9/Libplanet.Net/Swarm.BlockCandidate.cs#L266C23-L274

catch (ArgumentNullException)
{
    newTip = BlockChain.Store.GetBlock(bPrev);
}
catch (InvalidOperationException)
{
    newTip = BlockChain.Store.GetBlock(bPrev);
}

But it was the incorrect way... It should replace _completionPredicate with BlockChain.ContainsBlock