tronprotocol / java-tron

Java implementation of the Tron whitepaper
GNU Lesser General Public License v3.0

Supply data splitting in system crash scenario for Toolkit #5700

Open tomatoishealthy opened 8 months ago

tomatoishealthy commented 8 months ago

Background

The current database uses a multi-instance model with a checkpoint mechanism to make write operations atomic. The checkpoint mechanism has gone through two versions, v1 and v2.

When the Toolkit splits a snapshot, the data in the checkpoint and in the DbStores must be merged to obtain the complete snapshot data.

When generating a snapshot with checkpoint v2, only the data of the latest block is read. However, a v2 checkpoint consists of multiple consecutive blocks, so data that exists only in the earlier blocks can be missed.

Example: a v2 checkpoint retains the data of three blocks: block1, block2, and block3. Suppose we need the block body for block number 2, which exists only in the data of block2 (because of a crash, DbStore did not persist it in time). If only the data of block3 is read, the result is null. To find the data, all blocks must be traversed in reverse order, from newest to oldest.

function getDataFromSourceDB(String sourceDir, String dbName, byte[] key) {
   step 1. get the checkpoint
   step 2. try to find the specified key from the checkpoint
   step 3. if nothing is found, try to read from the DbStores
}

function getCheckpointDb(String sourceDir) {
   version = getCheckpointVersion()
   if (version == 2) {
      // Problem: the returned DBInterface contains only the changeset of the
      // latest block, so data from earlier blocks is missed.
      return the DBInterface
   }

   if (version == 1) { return tmp db }
}
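
The gap described in the example can be reproduced with plain in-memory maps. The following is only a sketch under stated assumptions: the class name, the string keys, and the list-of-maps model of a v2 checkpoint are all hypothetical and are not the Toolkit's data structures. It also ignores delete markers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: each map is the changeset of one block in a v2 checkpoint,
// ordered from oldest (index 0) to newest (last index).
class CheckpointLookupSketch {

  // Mirrors the problematic behavior: only the newest changeset is consulted.
  static String readLatestOnly(List<Map<String, String>> changesets, String key) {
    if (changesets.isEmpty()) {
      return null;
    }
    return changesets.get(changesets.size() - 1).get(key);
  }

  // Walks the changesets from newest to oldest; the first hit wins.
  static String readNewestFirst(List<Map<String, String>> changesets, String key) {
    for (int i = changesets.size() - 1; i >= 0; i--) {
      String value = changesets.get(i).get(key);
      if (value != null) {
        return value;
      }
    }
    return null;
  }

  // Three changesets, each holding only the body of its own block.
  static List<Map<String, String>> sample() {
    List<Map<String, String>> changesets = new ArrayList<>();
    for (int n = 1; n <= 3; n++) {
      Map<String, String> changeset = new HashMap<>();
      changeset.put("block-body:" + n, "body" + n);
      changesets.add(changeset);
    }
    return changesets;
  }

  public static void main(String[] args) {
    List<Map<String, String>> changesets = sample();
    System.out.println(readLatestOnly(changesets, "block-body:2"));  // prints null
    System.out.println(readNewestFirst(changesets, "block-body:2")); // prints body2
  }
}
```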

Rationale

For security and convenience, all database queries should go through a single external interface. Accessing the database without going through this interface should be prohibited.

The interface should meet the following conditions:

  1. Identify the checkpoint version of the current database and query the checkpoint data correctly for that version
  2. Merge the data in the Stores and the checkpoints and return the correct result
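
As a rough illustration of the first condition, the version could be inferred from whether any v2 checkpoints exist on disk, which is also the test used by getDataFromSourceDB() in the implementation section. This is a minimal sketch, not the Toolkit's code: the class and method names are hypothetical, and the directory name "checkpoint" is an assumption about where v2 checkpoints live.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CheckpointVersionSketch {

  // Assumption: v2 checkpoints are sub-directories of <sourceDir>/checkpoint.
  static final String CHECKPOINT_V2_DIR = "checkpoint";

  // Returns the names of the v2 checkpoints in ascending order, or an empty list.
  static List<String> listCheckpointV2(Path sourceDir) {
    Path dir = sourceDir.resolve(CHECKPOINT_V2_DIR);
    if (!Files.isDirectory(dir)) {
      return List.of();
    }
    try (Stream<Path> children = Files.list(dir)) {
      return children.map(p -> p.getFileName().toString())
          .sorted()
          .collect(Collectors.toList());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Treats the database as v2 when at least one v2 checkpoint exists, else v1.
  static int detectVersion(Path sourceDir) {
    return listCheckpointV2(sourceDir).isEmpty() ? 1 : 2;
  }

  // Demo helper: creates a temporary database root with the given v2 checkpoints.
  static Path tempRoot(String... checkpointNames) {
    try {
      Path root = Files.createTempDirectory("db-sketch");
      for (String name : checkpointNames) {
        Files.createDirectories(root.resolve(CHECKPOINT_V2_DIR).resolve(name));
      }
      return root;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(detectVersion(tempRoot()));                // prints 1
    System.out.println(detectVersion(tempRoot("1700000000000"))); // prints 2
  }
}
```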

Specification

Test Specification

Scope Of Impact

This issue fixes the data inconsistency caused by splitting the DB with the Toolkit. It does not affect the fullnode.

Implementation

The changes:

  1. Add a new global HashMap, checkpointV2FlatMap, to store the data of checkpoint v2
  2. Check the current checkpoint version when the Toolkit starts
    • If the version is v2, obtain the checkpoint list and merge all data into checkpointV2FlatMap in order; this is done by initFlatCheckpointV2(). This logic must run as the first step of service startup so that any subsequent read operation gets the correct data.
  3. Encapsulate the query interface getDataFromSourceDB() and route all queries against the original database through it

For getDataFromSourceDB():

getDataFromSourceDB() logic

public byte[] getDataFromSourceDB(String sourceDir, String dbName, byte[] key)
    throws IOException, RocksDBException {
  // Checkpoint keys are prefixed with the encoded database name.
  byte[] keyInCp = Bytes.concat(simpleEncode(dbName), key);
  byte[] valueInCp = null;
  DBInterface sourceDb = DbTool.getDB(sourceDir, dbName);
  // Read from the checkpoint first.
  if (getCheckpointV2List(sourceDir).size() > 0) {
    // v2: look up the flattened map built by initFlatCheckpointV2().
    valueInCp = checkpointV2FlatMap.get(WrappedByteArray.of(keyInCp));
  } else {
    // v1: read the checkpoint database directly.
    valueInCp = DbTool.getDB(sourceDir, CHECKPOINT_DB).get(keyInCp);
  }
  byte[] value;
  if (isEmptyBytes(valueInCp)) {
    // Not in the checkpoint: fall back to the DbStore.
    value = sourceDb.get(key);
  } else {
    // The first byte is the operator; the remaining bytes are the value.
    value = DBUtils.Operator.DELETE.getValue() == valueInCp[0]
        ? null : Arrays.copyOfRange(valueInCp, 1, valueInCp.length);
  }
  if (isEmptyBytes(value)) {
    throw new RuntimeException(String.format("data not found in store, dbName: %s, key: %s",
        dbName, Arrays.toString(key)));
  }
  return value;
}
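
The two conventions the method relies on, a checkpoint key namespaced by the database name and a checkpoint value whose first byte is an operator, can be shown in isolation. This is a sketch under loud assumptions: the class is hypothetical, the operator codes (2 for delete, 3 for put) are assumed and should be taken from DBUtils.Operator in real code, and the length-prefixed layout in checkpointKey() is only a guess at what simpleEncode() produces.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CheckpointEntrySketch {

  // Assumed operator codes; the real values come from DBUtils.Operator.
  static final byte DELETE = 2;
  static final byte PUT = 3;

  // Assumed layout: 4-byte big-endian name length, then the name, then the raw key.
  static byte[] checkpointKey(String dbName, byte[] key) {
    byte[] name = dbName.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(4 + name.length + key.length)
        .putInt(name.length)
        .put(name)
        .put(key)
        .array();
  }

  // Merges a checkpoint entry with the store value:
  // no checkpoint entry -> store value; delete marker -> null; otherwise the payload.
  static byte[] resolve(byte[] valueInCp, byte[] storeValue) {
    if (valueInCp == null || valueInCp.length == 0) {
      return storeValue;
    }
    if (valueInCp[0] == DELETE) {
      return null;
    }
    return Arrays.copyOfRange(valueInCp, 1, valueInCp.length);
  }

  public static void main(String[] args) {
    byte[] store = {1};
    System.out.println(Arrays.toString(resolve(new byte[] {PUT, 7, 8}, store))); // prints [7, 8]
    System.out.println(Arrays.toString(resolve(new byte[] {DELETE}, store)));    // prints null
    System.out.println(Arrays.toString(resolve(null, store)));                   // prints [1]
  }
}
```

Note that in getDataFromSourceDB() a delete marker in the checkpoint ends in the "data not found" exception even if the DbStore still holds an older value, because the checkpoint entry takes precedence.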

Init checkpointV2FlatMap

public void initFlatCheckpointV2(String path)
    throws IOException, RocksDBException {
  List<String> cpList = getCheckpointV2List(path);
  if (cpList.size() == 0) {
    return;
  }
  checkpointV2FlatMap = Maps.newHashMap();
  // Iterate in list order so that entries from later checkpoints overwrite
  // earlier ones (this assumes getCheckpointV2List() returns the oldest first).
  for (String cp : cpList) {
    DBInterface db = DbTool.getDB(path + "/" + DBUtils.CHECKPOINT_DB_V2, cp);
    DBIterator it = db.iterator();
    it.seekToFirst();
    while (it.hasNext()) {
      checkpointV2FlatMap.put(WrappedByteArray.of(it.getKey()), it.getValue());
      it.next();
    }
    it.close();
  }
}
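
Two details of the merge above are easy to miss: a raw byte[] cannot serve as a HashMap key because arrays compare by identity, and merging oldest to newest lets the newest write win. The sketch below illustrates both with the JDK only. It is not the Toolkit's code: the class is hypothetical, ByteBuffer stands in for WrappedByteArray, and the nested lists stand in for the checkpoint databases.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FlatCheckpointSketch {

  static byte[] bytes(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  // ByteBuffer implements equals/hashCode by content, unlike byte[].
  static ByteBuffer wrap(byte[] key) {
    return ByteBuffer.wrap(key);
  }

  // Folds the checkpoints into one map; later checkpoints overwrite earlier ones.
  static Map<ByteBuffer, byte[]> flatten(
      List<List<Map.Entry<byte[], byte[]>>> checkpointsOldestFirst) {
    Map<ByteBuffer, byte[]> flat = new HashMap<>();
    for (List<Map.Entry<byte[], byte[]>> checkpoint : checkpointsOldestFirst) {
      for (Map.Entry<byte[], byte[]> entry : checkpoint) {
        flat.put(wrap(entry.getKey()), entry.getValue());
      }
    }
    return flat;
  }

  // Two checkpoints: the older writes k=v1 and other=x, the newer writes k=v2.
  static List<List<Map.Entry<byte[], byte[]>>> sample() {
    return List.of(
        List.of(Map.entry(bytes("k"), bytes("v1")), Map.entry(bytes("other"), bytes("x"))),
        List.of(Map.entry(bytes("k"), bytes("v2"))));
  }

  static String demoLookup(String key) {
    byte[] value = flatten(sample()).get(wrap(bytes(key)));
    return value == null ? null : new String(value, StandardCharsets.UTF_8);
  }

  // Shows why raw arrays fail as keys: an equal-content array is a different key.
  static boolean rawArrayKeyFound() {
    Map<byte[], byte[]> raw = new HashMap<>();
    raw.put(bytes("k"), bytes("v1"));
    return raw.containsKey(bytes("k"));
  }

  public static void main(String[] args) {
    System.out.println(demoLookup("k"));       // prints v2
    System.out.println(demoLookup("other"));   // prints x
    System.out.println(rawArrayKeyFound());    // prints false
  }
}
```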

public void generateSnapshot(String sourceDir, String snapshotDir) {
  ...
  snapshotDir = Paths.get(snapshotDir, SNAPSHOT_DIR_NAME).toString();
  try {
    initFlatCheckpointV2(sourceDir);
    hasEnoughBlock(sourceDir);
    }
  ......
}

public void generateHistory(String sourceDir, String historyDir) {
    ....
    historyDir = Paths.get(historyDir, HISTORY_DIR_NAME).toString();
    try {
      if (isLite(sourceDir)) {
        throw new IllegalStateException(
            String.format("Unavailable sourceDir: %s is not fullNode data.", sourceDir));
      }
      initFlatCheckpointV2(sourceDir);
      hasEnoughBlock(sourceDir);
    }
    .....
  }

public void completeHistoryData(String historyDir, String liteDir) {
  ....
  try {
    // check historyDir is from lite data
    if (isLite(historyDir)) {
      throw new IllegalStateException(
          String.format("Unavailable history: %s is not generated by fullNode data.",
              historyDir));
    }
    initFlatCheckpointV2(liteDir);
   ....    
}

halibobo1205 commented 8 months ago

snapshot generation: is this org.tron.core.db2.core.Snapshot?

tomatoishealthy commented 8 months ago

> snapshot generation: is this org.tron.core.db2.core.Snapshot?

Of course not. The description refers to the snapshot dataset generated by the lite tool, which is used to quick-start a fullnode. To avoid confusing readers, I will change the title.

halibobo1205 commented 8 months ago

One more question: Does it crash during snapshot generation using Litetool?

tomatoishealthy commented 8 months ago

> One more question: Does it crash during snapshot generation using Litetool?

No. The crash here means that the DbStores may have been left in a disordered state by a disaster such as an abnormal shutdown, while the data of the database as a whole is still complete.

Titles cannot convey everything because of the length limit.

Your questions are all answered in the content, so I assume it is the title that is confusing.

I haven't found a more suitable title right now, so do you have any suggestions?

halibobo1205 commented 8 months ago

@tomatoishealthy No more confusion for now.

tomatoishealthy commented 7 months ago

I would like to present this issue at Core Devs Community Call 12.

tomatoishealthy commented 7 months ago

Development has already begun and is expected to be completed next week.

tomatoishealthy commented 4 months ago

Development is essentially complete: https://github.com/tomatoishealthy/java-tron/tree/hotfix/snapshot-inconsistent-for-litetool

The plan is to merge it in release v4.8.0.

halibobo1205 commented 3 months ago

Track #5876 for progress.