fix: Incorrect unique index key when table is not intHandle & Duplicate values for unique indexes

qidi1 commented 2 years ago

What problem does this PR solve?

close #214
close #232

What is changed and how it works?
Incorrect unique index key when the table is not intHandle
- Why does this problem occur
In TiDB, if there are null values in two unique indexes, then these two unique indexes are not conflicting. In order to achieve the purpose of no conflict, TiDB encodes the rows with null in the unique index by adding the clustered index or RowID to the key of the unique index. But in TiSpark, when the clustered index is not Integer type, the clustered index will not be added after the encoded key. This leads to a conflict in TiSpark when inserting two rows with null values in the unique index when the clustered index is not of Integer type
- How do we solve this problem
The original logic only adds handle to Rowley when any col in the unique index is if the handle is of type int. It is now added in all cases. The original code:
```
      if (handle.isInt()) {
        // set appendHandle=true when any col in the unique index is null.
         ....
      }
     ...
      }
      if (appendHandle) {
        Key key = TypedKey.toTypedKey(handle, IntegerType.BIGINT);
        keys[keys.length - 1] = key;
      }
```
The changed code:
```
      // set distinct=false when any col in the unique index is null
       ....
     if (!distinct) {
         if (handle.isInt()) {
               keyCdo.write(TypedKey.toTypedKey(handle, IntegerType.BIGINT).getBytes());
           } else {
               keyCdo.write(handle.encodedAsKey());
             }
        }        
```

Duplicate values for unique indexes

Why does this problem occur

This problem is caused by the fact that when we insert data, we can't decode the Handle out of a row with a unique index conflict. For example:

First, let's assume that the unique index data in TIKV is {1: “1”} and the cluster key data is {"1":1,“1“,0}. Now we use TiSpark to insert a row of data {1,“2”,0} to TiDB.

Determine if there are conflicting rows, request the row with unique index data of 1 from TiKV and return the value as {1, "1"}, request the row with cluster index key of "2" from TiKV and return null.
Resolve the cluster index key value from the conflicting row {1: "1"} and delete the row corresponding to the primary key. However, due to the error of the decode function, we can't resolve the correct cluster index key, so we can't delete the row corresponding to the cluster index key,
We insert the unique index data {1:"2"} and the primary key data {"2":1,“2”,0} into TiKV.

At this time, the unique index data in TiKV is {1:"2"} and the primary key data is {"1":1,“1”,0} and {"2":1,"2",0}. Two unique indexes of the same data appear in the database.

How we solve this problem

The index value layout is like that.

   // Value layout:
   //       +-- IndexValueVersion0  (with restore data, or common handle, or index is global)
   //       |
   //       |  Layout: TailLen | Options      | Padding      | [IntHandle] | [UntouchedFlag]
   //       |  Length:   1     | len(options) | len(padding) |    8        |     1
   //       |
   //       |  TailLen:       len(padding) + len(IntHandle) + len(UntouchedFlag)
   //       |  Options:       Encode some value for new features, such as common handle, new collations or global index.
   //       |                 See below for more information.
   //       |  Padding:       Ensure length of value always >= 10. (or >= 11 if UntouchedFlag exists.)
   //       |  IntHandle:     Only exists when table use int handles and index is unique.
   //       |  UntouchedFlag: Only exists when index is untouched.
   //       |
   //       +-- Old Encoding (without restore data, integer handle, local)
   //       |
   //       |  Layout: [Handle] | [UntouchedFlag]
   //       |  Length:   8      |     1
   //       |
   //       |  Handle:        Only exists in unique index.
   //       |  UntouchedFlag: Only exists when index is untouched.
   //       |
   //       |  If neither Handle nor UntouchedFlag exists, value will be one single byte '0' (i.e. []byte{'0'}).
   //       |  Length of value <= 9, use to distinguish from the new encoding.
   //       |
   //       +-- IndexValueForClusteredIndexVersion1
   //       |
   //       |  Layout: TailLen |    VersionFlag  |    Version     ｜ Options      |   [UntouchedFlag]
   //       |  Length:   1     |        1        |      1         |  len(options) |         1
   //       |
   //       |  TailLen:       len(UntouchedFlag)
   //       |  Options:       Encode some value for new features, such as common handle, new collations or global index.
   //       |                 See below for more information.
   //       |  UntouchedFlag: Only exists when index is untouched.
   //       |
   //       |  Layout of Options:
   //       |
   //       |     Segment:             Common Handle                 |     Global Index      |   New Collation
   //       |     Layout:  CHandle flag | CHandle Len | CHandle      | PidFlag | PartitionID |    restoreData
   //       |     Length:     1         | 2           | len(CHandle) |    1    |    8        |   len(restoreData)
   //       |

When the clustered index is not int type, the original decodeHandle cannot decode the value of unqiue index correctly, so we added a new decode method decodeHandle.The new decode code is same as TiDB

The logic for decoding is shown in the flowchart

 graph TD
      A[decodeHandleInUniqueIndexValue] --> B{isCommHandle}
      B -->|Yes| C{getIndexVersion}
      C-->|1|D[decode<br/>commonHandle version1]
        C-->|0|F[decode commmonHandle <br/>version0 new encoding]
      B --No-->E{encodeData <br/>length < 9}
      E --Yes--> G[decode version0<br/> old encoding]
      E --No--> H[decode version0<br/> new encoding]

Check List

Tests

Unit test
Integration test

Side effects

Possible performance regression

CLAassistant commented 2 years ago

All committers have signed the CLA.