pingcap / tidb

TiDB - the open-source, cloud-native, distributed SQL database designed for modern applications.
https://pingcap.com
Apache License 2.0
37.28k stars 5.84k forks source link

Support `STORING` columns INDEX #18616

Open zhangjinpeng87 opened 4 years ago

zhangjinpeng87 commented 4 years ago

Feature Request

Is your feature request related to a problem? Please describe:

Assume we have a table named t, and there is a secondary index on column col3 and col4:

CREATE TABLE t (
  id INT primary key,
  col2 INT,
  col3 INT,
  col4 char(32), 
  key idx_col2_col3(col2, col3),
);

SELECT col4 from t WHERE col2=1 AND col3>5, this query will search the related id by the index idx_col2_col3 at first, and then read the value of col4, we call this process as double read.

Describe the feature you'd like:

If we store the column col4 in the index idx_col2_col3, then we can read the value of col4 directly. The CREATE statement looks like that:

CREATE TABLE t (
  id INT primary key,
  col2 INT,
  col3 INT,
  col4 char(32), 
  key idx_col2_col3(col2, col3) STORING (col4),
);

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

zz-jason commented 4 years ago

@zhangjinpeng1987 Are there any other references about the storing column index?

scsldb commented 4 years ago

col4 add to idx_col2_col3 index, maybe the problem is solved.

zhangjinpeng87 commented 4 years ago

@scsldb They are different if there are several columns that need to retrieve or the value of these columns is large. Add columns to index means these columns will be encoded to the index key(large keys are not friendly to the storage engine), but if storing columns, these columns will be stored in the value.

@zz-jason https://www.cockroachlabs.com/docs/stable/indexes.html#storing-columns

zhangjinpeng87 commented 4 years ago

@scsldb @zz-jason Cloud-Spanner also has such a feature

Following description copy from https://cloud.google.com/spanner/docs/whitepapers/optimizing-schema-design

_STORING index clause Secondary indexes allow you to find rows by attributes other than the primary key. If all the data requested is in the index itself, it can be consulted on its own without reading the primary record. This can save significant resources as no join is required.

Unfortunately, index keys are limited to 16 in number and 8 KiB in aggregate size, restricting what can be put in them. To compensate for these limitations, Cloud Spanner has the ability to store extra data in any index, via the STORING clause. STORING a column in an index results in its values being duplicated, with a copy stored in the index. You can think of an index with STORING as a simple single table materialized view (views are not natively supported in Cloud Spanner at this time).

Another useful application of STORING is as part of a NULL_FILTERED index. This allows you to define what is effectively a materialized view of a sparse subset of a table that you can scan efficiently. For example, you might create such an index on the isunread column of a mailbox to be able to serve the unread messages view in a single table scan, but without paying for a complete copy of every mailbox.

zz-jason commented 4 years ago

This feature is awesome:

But in order not to waste the storage and increase the write latency caused by storing unnecessary columns, which columns to be stored in the secondary index should be well-designed.

tirsen commented 3 years ago

This indeed looks very interesting!

It will also help in the following scenario:

CREATE TABLE payment (
  id           BIGINT PRIMARY KEY,
  sender_id    BIGINT,
  recipient_id BIGINT,
  amount_cents BIGINT
);

SELECT SUM(amount_cents) AS amount_sent
FROM payment
WHERE sender_id = ?;

SELECT SUM(amount_cents) AS amount_received
FROM payment
WHERE recipient_id = ?;

Both of these queries will do a double read and if the table is very large a large fan out scatter-gather query to fetch all the amount_cents data. Since the scatter-gather query has such a wide fan out we will not get benefit of pushing the aggregation operation down in tikv.

If we added these indexes:

ALTER TABLE payment ADD KEY idx_sender_id (sender_id) STORING (amount_cents);
ALTER TABLE payment ADD KEY idx_recipient_id (recipient_id) STORING (amount_cents);

This is very difficult to achieve in a traditionally sharded system where you can only pick one column to shard by (recipient or sender) and the other side will be punished. Regular clustered primary key indexes such as those that will be released in 5.0 will also not solve for both of these queries, only one.