Open zhangjinpeng87 opened 4 years ago
@zhangjinpeng1987 Are there any other references about the storing
column index?
col4 add to idx_col2_col3
index, maybe the problem is solved.
@scsldb They are different if there are several columns that need to retrieve or the value of these columns is large. Add columns to index means these columns will be encoded to the index key(large keys are not friendly to the storage engine), but if storing columns, these columns will be stored in the value.
@zz-jason https://www.cockroachlabs.com/docs/stable/indexes.html#storing-columns
@scsldb @zz-jason Cloud-Spanner also has such a feature
Following description copy from https://cloud.google.com/spanner/docs/whitepapers/optimizing-schema-design
_STORING index clause Secondary indexes allow you to find rows by attributes other than the primary key. If all the data requested is in the index itself, it can be consulted on its own without reading the primary record. This can save significant resources as no join is required.
Unfortunately, index keys are limited to 16 in number and 8 KiB in aggregate size, restricting what can be put in them. To compensate for these limitations, Cloud Spanner has the ability to store extra data in any index, via the STORING clause. STORING a column in an index results in its values being duplicated, with a copy stored in the index. You can think of an index with STORING as a simple single table materialized view (views are not natively supported in Cloud Spanner at this time).
Another useful application of STORING is as part of a NULL_FILTERED index. This allows you to define what is effectively a materialized view of a sparse subset of a table that you can scan efficiently. For example, you might create such an index on the isunread column of a mailbox to be able to serve the unread messages view in a single table scan, but without paying for a complete copy of every mailbox.
This feature is awesome:
But in order not to waste the storage and increase the write latency caused by storing unnecessary columns, which columns to be stored in the secondary index should be well-designed.
This indeed looks very interesting!
It will also help in the following scenario:
CREATE TABLE payment (
id BIGINT PRIMARY KEY,
sender_id BIGINT,
recipient_id BIGINT,
amount_cents BIGINT
);
SELECT SUM(amount_cents) AS amount_sent
FROM payment
WHERE sender_id = ?;
SELECT SUM(amount_cents) AS amount_received
FROM payment
WHERE recipient_id = ?;
Both of these queries will do a double read and if the table is very large a large fan out scatter-gather query to fetch all the amount_cents
data. Since the scatter-gather query has such a wide fan out we will not get benefit of pushing the aggregation operation down in tikv.
If we added these indexes:
ALTER TABLE payment ADD KEY idx_sender_id (sender_id) STORING (amount_cents);
ALTER TABLE payment ADD KEY idx_recipient_id (recipient_id) STORING (amount_cents);
This is very difficult to achieve in a traditionally sharded system where you can only pick one column to shard by (recipient or sender) and the other side will be punished. Regular clustered primary key indexes such as those that will be released in 5.0 will also not solve for both of these queries, only one.
Feature Request
Is your feature request related to a problem? Please describe:
Assume we have a table named
t
, and there is a secondary index on columncol3
andcol4
:SELECT col4 from t WHERE col2=1 AND col3>5
, this query will search the relatedid
by the indexidx_col2_col3
at first, and then read the value ofcol4
, we call this process asdouble read
.Describe the feature you'd like:
If we store the column
col4
in the indexidx_col2_col3
, then we can read the value ofcol4
directly. The CREATE statement looks like that:Describe alternatives you've considered:
Teachability, Documentation, Adoption, Migration Strategy: