pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-Compatible database for elastic scale and real-time analytics. Try AI-powered Chat2Query free at : https://www.pingcap.com/tidb-serverless/
https://pingcap.com
Apache License 2.0
36.89k stars 5.81k forks source link

Index for TiFlash #33022

Closed phamtai97 closed 2 years ago

phamtai97 commented 2 years ago

Hi all, I am running performance for TiFlash and Clickhouse to compare them. The Clickhouse can order and create the index for the table, so it can perform the SQL fastly. But when I use the TiFlash, it does not support creating the index. I think that the performance won't be as good as Clickhouse.

I have questions:

phamtai97 commented 2 years ago

Can you help me? @tiancaiamao

tiancaiamao commented 2 years ago
  1. The data of TiFlash is replicated from TiKV, How to the index of TikV hosted on TiFlash?

TiFlash can't use TiKV's index... they're two different storage engine. TiDB is the shared computing layer, it'll decide whether to send the query to TiFlash or TiKV But it can't do something like "query the index from TiKV, then use the index result to fetch data from TiFlash" It checks something like "this query is scan heavy, use TiFlash would help, so just choose TiFlash ... this query is a very simple OLTP query, TiKV is better for this task, so just choose TiKV"

  1. The TiFlash is good for scanning data or count or sum by column

Exactly! TiFlash is a column-oriented storage engine, while TiKV is row-oriented. It means for "select count(c) from t", TiKV need to fetch every row and cut the 'c' column, the read amplification is high.

phamtai97 commented 2 years ago

Thank tiancaiamao for your answer.

  1. The data of TiFlash is replicated from TiKV, How to the index of TikV hosted on TiFlash? TiFlash can't use TiKV's index... they're two different storage engine. TiDB is the shared computing layer, it'll decide whether to send the query to TiFlash or TiKV But it can't do something like "query the index from TiKV, then use the index result to fetch data from TiFlash" It checks something like "this query is scan heavy, use TiFlash would help, so just choose TiFlash ... this query is a very simple OLTP query, TiKV is better for this task, so just choose TiKV"

=> Currently, the TiFlash does not support query index => It affects the performance of query? In the future, Are you planning to implement an index for TiFlash?

  1. The TiFlash is good for scanning data or count or sum by column Exactly! TiFlash is a column-oriented storage engine, while TiKV is row-oriented. It means for "select count(c) from t", TiKV need to fetch every row and cut the 'c' column, the read amplification is high.

=> But when I explain the count query, the tidb chooses the TiKV to perform this query?

Screen Shot 2022-03-13 at 20 31 49

Do you have a comparison test between TiFlash and Clickhouse?

phamtai97 commented 2 years ago

Can you help me? @tiancaiamao

tiancaiamao commented 2 years ago

In the future, Are you planning to implement an index for TiFlash?

If you mean get index from TiKV and then use the index in TiFlash, no... the organization of data in the two engine is totally different and this is impossible...

But when I explain the count query, the tidb chooses the TiKV to perform this query?

Because where account_id = ? can use TiKV index, so it's fast to run the query in TiKV, so TiDB choose the TiKV storage engine. If this query is execute in TiFlash, the need to scan a lot of data and filter out the where account_id = ? condition to get the final result.

Do you have a comparison test between TiFlash and Clickhouse?

I don't know much about Clickhouse, and do I know much about TiFlash, neither.

phamtai97 commented 2 years ago

If you mean get index from TiKV and then use the index in TiFlash, no... the organization of data in the two engine is totally different and this is impossible.

No, I mean is TiFlash indexed in the future. You can index of the Clickhouse: https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#available-types-of-ind

tiancaiamao commented 2 years ago

I'm not familiar with the TiFlash roadmap, sorry~ PTAL @flowbehappy

flowbehappy commented 2 years ago

Hi @phamtai97,

I have questions:

  • The data of TiFlash is replicated from TiKV, How to the index of TikV hosted on TiFlash?

TiFlash synchronizes only table data from TiKV, without the indexes. And TiFlash follows the rules that data is sorted by Primary Key. Indexes like secondary indexes created by TiKV are not used by TiFlash. However, TiFlash automatically creates its own kinds of indexes which are more suitable for AP workloads. Currently, we support Min-Max Index for Integer and DateTime data types. And there are plans for more index types in the future.

  • The TiFlash is good for scanning data or count or sum by column (column is created index by TiKV)? Because I explain the SQL query, the result is TiKV. For example:

Yes. TiFlash is playing the "AP" role of HTAP in TiDB databases. So it is designed for running fast scan and fast aggregation like sum. TiFlash synchronizes table schema modifications in realtime.

About your example, did you create TiFlash replicas? TiFlash replicas should be created manually. Check this document for details:

https://docs.pingcap.com/tidb/v6.0/use-tiflash

phamtai97 commented 2 years ago

@flowbehappy thank you for your answer.