Test that the multi-cols' statistics is based on sampling.

winoros commented 3 years ago

Development Task

Currently, we are constructing the index(i.e. multi-column for most cases)'s statistics based on full data. This will make the analyze run a very long time after the command is triggered. And this will cause issue like https://github.com/pingcap/tidb/issues/23453.

So we're testing to construct the index's statistics based on sampling to see whether it will lose precision on row count estimation.

If the ones based on sampling are okay, we're able to construct the index's statistics by sampling on tidb data. Thus we don't need to scan the indexes' KV any more.

winoros commented 3 years ago

It's under testing, and will give report soon. Once it satisfies our needs, a proposal will be given.

BTW, nearly all commercial RDBMS choose statistics based on sampling as the default ones.

winoros commented 3 years ago

We've got a rather good result on the sampling tech to estimate the row count of index.

mul-cols-point-query-on-index-zipf5000-bar mul-cols-range-query-on-index-zipf5000-bar

The one whose error rate is closer to 0 is the better one. Compared with 4.0.8 you can see that it's a huge improvement.

pingcap / tidb

Test that the multi-cols' statistics is based on sampling. #23472

Development Task