heavyai / heavydb

HeavyDB (formerly OmniSciDB)
https://heavy.ai
Apache License 2.0
2.96k stars 448 forks source link

vs Oracle11.2.0.4,Why is HeavyDB6.4 slower? #802

Closed jieguolove closed 1 year ago

jieguolove commented 1 year ago

image

Prepare data:

The addressinfo table has 30 million records。 sample and extract data from each prefecture and city as randomly as possible from the existing address library. The sample data contains two query criteria of regionid,code. Test method: execute the query statement for each sample data: For every 1000 pieces of data processed by select id from addressinfo where regionid='xxx' and code like'% xxx%', record the time of this round: (seconds). HeavyDB:490 、 557 、 553 、 704 、 941. Oracle:611 、 470 、 514 、 525 、 544. Overall, Oracle queries are faster and more stable than HeavyDB.

Other observed conclusions: If only 1000 pieces of data are repeated five times, the HeavyDB cache works, and the execution time is: 590,20,19,19,20,21. But the HeavyDB cache didn't work if it was executed five times for 2000 pieces of data, and it was even slower: 5799,1681, 1593, 1705, 1834, 1623. If you execute an equal query: select id from addressinfo where regionid='xxx' and code = 'xxx', Oracle is 10 times more efficient than HeavyDB.

Why is HeavyDB slower? What SQL syntax are the advantages of heavydb?

Thanks!

cdessanti commented 1 year ago

Hi, We have already discussed this topic in the community forum. To recap, the optimizer generates a plan that requires two LLVM compilations each time the literal in like '%literal%' is changed. This operation is costly, especially when the query is executed against a table with such few records. In other words, the query plan that should be reused is regenerated each time, which is the reason for the slowness, and we are currently investigating the issue.

Running the same query in CPU mode is approximately three times faster. So, even with this problem, it should be faster than Oracle.

Link to the community forum post

If you run the query without a 'like' clause, using two equality filters (e.g. select id from addressinfo where regionid='xxx' and code = 'xxx',) the plan is reused, and recompilation doesn't occur. Consequently, each of the queries is likely to run in 20ms in GPU mode. I recently executed 100,000 queries, each with a different filter combination, and it took just 2000 seconds on a table with 130 million records on a single GPU. Due to the relatively low number of records used, the GPU usage remained under 30%.

In any case, our database primarily serves analytical processing, excelling in running analytical queries against flat tables or star schemas. Have you ever tried the SSB Benchmark? If you're interested, I can provide you with instructions to set up the tables with data for a Scaling Factor of 100 in just a matter of minutes.

jieguolove commented 1 year ago

But our actual test results are like the above data, sometimes slower than oracle? When oracle has an index, the equivalent query is also faster than heavydb.How to explain? thanks!

cdessanti commented 1 year ago

Hi, It's possible that our results differ because we are using different data types in table creation. To facilitate analysis, could you please share the result of SHOW CREATE TABLE addressinfo;, along with the timing results of one or more query executions using the heavysql tool with the timing enabled using the \timing command?" We expect an output like this

heavysql> \timing
heavysql> select count(*) from addressinfo;
EXPR$0
30000000
1 rows returned.
Execution time: 32 ms, Total time: 32 ms
heavysql> show create table addressinfo;
Result
CREATE TABLE addressinfo (
  id INTEGER NOT NULL,
  region TEXT ENCODING DICT(32),
  codi TEXT ENCODING NONE);
1 rows returned.
Execution time: 0 ms, Total time: 2 ms
heavysql> select id from addressinfo where region='Harvey County' and codi like '%USEP%';
id
7962700
5358767 
[CUTTED RESULTS]
25931749
5769257
384 rows returned.
Execution time: 859 ms, Total time: 860 ms
heavysql> select id from addressinfo where region='Hidalgo County' and codi like '%ARIA%';
id
19169584
4328935
[CUTTED RESULTS]
26109416
13821300
79 rows returned.
Execution time: 21 ms, Total time: 22 ms
heavysql> select id from addressinfo where region='Laurens County' and codi like '%ETRO%';
id
29610690
11444765
[CUTTED RESULTS]
18252423
20872640
247 rows returned.
Execution time: 19 ms, Total time: 20 ms
heavysql> select id from addressinfo where region='Scott County' and codi like '%TTORI%';
id
10207403
10070795
[CUTTED RESULTS]
19785019
7868219
695 rows returned.
Execution time: 19 ms, Total time: 20 ms
heavysql> \memory_summary
HeavyDB Server CPU Memory Summary:
            MAX            USE      ALLOCATED           FREE
    51316.35 MB      844.79 MB     4096.00 MB     3251.21 MB
HeavyDB Server GPU Memory Summary:
[GPU]            MAX            USE      ALLOCATED           FREE
  [0]    10619.69 MB      730.35 MB     4096.00 MB     3365.65 MB

I've truncated the rows in the result for better post-readability. As you can observe, the initial query takes over 800ms, because load the required columns from disk to CPU/GPU cache (These caches are equivalent to the Buffer Cache section of Oracle's SGA.), and to prepare the execution plan for the statement (just when Oracle saves the execution plan on Shared Pool).

Subsequent queries, while utilizing different filters for regions and codes, reuse the same execution plan from the first query, fetching data from the CPU/GPU Caches.

You can examine and test your queries on my table by restoring a public dump of the table I've shared on an Amazon S3 bucket. You can do this by running the following command in heavysql (or the tool of your choice). restore table addressinfo_2 from 's3://candidodessanti/addressinfo/addressinfo.dmp.gz' with (s3_region='eu-west-1');

I await receiving the output of the requested commands from your system to assist you better.

Candido.

cdessanti commented 1 year ago

For reference this is my system

heavysql> \hardware_info
===========================================
Number of CPU core       :24
Number of GPUs allocated :2
Start GPU                :0
-------------------------------------------
Number of SM             :68
Clock frequency          :1508 MHz
Physical GPU Memory      :10 GB
Compute capability       :7.5
-------------------------------------------
===========================================
heavysql> \status
Server Version                      : 7.0.0-20230926-d315676d23 
--------------------------------------------------
Server Name                         : zion-tr
Server Start Time                   : 2023-10-09 : 14:0:28
jieguolove commented 1 year ago

For reference this is my system

heavysql> \hardware_info
===========================================
Number of CPU core       :24
Number of GPUs allocated :2
Start GPU                :0
-------------------------------------------
Number of SM             :68
Clock frequency          :1508 MHz
Physical GPU Memory      :10 GB
Compute capability       :7.5
-------------------------------------------
===========================================
heavysql> \status
Server Version                      : 7.0.0-20230926-d315676d23 
--------------------------------------------------
Server Name                         : zion-tr
Server Start Time                   : 2023-10-09 : 14:0:28

heavysql> show create table addressinfo;
Result
CREATE TABLE ADDRESSINFO (
id TEXT ENCODING DICT(32),
ods_id TEXT ENCODING DICT(32),
code TEXT ENCODING DICT(32),
province TEXT ENCODING DICT(32),
city TEXT ENCODING DICT(32),
regionid TEXT ENCODING DICT(32),
region_name TEXT ENCODING DICT(32),
street_id TEXT ENCODING DICT(32),
street_name TEXT ENCODING DICT(32),
road_id TEXT ENCODING DICT(32),
road_name TEXT ENCODING DICT(32),
number_plate_id TEXT ENCODING DICT(32),
number_plate_name TEXT ENCODING DICT(32),
cpn_id TEXT ENCODING DICT(32),
cpn_code TEXT ENCODING DICT(32),
community_name TEXT ENCODING DICT(32),
premises_network TEXT ENCODING DICT(32),
building_code TEXT ENCODING DICT(32),
building TEXT ENCODING DICT(32),
unit_id TEXT ENCODING DICT(32),
unit TEXT ENCODING DICT(32),
layer_id TEXT ENCODING DICT(32),
layer TEXT ENCODING DICT(32),
room_id TEXT ENCODING DICT(32),
room TEXT ENCODING DICT(32),
rms_code TEXT ENCODING DICT(32),
substation_id TEXT ENCODING DICT(32),
simple_mark TEXT ENCODING DICT(32),
inner_symbol_id TEXT ENCODING DICT(32),
if_inner TEXT ENCODING DICT(32),
switch_no TEXT ENCODING DICT(32),
type INTEGER,
queryindex TEXT ENCODING DICT(32),
remark TEXT ENCODING DICT(32),
alias_name TEXT ENCODING DICT(32),
if_relate TEXT ENCODING DICT(32),
addresstype TEXT ENCODING DICT(32),
source TEXT ENCODING DICT(32),
create_time TIMESTAMP(0),
fldtag TEXT ENCODING DICT(32),
original_company TEXT ENCODING DICT(32),
useremark TEXT ENCODING DICT(32),
village_code TEXT ENCODING DICT(32),
uptimestamp_time TIMESTAMP(0),
create_op TEXT ENCODING DICT(32));
heavysql> 
heavysql> \hardware_info
===========================================
Number of Physical GPUs  :1
Number of CPU core       :32
Number of GPUs allocated :1
Start GPU                :0
-------------------------------------------
Number of SM             :108
Clock frequency          :1376 MHz
Physical GPU Memory      :39 GB
Compute capability       :8.0
-------------------------------------------
heavysql> \status
Server Version                      : 6.4.0-20230201-87f6fa37b4 
--------------------------------------------------
Server Name                         : node13
Server Start Time                   : 2023-10-07 : 8:29:4
heavysql> \timing
heavysql> select count(*) from addressinfo;
EXPR$0
30748154
1 rows returned.
Execution time: 154 ms, Total time: 155 ms
heavysql> \memory_summary
HeavyDB Server CPU Memory Summary:
MAX            USE      ALLOCATED           FREE
206240.20 MB        0.00 MB        0.00 MB        0.00 MB

HeavyDB Server GPU Memory Summary: [GPU] MAX USE ALLOCATED FREE [0] 39986.00 MB 0.00 MB 4096.00 MB 4096.00 MB

heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%南四经街104-2号2-1-2%'; IDbcode 112121212南四经街104-2号2-1-2 1 rows returned. Execution time: 7070 ms, Total time: 7072 ms heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路104-2号2-1%'; No rows returned. Execution time: 3594 ms, Total time: 3595 ms heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路104-2号2-2%'; No rows returned. Execution time: 2982 ms, Total time: 2984 ms heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路104%'; IDbcode 318808b中山路104号 3659012b中山路104号1楼工行中广支行 2676280b中山路104号-工商银行中山广场支行 3665493b中山路104号1楼工行中广支行 4 rows returned. Execution time: 3367 ms, Total time: 3368 ms heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%南四经街104-2号2-1-3%'; IDbcode 112121212南四经街104-2号2-1-3 1 rows returned. Execution time: 2872 ms, Total time: 2874 ms heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%南四经街104-2号2-1-3%'; IDbcode 112121212南四经街104-2号2-1-3 1 rows returned. Execution time: 33 ms, Total time: 35 ms heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%南四经街104-2号2-1-1%'; IDbcode 1942251113南四经街104-2号2-1-1 1 rows returned. Execution time: 798 ms, Total time: 799 ms heavysql> \memory_summary HeavyDB Server CPU Memory Summary: MAX USE ALLOCATED FREE 206240.20 MB 351.88 MB 4096.00 MB 3744.12 MB

HeavyDB Server GPU Memory Summary: [GPU] MAX USE ALLOCATED FREE [0] 39986.00 MB 234.59 MB 4096.00 MB 3861.41 MB

heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路105%'; IDbcode 1222中山路105号 ssdfdf中山路105甲 2 rows returned. Execution time: 3305 ms, Total time: 3306 ms heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路105%'; IDbcode 23223中山路105号 23232323中山路105甲 2 rows returned. Execution time: 30 ms, Total time: 31 ms heavysql> select count(*) from addressinfo; EXPR$0 30748154 1 rows returned. Execution time: 24 ms, Total time: 25 ms heavysql> #

jieguolove commented 1 year ago

I noticed the TEXT field type. When do I use "encoding dict (32) "or "encoding none"?

and The field type cannot be modified even if there is no record in the table:

heavysql> \d a
CREATE TABLE a (
  id INTEGER NOT NULL,
  region TEXT ENCODING DICT(32),
  codi TEXT ENCODING NONE);
heavysql> alter table a alter column codi type text;
ALTER TABLE ALTER COLUMN is not implemented.
heavysql> CREATE TABLE b(
..>   id INTEGER NOT NULL,
..>   region TEXT ENCODING DICT(32),
..>   codi TEXT ENCODING DICT(32));
heavysql> \d b
CREATE TABLE b (
  id INTEGER NOT NULL,
  region TEXT ENCODING DICT(32),
  codi TEXT ENCODING DICT(32));
heavysql> alter table b alter column codi type text;
ALTER TABLE ALTER COLUMN is not implemented.
heavysql> alter table b alter column codi type TEXT ENCODING NONE;
ALTER TABLE ALTER COLUMN is not implemented.
heavysql> select count(*) from a;
EXPR$0
0
1 rows returned.
Execution time: 20 ms, Total time: 20 ms
heavysql> select count(*) from b;
EXPR$0
0
1 rows returned.
Execution time: 18 ms, Total time: 19 ms
heavysql> 
cdessanti commented 1 year ago

We've encountered a sort of bug that was causing the regeneration of the execution plan when using a LIKE filter on a TEXT dictionary-encoded string. This issue has already been addressed in the current master branch and is likely to be included in the next EE/FREE release.

However, for your specific use case, which appears to involve high cardinality strings in the 'code' field, it might be more appropriate to consider using an encoding of 'NONE.'

Additionally, I've noticed that most of your columns use the default encoding size, or in the case of timestamps, no encoding at all. To obtain an 'optimized' Data Definition Language (DDL) for your table, you can use the \o command in heavysql.

Running \o addressinfo will generate a DDL with appropriate encoding sizes for TEXTs and the Timestamp field. You can also review if all ID fields need to be stored as TEXT or can be transformed into numeric data types.

Please note that as of version edit: 6.4.0 not 7.0, in-place conversion of data types is not yet implemented. Therefore, you'll need to add a new field and perform a massive update on your table, as shown in the code snippet below:

ALTER TABLE addressinfo ADD COLUMN code_ne TEXT ENCODING NONE;
UPDATE addressinfo SET code_ne = code;
ALTER TABLE addressinfo DROP COLUMN code;
ALTER TABLE addressinfo RENAME COLUMN code_ne TO code;

Alternatively, you can simply add the new column and follow the drop-and-rename steps.

edit: In 7.0 version you should be able to alter a dict-encoded column into a none-encoded using and ALTER TABLE ... ALTER COLUMN ... TYPE, so the SQL ALTER TABLE addressinfo ALTER TABLE addressinfo ALTER COLUMN code TYPE TEXT ENCODING NONE; should do the trick.

In any case, it's advisable to review the data types with the help of the \o command and then populate the new table in the same way you loaded the current version of the table.

Let me know how it is going, and feel free to ask for further assistance (I know that changing from Oracle to other database can be quite hard).

Candido

jieguolove commented 1 year ago

We've encountered a sort of bug that was causing the regeneration of the execution plan when using a LIKE filter on a TEXT dictionary-encoded string. This issue has already been addressed in the current master branch and is likely to be included in the next EE/FREE release.

We will follow up any progress at any time.Thanks a lot!

jieguolove commented 1 year ago

The effect is not obvious after using the optimized table field type. The bug was causing the regeneration of the execution plan when using a LIKE filter on a TEXT dictionary-encoded string. We will continue to test the latest version of heavydb. Please see the test record below:

heavyai@node13:~$ heavysql -p admin -db lnlt
User admin connected to database lnlt
heavysql> \version
HeavyDB Server Version: 6.4.0-20230201-87f6fa37b4
heavysql> \t
ADDRESSINFO
SID_AREA
SID_LATN
heavysql> \o addressinfo
CREATE TABLE addressinfo (
id TEXT ENCODING DICT(32),
ods_id TEXT ENCODING DICT(32),
code TEXT ENCODING DICT(32),
province TEXT ENCODING DICT(8),
city TEXT ENCODING DICT(8),
regionid TEXT ENCODING DICT(8),
region_name TEXT ENCODING DICT(8),
street_id TEXT ENCODING DICT(16),
street_name TEXT ENCODING DICT(16),
road_id TEXT ENCODING DICT(32),
road_name TEXT ENCODING DICT(32),
number_plate_id TEXT ENCODING DICT(32),
number_plate_name TEXT ENCODING DICT(32),
cpn_id TEXT ENCODING DICT(32),
cpn_code TEXT ENCODING DICT(32),
community_name TEXT ENCODING DICT(32),
premises_network TEXT ENCODING DICT(32),
building_code TEXT ENCODING DICT(32),
building TEXT ENCODING DICT(32),
unit_id TEXT ENCODING DICT(32),
unit TEXT ENCODING DICT(32),
layer_id TEXT ENCODING DICT(32),
layer TEXT ENCODING DICT(16),
room_id TEXT ENCODING DICT(32),
room TEXT ENCODING DICT(32),
rms_code TEXT ENCODING DICT(32),
substation_id TEXT ENCODING DICT(16),
simple_mark TEXT ENCODING DICT(32),
inner_symbol_id TEXT ENCODING DICT(8),
if_inner TEXT ENCODING DICT(8),
switch_no TEXT ENCODING DICT(16),
type INTEGER ENCODING FIXED(8),
queryindex TEXT ENCODING DICT(32),
remark TEXT ENCODING DICT(16),
alias_name TEXT ENCODING DICT(32),
if_relate TEXT ENCODING DICT(8),
addresstype TEXT ENCODING DICT(8),
source TEXT ENCODING DICT(8),
create_time TIMESTAMP(0) ENCODING FIXED(32),
fldtag TEXT ENCODING DICT(8),
original_company TEXT ENCODING DICT(8),
useremark TEXT ENCODING DICT(8),
village_code TEXT ENCODING DICT(16),
uptimestamp_time TIMESTAMP(0) ENCODING FIXED(32),
create_op TEXT ENCODING DICT(8))
heavysql> CREATE TABLE addressinfo1 (
..> id TEXT ENCODING DICT(32),
..> ods_id TEXT ENCODING DICT(32),
..> code TEXT ENCODING DICT(32),
..> province TEXT ENCODING DICT(8),
..> city TEXT ENCODING DICT(8),
..> regionid TEXT ENCODING DICT(8),
..> region_name TEXT ENCODING DICT(8),
..> street_id TEXT ENCODING DICT(16),
..> street_name TEXT ENCODING DICT(16),
..> road_id TEXT ENCODING DICT(32),
..> road_name TEXT ENCODING DICT(32),
..> number_plate_id TEXT ENCODING DICT(32),
..> number_plate_name TEXT ENCODING DICT(32),
..> cpn_id TEXT ENCODING DICT(32),
..> cpn_code TEXT ENCODING DICT(32),
..> community_name TEXT ENCODING DICT(32),
..> premises_network TEXT ENCODING DICT(32),
..> building_code TEXT ENCODING DICT(32),
..> building TEXT ENCODING DICT(32),
..> unit_id TEXT ENCODING DICT(32),
..> unit TEXT ENCODING DICT(32),
..> layer_id TEXT ENCODING DICT(32),
..> layer TEXT ENCODING DICT(16),
..> room_id TEXT ENCODING DICT(32),
..> room TEXT ENCODING DICT(32),
..> rms_code TEXT ENCODING DICT(32),
..> substation_id TEXT ENCODING DICT(16),
..> simple_mark TEXT ENCODING DICT(32),
..> inner_symbol_id TEXT ENCODING DICT(8),
..> if_inner TEXT ENCODING DICT(8),
..> switch_no TEXT ENCODING DICT(16),
..> type INTEGER ENCODING FIXED(8),
..> queryindex TEXT ENCODING DICT(32),
..> remark TEXT ENCODING DICT(16),
..> alias_name TEXT ENCODING DICT(32),
..> if_relate TEXT ENCODING DICT(8),
..> addresstype TEXT ENCODING DICT(8),
..> source TEXT ENCODING DICT(8),
..> create_time TIMESTAMP(0) ENCODING FIXED(32),
..> fldtag TEXT ENCODING DICT(8),
..> original_company TEXT ENCODING DICT(8),
..> useremark TEXT ENCODING DICT(8),
..> village_code TEXT ENCODING DICT(16),
..> uptimestamp_time TIMESTAMP(0) ENCODING FIXED(32),
..> create_op TEXT ENCODING DICT(8));
heavysql> \t
ADDRESSINFO
SID_AREA
SID_LATN
addressinfo1
heavysql> \timing 
heavysql> insert into addressinfo1 select * from ADDRESSINFO;
heavysql> select count(*) from addressinfo1;
EXPR$0
30748154
1 rows returned.
Execution time: 25 ms, Total time: 28 ms
heavysql> \memory_summary
HeavyDB Server CPU Memory Summary:
            MAX            USE      ALLOCATED           FREE
   206240.20 MB     7202.44 MB     8192.00 MB      989.56 MB

HeavyDB Server GPU Memory Summary:
[GPU]            MAX            USE      ALLOCATED           FREE
  [0]    39986.00 MB     6243.21 MB     8192.00 MB     1948.79 MB

heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%南四经街104-2号2-1-2%';
IDbcode
......1 rows returned.
Execution time: 35 ms, Total time: 37 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%中山路104-2号2-1%';
No rows returned.
Execution time: 963 ms, Total time: 965 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%中山路104-2号2%';
No rows returned.
Execution time: 365 ms, Total time: 367 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%中山路104-2号2-2%';
No rows returned.
Execution time: 359 ms, Total time: 361 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%中山路104-2号2-3%';
No rows returned.
Execution time: 368 ms, Total time: 369 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%中山路104-2号%';
No rows returned.
Execution time: 361 ms, Total time: 362 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路104%';
IDbcode
......
4 rows returned.
Execution time: 27 ms, Total time: 29 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%南四经街104-2号2-1-3%';
IDbcode
......
1 rows returned.
Execution time: 25 ms, Total time: 26 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%南四经街104-2号2-1-0%';
No rows returned.
Execution time: 1786 ms, Total time: 1787 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%南四经街104-2号2%';
IDbcode
。。。。。。
28 rows returned.
Execution time: 884 ms, Total time: 885 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%南四经街104-2号2%';
IDbcode
......
28 rows returned.
Execution time: 881 ms, Total time: 883 ms
heavysql> select ID, code from addressinfo1 where regionid = '210101' and code like '%南四经街104-2号2%';
No rows returned.
Execution time: 25 ms, Total time: 26 ms
heavysql> select ID, code from addressinfo1 where regionid = '210101' and code like '%南四经街104-2号%';
No rows returned.
Execution time: 882 ms, Total time: 883 ms
heavysql> select ID, code from addressinfo1 where regionid = '210101' and code like '%南四经街104%';
No rows returned.
Execution time: 911 ms, Total time: 912 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%南四经街104-2号2%';
IDbcode
。。。。。。
28 rows returned.
Execution time: 28 ms, Total time: 29 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%中山路105%';
IDbcode
。。。。。。
2 rows returned.
Execution time: 886 ms, Total time: 888 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%中山路106%';
IDbcode
......
11 rows returned.
Execution time: 863 ms, Total time: 865 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%中山路104%';
IDbcode
。。。。。。
4 rows returned.
Execution time: 823 ms, Total time: 825 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%中山路103%';
IDbcode
。。。。。。1 rows returned.
Execution time: 811 ms, Total time: 813 ms
heavysql> \d addressinfo
CREATE TABLE ADDRESSINFO (
  id TEXT ENCODING DICT(32),
  ods_id TEXT ENCODING DICT(32),
  code TEXT ENCODING DICT(32),
  province TEXT ENCODING DICT(32),
  city TEXT ENCODING DICT(32),
  regionid TEXT ENCODING DICT(32),
  region_name TEXT ENCODING DICT(32),
  street_id TEXT ENCODING DICT(32),
  street_name TEXT ENCODING DICT(32),
  road_id TEXT ENCODING DICT(32),
  road_name TEXT ENCODING DICT(32),
  number_plate_id TEXT ENCODING DICT(32),
  number_plate_name TEXT ENCODING DICT(32),
  cpn_id TEXT ENCODING DICT(32),
  cpn_code TEXT ENCODING DICT(32),
  community_name TEXT ENCODING DICT(32),
  premises_network TEXT ENCODING DICT(32),
  building_code TEXT ENCODING DICT(32),
  building TEXT ENCODING DICT(32),
  unit_id TEXT ENCODING DICT(32),
  unit TEXT ENCODING DICT(32),
  layer_id TEXT ENCODING DICT(32),
  layer TEXT ENCODING DICT(32),
  room_id TEXT ENCODING DICT(32),
  room TEXT ENCODING DICT(32),
  rms_code TEXT ENCODING DICT(32),
  substation_id TEXT ENCODING DICT(32),
  simple_mark TEXT ENCODING DICT(32),
  inner_symbol_id TEXT ENCODING DICT(32),
  if_inner TEXT ENCODING DICT(32),
  switch_no TEXT ENCODING DICT(32),
  type INTEGER,
  queryindex TEXT ENCODING DICT(32),
  remark TEXT ENCODING DICT(32),
  alias_name TEXT ENCODING DICT(32),
  if_relate TEXT ENCODING DICT(32),
  addresstype TEXT ENCODING DICT(32),
  source TEXT ENCODING DICT(32),
  create_time TIMESTAMP(0),
  fldtag TEXT ENCODING DICT(32),
  original_company TEXT ENCODING DICT(32),
  useremark TEXT ENCODING DICT(32),
  village_code TEXT ENCODING DICT(32),
  uptimestamp_time TIMESTAMP(0),
  create_op TEXT ENCODING DICT(32));
heavysql> \d addressinfo1
CREATE TABLE addressinfo1 (
  id TEXT ENCODING DICT(32),
  ods_id TEXT ENCODING DICT(32),
  code TEXT ENCODING DICT(32),
  province TEXT ENCODING DICT(8),
  city TEXT ENCODING DICT(8),
  regionid TEXT ENCODING DICT(8),
  region_name TEXT ENCODING DICT(8),
  street_id TEXT ENCODING DICT(16),
  street_name TEXT ENCODING DICT(16),
  road_id TEXT ENCODING DICT(32),
  road_name TEXT ENCODING DICT(32),
  number_plate_id TEXT ENCODING DICT(32),
  number_plate_name TEXT ENCODING DICT(32),
  cpn_id TEXT ENCODING DICT(32),
  cpn_code TEXT ENCODING DICT(32),
  community_name TEXT ENCODING DICT(32),
  premises_network TEXT ENCODING DICT(32),
  building_code TEXT ENCODING DICT(32),
  building TEXT ENCODING DICT(32),
  unit_id TEXT ENCODING DICT(32),
  unit TEXT ENCODING DICT(32),
  layer_id TEXT ENCODING DICT(32),
  layer TEXT ENCODING DICT(16),
  room_id TEXT ENCODING DICT(32),
  room TEXT ENCODING DICT(32),
  rms_code TEXT ENCODING DICT(32),
  substation_id TEXT ENCODING DICT(16),
  simple_mark TEXT ENCODING DICT(32),
  inner_symbol_id TEXT ENCODING DICT(8),
  if_inner TEXT ENCODING DICT(8),
  switch_no TEXT ENCODING DICT(16),
  type INTEGER ENCODING FIXED(8),
  queryindex TEXT ENCODING DICT(32),
  remark TEXT ENCODING DICT(16),
  alias_name TEXT ENCODING DICT(32),
  if_relate TEXT ENCODING DICT(8),
  addresstype TEXT ENCODING DICT(8),
  source TEXT ENCODING DICT(8),
  create_time TIMESTAMP(0) ENCODING FIXED(32),
  fldtag TEXT ENCODING DICT(8),
  original_company TEXT ENCODING DICT(8),
  useremark TEXT ENCODING DICT(8),
  village_code TEXT ENCODING DICT(16),
  uptimestamp_time TIMESTAMP(0) ENCODING FIXED(32),
  create_op TEXT ENCODING DICT(8));
heavysql> alter table addressinfo rename to addressinfo_old
..> ;
heavysql> alter table addressinfo1 rename to addressinfo;
heavysql> \t
addressinfo_old
SID_AREA
SID_LATN
addressinfo
heavysql> \d addressinfo
CREATE TABLE addressinfo (
  id TEXT ENCODING DICT(32),
  ods_id TEXT ENCODING DICT(32),
  code TEXT ENCODING DICT(32),
  province TEXT ENCODING DICT(8),
  city TEXT ENCODING DICT(8),
  regionid TEXT ENCODING DICT(8),
  region_name TEXT ENCODING DICT(8),
  street_id TEXT ENCODING DICT(16),
  street_name TEXT ENCODING DICT(16),
  road_id TEXT ENCODING DICT(32),
  road_name TEXT ENCODING DICT(32),
  number_plate_id TEXT ENCODING DICT(32),
  number_plate_name TEXT ENCODING DICT(32),
  cpn_id TEXT ENCODING DICT(32),
  cpn_code TEXT ENCODING DICT(32),
  community_name TEXT ENCODING DICT(32),
  premises_network TEXT ENCODING DICT(32),
  building_code TEXT ENCODING DICT(32),
  building TEXT ENCODING DICT(32),
  unit_id TEXT ENCODING DICT(32),
  unit TEXT ENCODING DICT(32),
  layer_id TEXT ENCODING DICT(32),
  layer TEXT ENCODING DICT(16),
  room_id TEXT ENCODING DICT(32),
  room TEXT ENCODING DICT(32),
  rms_code TEXT ENCODING DICT(32),
  substation_id TEXT ENCODING DICT(16),
  simple_mark TEXT ENCODING DICT(32),
  inner_symbol_id TEXT ENCODING DICT(8),
  if_inner TEXT ENCODING DICT(8),
  switch_no TEXT ENCODING DICT(16),
  type INTEGER ENCODING FIXED(8),
  queryindex TEXT ENCODING DICT(32),
  remark TEXT ENCODING DICT(16),
  alias_name TEXT ENCODING DICT(32),
  if_relate TEXT ENCODING DICT(8),
  addresstype TEXT ENCODING DICT(8),
  source TEXT ENCODING DICT(8),
  create_time TIMESTAMP(0) ENCODING FIXED(32),
  fldtag TEXT ENCODING DICT(8),
  original_company TEXT ENCODING DICT(8),
  useremark TEXT ENCODING DICT(8),
  village_code TEXT ENCODING DICT(16),
  uptimestamp_time TIMESTAMP(0) ENCODING FIXED(32),
  create_op TEXT ENCODING DICT(8));
heavysql> \o addressinfo
CREATE TABLE addressinfo (
id TEXT ENCODING DICT(32),
ods_id TEXT ENCODING DICT(32),
code TEXT ENCODING DICT(32),
province TEXT ENCODING DICT(8),
city TEXT ENCODING DICT(8),
regionid TEXT ENCODING DICT(8),
region_name TEXT ENCODING DICT(8),
street_id TEXT ENCODING DICT(16),
street_name TEXT ENCODING DICT(16),
road_id TEXT ENCODING DICT(32),
road_name TEXT ENCODING DICT(32),
number_plate_id TEXT ENCODING DICT(32),
number_plate_name TEXT ENCODING DICT(32),
cpn_id TEXT ENCODING DICT(32),
cpn_code TEXT ENCODING DICT(32),
community_name TEXT ENCODING DICT(32),
premises_network TEXT ENCODING DICT(32),
building_code TEXT ENCODING DICT(32),
building TEXT ENCODING DICT(32),
unit_id TEXT ENCODING DICT(32),
unit TEXT ENCODING DICT(32),
layer_id TEXT ENCODING DICT(32),
layer TEXT ENCODING DICT(16),
room_id TEXT ENCODING DICT(32),
room TEXT ENCODING DICT(32),
rms_code TEXT ENCODING DICT(32),
substation_id TEXT ENCODING DICT(16),
simple_mark TEXT ENCODING DICT(32),
inner_symbol_id TEXT ENCODING DICT(8),
if_inner TEXT ENCODING DICT(8),
switch_no TEXT ENCODING DICT(16),
type INTEGER ENCODING FIXED(8),
queryindex TEXT ENCODING DICT(32),
remark TEXT ENCODING DICT(16),
alias_name TEXT ENCODING DICT(32),
if_relate TEXT ENCODING DICT(8),
addresstype TEXT ENCODING DICT(8),
source TEXT ENCODING DICT(8),
create_time TIMESTAMP(0) ENCODING FIXED(32),
fldtag TEXT ENCODING DICT(8),
original_company TEXT ENCODING DICT(8),
useremark TEXT ENCODING DICT(8),
village_code TEXT ENCODING DICT(16),
uptimestamp_time TIMESTAMP(0) ENCODING FIXED(32),
create_op TEXT ENCODING DICT(8))
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路106%';
IDbcode
......
11 rows returned.
Execution time: 26 ms, Total time: 28 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路107%';
No rows returned.
Execution time: 3219 ms, Total time: 3220 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路100%';
IDbcode
......1 rows returned.
Execution time: 858 ms, Total time: 859 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路100%';
IDbcode
......1 rows returned.
Execution time: 26 ms, Total time: 27 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路101%';
IDbcode
......1 rows returned.
Execution time: 801 ms, Total time: 802 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code = '中山路101';
No rows returned.
Execution time: 489 ms, Total time: 491 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code = '中山路101';
No rows returned.
Execution time: 22 ms, Total time: 22 ms
heavysql> \q
User admin disconnected from database lnlt
heavyai@node13:~$ exit
logout
root@node13:~/heavydb/build-package# systemctl stop heavydb
root@node13:~/heavydb/build-package# systemctl start heavydb
root@node13:~/heavydb/build-package# su - heavyai
heavyai@node13:~$ heavysql -p admin -db lnlt
User admin connected to database lnlt
heavysql> \t
addressinfo_old
SID_AREA
SID_LATN
addressinfo
heavysql> \timing
heavysql> select ID, code from addressinfo where regionid = '210102' and code = '中山路101';
No rows returned.
Execution time: 4512 ms, Total time: 4631 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路101%';
IDbcode
......
1 rows returned.
Execution time: 838 ms, Total time: 904 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路101%';
IDbcode
......
1 rows returned.
Execution time: 32 ms, Total time: 86 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路100%';
IDbcode
......
1 rows returned.
Execution time: 835 ms, Total time: 906 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路100%';
IDbcode
......
1 rows returned.
Execution time: 33 ms, Total time: 34 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路101%';
IDbcode
......
1 rows returned.
Execution time: 34 ms, Total time: 35 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%南四经街104-2号2-1-2%';
IDbcode
......-2号2-1-2
1 rows returned.
Execution time: 803 ms, Total time: 804 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路104-2号2-1%';
No rows returned.
Execution time: 802 ms, Total time: 803 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路104%';
IDbcode
......
4 rows returned.
Execution time: 796 ms, Total time: 798 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%南四经街104-2号2-1-3%';
IDbcode
......-2号2-1-3
1 rows returned.
Execution time: 884 ms, Total time: 885 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%南四经街104-2号2-1-3%';
IDbcode
......
1 rows returned.
Execution time: 36 ms, Total time: 36 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%南四经街104-2号2-1-1%';
IDbcode
......
1 rows returned.
Execution time: 852 ms, Total time: 854 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路105%';
IDbcode
......
2 rows returned.
Execution time: 983 ms, Total time: 984 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%中山路105%';
IDbcode
......
2 rows returned.
Execution time: 29 ms, Total time: 30 ms
heavysql> select count(*) from addressinfo;
EXPR$0
30748154
1 rows returned.
Execution time: 285 ms, Total time: 286 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code = '中山路102';
No rows returned.
Execution time: 29 ms, Total time: 30 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code = '中山路103';
No rows returned.
Execution time: 29 ms, Total time: 30 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code = '中山路104';
No rows returned.
Execution time: 25 ms, Total time: 26 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code = '中山路105';
No rows returned.
Execution time: 32 ms, Total time: 33 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code = '中山路';
No rows returned.
Execution time: 29 ms, Total time: 30 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code = '中山路105甲';
IDbcode
......
1 rows returned.
Execution time: 28 ms, Total time: 29 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%南四经街104-1%';
IDbcode
......
58 rows returned.
Execution time: 4733 ms, Total time: 4737 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%南四经街104-2%';
IDbcode
。。。。。。
77 rows returned.
Execution time: 4440 ms, Total time: 4443 ms
heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%南四经街104-1%';
IDbcode
......
58 rows returned.
Execution time: 33 ms, Total time: 36 ms

heavysql> \memory_summary
HeavyDB Server CPU Memory Summary:
            MAX            USE      ALLOCATED           FREE
   206240.20 MB      263.91 MB     4096.00 MB     3832.09 MB

HeavyDB Server GPU Memory Summary:
[GPU]            MAX            USE      ALLOCATED           FREE
  [0]    39986.00 MB      146.62 MB     4096.00 MB     3949.38 MB

heavysql> \status
Server Version                      : 6.4.0-20230201-87f6fa37b4 
--------------------------------------------------
Server Name                         : node13
Server Start Time                   : 2023-10-11 : 1:41:43

heavysql> \hardware_info
===========================================
Number of Physical GPUs  :1
Number of CPU core       :32
Number of GPUs allocated :1
Start GPU                :0
-------------------------------------------
Number of SM             :108
Clock frequency          :1376 MHz
Physical GPU Memory      :39 GB
Compute capability       :8.0
-------------------------------------------

===========================================
heavysql> 
cdessanti commented 1 year ago

Hello,

I apologize if I haven't been clear enough regarding the changes to the data types.

To achieve optimal performance, it is necessary to change the data type of the 'code' column from TEXT ENCODED DICT(32) to TEXT ENCODING NONE.

To do this in version 6.4.4, you have two options:

  1. Recreate the table, modifying the 'code' column's data type, and then run an ITAS, just as you did after the optimization.
  2. Execute the following SQL statements.
    ALTER TABLE addressinfo ADD COLUMN code_temp TEXT ENCODING NONE;
    UPDATE addressinfo SET code_temp = code;
    ALTER TABLE addressinfo DROP COLUMN code;
    ALTER TABLE addressinfo RENAME COLUMN code_temp TO code;

Changing the table is this way, queries using 'LIKE' should now run in approximately 20 milliseconds. Please note that the effective runtime will be very low due to the small number of records, but a few milliseconds are required by the parser.

Best Regards, Candido

jieguolove commented 1 year ago

It is indeed faster to change text encoding to none. image

but Under what circumstances should text encoding dict be changed to none? Since none is better, why is none not selected by default when creating table fields? Does none have any shortcomings?

jieguolove commented 1 year ago

The bug problem mentioned earlier was solved in master branch, but we tried to compile and test the latest code and found that bug didn't fix it??? image

heavysql> \version
HeavyDB Server Version: 7.1.0dev-20231011-72c90bc290

heavysql> \d addressinfo1
CREATE TABLE ADDRESSINFO1 (
  id TEXT ENCODING DICT(32),
  ods_id TEXT ENCODING DICT(32),
  code TEXT ENCODING DICT(32),
  province TEXT ENCODING DICT(32),
  city TEXT ENCODING DICT(32),
  regionid TEXT ENCODING DICT(32),
  region_name TEXT ENCODING DICT(32),
  street_id TEXT ENCODING DICT(32),
  street_name TEXT ENCODING DICT(32),
  road_id TEXT ENCODING DICT(32),
  road_name TEXT ENCODING DICT(32),
  number_plate_id TEXT ENCODING DICT(32),
  number_plate_name TEXT ENCODING DICT(32),
  cpn_id TEXT ENCODING DICT(32),
  cpn_code TEXT ENCODING DICT(32),
  community_name TEXT ENCODING DICT(32),
  premises_network TEXT ENCODING DICT(32),
  building_code TEXT ENCODING DICT(32),
  building TEXT ENCODING DICT(32),
  unit_id TEXT ENCODING DICT(32),
  unit TEXT ENCODING DICT(32),
  layer_id TEXT ENCODING DICT(32),
  layer TEXT ENCODING DICT(32),
  room_id TEXT ENCODING DICT(32),
  room TEXT ENCODING DICT(32),
  rms_code TEXT ENCODING DICT(32),
  substation_id TEXT ENCODING DICT(32),
  simple_mark TEXT ENCODING DICT(32),
  inner_symbol_id TEXT ENCODING DICT(32),
  if_inner TEXT ENCODING DICT(32),
  switch_no TEXT ENCODING DICT(32),
  type INTEGER,
  queryindex TEXT ENCODING DICT(32),
  remark TEXT ENCODING DICT(32),
  alias_name TEXT ENCODING DICT(32),
  if_relate TEXT ENCODING DICT(32),
  addresstype TEXT ENCODING DICT(32),
  source TEXT ENCODING DICT(32),
  create_time TIMESTAMP(0),
  fldtag TEXT ENCODING DICT(32),
  original_company TEXT ENCODING DICT(32),
  useremark TEXT ENCODING DICT(32),
  village_code TEXT ENCODING DICT(32),
  uptimestamp_time TIMESTAMP(0),
  create_op TEXT ENCODING DICT(32));
heavysql> \memory_summary
HeavyDB Server CPU Memory Summary:
            MAX            USE      ALLOCATED           FREE
   206240.20 MB     5085.15 MB     8192.00 MB     3106.85 MB

HeavyDB Server GPU Memory Summary:
[GPU]            MAX            USE      ALLOCATED           FREE
  [0]    39986.00 MB     1303.73 MB     4096.00 MB     2792.27 MB

heavysql> select ID, code from addressinfo where regionid = '210102' and code like '%南四经街104-2号2-1-2%';
IDbcode
....-2号2-1-2
1 rows returned.
Execution time: 31 ms, Total time: 33 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%南四经街104-2号2-1-2%';
IDbcode
....-2号2-1-2
1 rows returned.
Execution time: 920 ms, Total time: 922 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%南四经街104-2号2-1-3%';
IDbcode
....-2号2-1-3
1 rows returned.
Execution time: 838 ms, Total time: 840 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%南四经街104-2号2-1-2%';
IDbcode
....-2号2-1-2
1 rows returned.
Execution time: 23 ms, Total time: 25 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%南四经街104-2号2-1-4%';
IDbcode
....-2号2-1-4
1 rows returned.
Execution time: 768 ms, Total time: 770 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%南四经街104-2号2-1-5%';
No rows returned.
Execution time: 744 ms, Total time: 745 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%南四经街104-2号2-1-2%';
IDbcode
....-.-1-2
1 rows returned.
Execution time: 23 ms, Total time: 25 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%南四经街104-2号2-1%';
IDbcode
....-4
4 rows returned.
Execution time: 794 ms, Total time: 796 ms

heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%南四经街104-2号2-2%';
IDbcode
....-4
4 rows returned.
Execution time: 818 ms, Total time: 821 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code like '%南四经街104-2号2-3%';
IDbcode
....4
4 rows returned.
Execution time: 824 ms, Total time: 826 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code = '南四经街104-2号2-1-2';
IDbcode
....1 rows returned.
Execution time: 594 ms, Total time: 596 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code = '南四经街104-2号2-1-2';
IDbcode
....-2号2-1-2
1 rows returned.
Execution time: 19 ms, Total time: 21 ms
heavysql> select ID, code from addressinfo1 where regionid = '210102' and code = '南四经街104-2号2-1-1';
IDbcode
......
1 rows returned.
Execution time: 17 ms, Total time: 19 ms
cdessanti commented 1 year ago

In response to your questions:

The bug problem mentioned earlier was solved in the master branch, but we tried to compile and test the latest code and found that the bug didn't fix it???

It's included in the master branch of our internal repository, waiting to be included in an upcoming software release.

When should text encoding dictionary be changed to none?

Text encoding with a dictionary is generally preferable for several reasons:

When is NONE encoding preferred?

DICTIONARY encoding is generally less suitable in the following scenarios:

I hope this helps clear up the considerations for choosing between DICTIONARY and NONE encoding for text data. If you have any more questions or need additional information, please don't hesitate to ask.

Candido

jieguolove commented 1 year ago

The answer is very very clear, Thank you very very much!

In response to your questions:

The bug problem mentioned earlier was solved in the master branch, but we tried to compile and test the latest code and found that the bug didn't fix it???

It's included in the master branch of our internal repository, waiting to be included in an upcoming software release.

When should text encoding dictionary be changed to none?

Text encoding with a dictionary is generally preferable for several reasons:

  • Memory Efficiency: Dictionary encoding typically consumes less memory than NONE encoding. For example, consider a table with a billion records and a "Country" field. Assuming there are fewer than 255 unique countries worldwide, with dictionary encoding, only one byte is required to store each country. This results in significant memory and disk savings.
  • Grouping Efficiency: Depending on the hash algorithm used, the index value can be employed for grouping without additional hashing, enhancing performance.
  • Query Efficiency: Only a set of distinct values needs to be scanned when used in WHERE conditions. Operations like equality checks, pattern matching (e.g., LIKE), and others are performed on the dictionary, reducing the search space from a billion to just a few hundred strings. Subsequent matching indexes are then compared as numbers, improving query performance.

When is NONE encoding preferred?

DICTIONARY encoding is generally less suitable in the following scenarios:

  • High Cardinality: When the cardinality of the TEXT in the tables is exceptionally high, dictionary encoding may become an overhead rather than an advantage.
  • Large Text Fields: If the TEXT content in the fields exceeds 32K in size, using NONE encoding is recommended, as dictionary encoding may not be practical in such cases.

I hope this helps clear up the considerations for choosing between DICTIONARY and NONE encoding for text data. If you have any more questions or need additional information, please don't hesitate to ask.

Candido