postgrespro / vops


Error on loading data into VOPS table: row is too big #9

Closed: Mad-Apes closed this issue 6 years ago

Mad-Apes commented 6 years ago


Hello,

When I load data into a VOPS table using the VOPS populate(...) function or the import(...) function, an error occurs:

VOPS=# select populate(destination:='vops_table'::regclass, source:='std_table'::regclass);
ERROR:  row is too big: size 16392, maximum size 8160

VOPS=# select import(destination := 'vops_table'::regclass, csv_path := '/data/vops_data/std_table.csv', separator := '|');
ERROR:  row is too big: size 16392, maximum size 8160
CONTEXT:  IMPORT vops_table, line 65, column 63

'std_table' is a regular table with 64 columns. All of the columns are of integer (int2, int4, int8) or floating-point (float4, float8) types.
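(Back-of-envelope estimate: the outputs later in this thread show that VOPS stores each column as a tile of 64 values, so a single vops_int4 column already occupies about 64 × 4 = 256 bytes. With 64 such columns a row is roughly 64 × 256 ≈ 16 kB, which is in line with the reported size of 16392 and about twice the 8160-byte limit of an 8 kB page.)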

I tried to solve the problem, but none of the methods I tried worked.

How can this problem be solved?

Best regards, and thank you.

knizhnik commented 6 years ago

VOPS types are declared with PLAIN storage, so Postgres is not able to move this data to TOAST storage. I can certainly change the storage type to EXTENDED, but that would require changing all serialization routines for these types (because they would then have a varying-length rather than a fixed-length representation). More importantly, moving data to some other (TOAST) page would significantly slow down access and would, to some extent, defeat the idea of VOPS vector processing.
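(For reference, the storage class assigned to each column can be inspected with a generic Postgres catalog query; 'p' means PLAIN, 'x' means EXTENDED:)

-- Show the storage mode of every user column of vops_table.
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'vops_table'::regclass
  AND attnum > 0
  AND NOT attisdropped;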

Do you really need all 64 columns in your queries? If not, my recommendation is to create several different projections, each with a smaller subset of the attributes. That can also significantly increase the speed of queries involving just a small subset of columns.
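(A minimal sketch of such projections. The table and column names below are hypothetical, and populate() is assumed to match destination columns to source columns by name, as in the calls above:)

-- Two narrow projections instead of one 64-column VOPS table.
CREATE TABLE vops_proj_a (id vops_int4, val1 vops_float4, val2 vops_float4);
CREATE TABLE vops_proj_b (id vops_int4, val3 vops_int8, val4 vops_int8);

-- Load each projection from the same source table.
SELECT populate(destination := 'vops_proj_a'::regclass, source := 'std_table'::regclass);
SELECT populate(destination := 'vops_proj_b'::regclass, source := 'std_table'::regclass);

Queries that touch only val1 and val2 then read vops_proj_a and never pay for the other columns.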

Another solution is to increase the Postgres page size (from the default 8kB). That requires rebuilding Postgres from source, and most likely you will not want to do that.
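(The page size a server was built with can be checked through the read-only block_size setting:)

SELECT current_setting('block_size');  -- 8192 on a default build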

And if none of the suggested alternatives is applicable in your case, I can provide you with a version of VOPS where the VOPS types are defined with the EXTENDED storage type, so that you can check the performance in that case.

Mad-Apes commented 6 years ago

Thanks for the reply. We have 100+ regular tables, and each table contains tens of columns. Most of the columns are used in analytic queries.

Following your suggestion, I split the large VOPS table into several small VOPS tables with smaller subsets of attributes, and I used the VOPS FDW: I created FDW tables over the small VOPS tables and loaded the data into the VOPS tables, and the error no longer occurred. Then I analyzed the FDW tables. I execute a query on every FDW table and combine the results into one with an inner join. It works.

One problem still bothers me, though. There are more than 500,000 rows in the table, yet the execution time of the query with VOPS is about the same as with the normal table. Maybe I did something wrong somewhere.

My steps:

  1. create VOPS tables with smaller subsets of attributes
  2. load the data with the VOPS populate function
  3. create FDW tables over the VOPS tables
  4. analyze the FDW tables
  5. execute the query shown below

Here is the query (the ". . ." parts elide similar columns):

SELECT *
FROM (
    SELECT MAX(fdw_table_1.id) AS id
         . . .
         , MIN(fdw_table_1.time) AS time
         , AVG(fdw_table_1.received_num) AS enb_received_power
         . . .
         , MAX(fdw_table_1.sequence_id) AS sequence_id
         , count(*) AS xdr_count
         , cast(null as bigint) AS align
    FROM vops_fdw_table_1 fdw_table_1
    GROUP BY fdw_table_1.id, fdw_table_1.align
) AS fdw_1
INNER JOIN (
    SELECT MAX(fdw_table_2.sequence_id)
         . . .
    FROM vops_fdw_table_2 fdw_table_2
    GROUP BY fdw_table_2.id, fdw_table_2.align
) fdw_2 ON fdw_1.sequence_id = fdw_2.sequence_id

If anything here is wrong, please point it out. Thank you very much.

I would be very grateful if you could provide the version of VOPS where the VOPS types are defined with the EXTENDED storage type.

knizhnik commented 6 years ago

If you always perform grouping by sequence_id, then it is desirable to leave sequence_id as a scalar type (do not replace it with a VOPS type). In this case, grouping is done by Postgres in the standard way, while the aggregates are calculated using VOPS functions. Otherwise, the VOPS FDW has to extract horizontal tuples from the VOPS types, after which grouping and aggregation are done by the standard Postgres executor nodes. So there is no performance benefit, only the extra overhead of unnesting VOPS tiles into horizontal tuples.
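(A minimal sketch of that layout, with hypothetical table and column names: the grouping key stays scalar while the measure uses a VOPS tile type, and the table is populated sorted by the key so tiles do not cross group boundaries:)

-- sequence_id stays a plain int8, so Postgres groups on it directly;
-- received_num is a VOPS tile, so its aggregate is vectorized.
CREATE TABLE vops_proj_mixed (sequence_id int8, received_num vops_float8);

SELECT populate(destination := 'vops_proj_mixed'::regclass,
                source := 'std_table'::regclass,
                sort := 'sequence_id');

SELECT sequence_id, sum(received_num)
FROM vops_proj_mixed
GROUP BY sequence_id;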

Also, I do not completely understand the idea of grouping by the dummy align field, which actually represents a null value.

knizhnik commented 6 years ago

Please find the version of VOPS where the VOPS types have the EXTENDED storage class in the "extended" branch in Git.

Mad-Apes commented 6 years ago

First of all, align is a real field in the tables vops_fdw_table_1 and vops_fdw_table_2, and that field is not null in these two tables. During the statistical process I also needed a bigint value that is null, and I gave it the alias align. I sincerely apologize for the confusion.

OK. Next, I will do my work both ways and compare the performance between them.

Thanks again

Mad-Apes commented 6 years ago

Hello,

I have run into a new problem: data is lost after running the populate function.

  1. Select from the source table:

VOPS=# SELECT column_int2, column_int4, column_int8 FROM std_table WHERE column_int2 = 1 AND column_int4 = 0;
 column_int2 | column_int4 | column_int8 
-------------+-------------+-------------
           1 |           0 |         762
           1 |           0 |         722
           1 |           0 |         497
           1 |           0 |           4
           1 |           0 |         472
           1 |           0 |         312
           1 |           0 |         504
           1 |           0 |         808
(8 rows)

  2. Populate the VOPS table:

VOPS=# select populate(destination:='vops_table_gb'::regclass, source:='std_table'::regclass, sort := 'column_int2,column_int4');

  3. Select from the VOPS table:

VOPS=# SELECT column_int2, column_int4, column_int8 FROM vops_table_gb WHERE column_int2 = 1 AND column_int4 = 0;
 column_int2 | column_int4 |                                 column_int8
-------------+-------------+-----------------------------------------------------------------------------
           1 |           0 | {504,497,762,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,}
(1 row)

'column_int2' and 'column_int4' are scalar types in vops_table_gb.

knizhnik commented 6 years ago

Sorry, I cannot reproduce the problem:

postgres=# create table std_table(column_int2 int2, column_int4 int4, column_int8 int8);
CREATE TABLE
postgres=# insert into std_table values (1,0,762),(1,0,722),(1,0,497),(1,0,4),(1,0,472),(1,0,312),(1,0,504),(1,0,808);
INSERT 0 8
postgres=# SELECT column_int2, column_int4, column_int8 FROM std_table WHERE column_int2 = 1 AND column_int4 = 0;
 column_int2 | column_int4 | column_int8 
-------------+-------------+-------------
           1 |           0 |         762
           1 |           0 |         722
           1 |           0 |         497
           1 |           0 |           4
           1 |           0 |         472
           1 |           0 |         312
           1 |           0 |         504
           1 |           0 |         808
(8 rows)

postgres=# create table vops_table_gb(column_int2 int2, column_int4 int4, column_int8 vops_int8);
CREATE TABLE
postgres=#     select populate(destination:='vops_table_gb'::regclass, source:='std_table'::regclass, sort := 'column_int2,column_int4');
 populate 
----------
        8
(1 row)

postgres=# SELECT column_int2, column_int4, column_int8 FROM vops_table_gb WHERE column_int2 = 1 AND column_int4 = 0;
 column_int2 | column_int4 |                                 column_int8
-------------+-------------+------------------------------------------------------------------------------
           1 |           0 | {762,722,497,4,472,312,504,808,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,}
(1 row)

Can you send me an example reproducing the problem?

Mad-Apes commented 6 years ago

Hello, here is what I did:

1: VOPS=# create table std_table(pk_column int2, column_int2 int2, column_int4 int4, column_int8 int8);
CREATE TABLE

2: VOPS=# INSERT INTO std_table VALUES(generate_series(1,10), random()*5, random()*5, random()*100);
INSERT 0 10

3: VOPS=# create table vops_table_gb(pk_column vops_int2, column_int2 int2, column_int4 int4, column_int8 vops_int8);
CREATE TABLE

4: VOPS=# select populate(destination:='vops_table_gb'::regclass, source:='std_table'::regclass, sort := 'column_int2,column_int4');
 populate 
----------
       10
(1 row)

5: VOPS=# SELECT column_int2, column_int4, sum(column_int8) FROM std_table GROUP BY column_int2, column_int4 ORDER BY column_int2, column_int4;
 column_int2 | column_int4 | sum 
-------------+-------------+-----
           0 |           1 |  81
           1 |           1 |  69
           2 |           0 |  45
           2 |           5 | 166
           3 |           0 |  26
           3 |           2 |  94
           4 |           1 |  49
           5 |           2 |  46
(8 rows)

6: VOPS=# SELECT column_int2, column_int4, sum(column_int8) FROM vops_table_gb GROUP BY column_int2, column_int4 ORDER BY column_int2, column_int4;
 column_int2 | column_int4 | sum 
-------------+-------------+-----
           0 |           1 |  81
           1 |           1 |  69
           2 |           0 |  45
           2 |           5 |  40
           3 |           0 |  26
           3 |           2 |  94
           4 |           1 |  49
           5 |           2 |  46
(8 rows)

Look, the results differ where column_int2 = 2 and column_int4 = 5 (sum 166 vs 40). So I checked the data in each table:

1: VOPS=# SELECT * FROM std_table where column_int2=2 and column_int4=5;
 pk_column | column_int2 | column_int4 | column_int8 
-----------+-------------+-------------+-------------
         4 |           2 |           5 |          73
         6 |           2 |           5 |          53
        10 |           2 |           5 |          40
(3 rows)

2: VOPS=# SELECT * FROM vops_table_gb where column_int2=2 and column_int4=5;
                              pk_column                              | column_int2 | column_int4 |                             column_int8
---------------------------------------------------------------------+-------------+-------------+---------------------------------------------------------------------
 {10,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} |           2 |           5 | {40,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,}
(1 row)

Only the row with pk_column = 10 survived; the rows with pk_column = 4 and 6 were lost.

Mad-Apes commented 6 years ago

If you haven't reproduced the problem, you can try generating more rows to insert into std_table.
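(A hypothetical larger repro along those lines: with more rows, duplicate (column_int2, column_int4) pairs become much more frequent:)

-- Regenerate std_table with 10,000 rows instead of 10
-- (pk_column is int2, so stay within its range).
TRUNCATE std_table;
INSERT INTO std_table
SELECT g, (random()*5)::int2, (random()*5)::int4, (random()*100)::int8
FROM generate_series(1, 10000) g;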

knizhnik commented 6 years ago

Thank you very much for helping me to reproduce and investigate the problem. It was indeed a bug in the handling of duplicate values in vops_populate. Fixed in commits e91c75d and 6728123.
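(A minimal sanity check after upgrading to the fixed build, using the tables from the repro above: repopulate vops_table_gb and compare the per-group totals against the source table. The query should return zero rows:)

SELECT s.column_int2, s.column_int4, s.total AS std_total, v.total AS vops_total
FROM (SELECT column_int2, column_int4, sum(column_int8) AS total
      FROM std_table GROUP BY 1, 2) s
JOIN (SELECT column_int2, column_int4, sum(column_int8) AS total
      FROM vops_table_gb GROUP BY 1, 2) v
  USING (column_int2, column_int4)
WHERE s.total IS DISTINCT FROM v.total;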