postgrespro / vops

Other
166 stars 22 forks source link

Performance of populate function #11

Closed Mad-Apes closed 12 months ago

Mad-Apes commented 6 years ago

Hello, Through recent practice, I found out that query performance has greatly improved, but populate function consumes a lot of time. The populate function will take more than 200 seconds when normal table contains 2000,000 rows. So, what should I do to make it better?

knizhnik commented 6 years ago

In principle, you can populate data in parallel by spawning several concurrent populate statements with different predicates:

select populate(destination:='vops_table_gb'::regclass, source:='std_table'::regclass, sort := 'column_int2,column_int4', predicate:='column_int2 >= 0 and column_int2 < 1');
select populate(destination:='vops_table_gb'::regclass, source:='std_table'::regclass, sort := 'column_int2,column_int4', predicate:='column_int2 >= 1 and column_int2 < 2');
...

It will allow to load all CPU cores and perform import up to N times faster. But please take in account that performing parallel insert will violate sequential order of records in vops_table_gb table which may have negative impact on performance of subsequent select queries. This was one of the reasons why I have not implemented in VOPS parallel load.