jdbcx / jdbcx

JDBCX: Extended JDBC driver for dynamic multi-language queries with optional bridge server for federated datasource connectivity.
https://jdbcx.github.io/
Apache License 2.0

Is it possible to speed up the processing? #66

Open · ruslanen opened this issue 1 month ago

ruslanen commented 1 month ago

Hi @zhicwu, I've made a simple comparison between different ways of creating a table from a remote PostgreSQL source:

create or replace table postgresql_jdbc_source engine = MergeTree order by tuple() as
select * from jdbc('jdbc:postgresql://ip:5432/microsoft_contoso_integration_100k?user=postgres&password=', 'select * from factofflinesalesdenormalized');
-- 0 rows in set. Elapsed: 8.700 sec.

create or replace table postgresql_native_source engine = MergeTree order by tuple() as
select * from postgresql('ip:5432','microsoft_contoso_integration_100k','factofflinesalesdenormalized','postgres','');
-- 0 rows in set. Elapsed: 3.144 sec.

create or replace table postgresql_jdbcx_source engine = MergeTree order by tuple() as
select *
from url('http://ip:8080/query?q=%7B%7Bdb%28url%3Djdbc%3Apostgresql%3A%2F%2Fip%3A5432%2Fmicrosoft_contoso_integration_100k%3Fuser%3Dpostgres%26password%%29%3Aselect%20%2A%20from%20factofflinesalesdenormalized%7D%7D', CSV)
-- 0 rows in set. Elapsed: 12.784 sec.
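(Editor's note: URL-decoded, the q parameter above reads roughly as the line below; this assumes the stray %% in the encoded string is a truncated %3D for the empty password value.)

-- decoded q parameter (reconstructed for readability, not part of the original test)
{{db(url=jdbc:postgresql://ip:5432/microsoft_contoso_integration_100k?user=postgres&password=):select * from factofflinesalesdenormalized}}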

The table contains ~170 columns, 100K rows, ~180 MB of data. jdbcx is slower than the others. Is it possible to tune jdbcx? Maybe by changing the parsing format from CSV to something else? And another question: what will happen to clickhouse-jdbc-bridge? It is no longer supported. Will jdbcx replace it? Thank you!

zhicwu commented 1 month ago

Hi @ruslanen, thanks for sharing the test results.

Just tested against a live table with 20 columns and over 500K rows. The resulting table in ClickHouse occupies approximately 125MB of disk space.

-- 1. CSV with column definitions
-- 19s, 578,735 rows
-- 18s, 578,736 rows
-- 18s, 578,736 rows
CREATE OR REPLACE TABLE ttt_pr engine=MergeTree ORDER BY tuple() AS
SELECT *
FROM url('http://<bridge server>/query?f=csv&q=%7B%7B+db.my-postgres%3A+SELECT+%2A+FROM+my_table+%7D%7D', CSVWithNames, '<column definitions>')

-- 2. CSV without column definitions - ClickHouse will send the same query twice to the bridge server
-- 24s, 578,738 rows
-- 26s, 578,739 rows
-- 25s, 578,743 rows
CREATE OR REPLACE TABLE ttt_pr engine=MergeTree ORDER BY tuple() AS
SELECT *
FROM url('http://<bridge server>/query?f=csv&q=%7B%7B+db.my-postgres%3A+SELECT+%2A+FROM+my_table+%7D%7D', CSVWithNames)

-- 3. ArrowStream with column definitions
-- 16s, 578,744 rows
-- 16s, 578,747 rows
-- 14s, 578,749 rows
CREATE OR REPLACE TABLE ttt_pr engine=MergeTree ORDER BY tuple() AS
SELECT *
FROM url('http://<bridge server>/query?f=arrows&q=%7B%7B+db.my-postgres%3A+SELECT+%2A+FROM+my_table+%7D%7D', ArrowStream, '<column definitions>')

-- 4. RowBinary
-- 16s, 578,766 rows
-- 16s, 578,766 rows
-- 17s, 578,766 rows
CREATE OR REPLACE TABLE ttt_pr engine=MergeTree ORDER BY tuple() AS
SELECT *
FROM jdbc('my-postgres', 'SELECT * FROM my_table')

-- 5. ArrowStream (similar to Q3)
-- 14s, 578,769 rows
-- 15s, 578,770 rows
-- 14s, 578,772 rows
CREATE OR REPLACE TABLE ttt_pr engine=MergeTree ORDER BY tuple() AS
SELECT *
FROM {{ bridge(mode=a,format=arrows): {{ db.my-postgres: SELECT * FROM my_table \}} }}

-- 6. ArrowStream (compressed using zstd)
-- 15s, 578,775 rows
-- 15s, 578,778 rows
-- 15s, 578,780 rows
CREATE OR REPLACE TABLE ttt_pr engine=MergeTree ORDER BY tuple() AS
SELECT *
FROM {{ bridge(mode=a,format=arrows,compression=zstd): {{ db.my-postgres: SELECT * FROM my_table \}} }}

-- 7. CSV (similar to Q1)
-- 18s, 578,780 rows
-- 19s, 578,780 rows
-- 18s, 578,780 rows
CREATE OR REPLACE TABLE ttt_pr engine=MergeTree ORDER BY tuple() AS
SELECT *
FROM {{ table.db.my-postgres: SELECT * FROM my_table }}

Basically, to achieve better performance:

1. Specify column definitions when using the url table function (sketched below); otherwise ClickHouse may trigger the same query multiple times, which is why Q2 is slower than Q1.
2. Consider a binary data format, especially ArrowStream, if you don't have complex data types like arrays.
3. Consider compression for cross-datacenter queries.
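To illustrate point 1 (compare Q1 and Q2 above), here is a minimal sketch where a hypothetical three-column schema stands in for the '<column definitions>' placeholder; the actual column list must match my_table:

-- Hypothetical column definitions; replace with the real schema of my_table.
-- Passing the structure as the third argument of url() lets ClickHouse skip
-- schema inference, so the query reaches the bridge server only once.
CREATE OR REPLACE TABLE ttt_pr engine=MergeTree ORDER BY tuple() AS
SELECT *
FROM url('http://<bridge server>/query?f=csv&q=%7B%7B+db.my-postgres%3A+SELECT+%2A+FROM+my_table+%7D%7D', CSVWithNames, 'id UInt64, name String, updated_at DateTime')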

I'm not aware of the current status of clickhouse-jdbc-bridge support, as that ultimately depends on the company and the community. I switched to JDBCX a while ago for several reasons.

Regarding your specific question: when dumping a table from PostgreSQL to ClickHouse, the native postgresql table function is generally the most efficient option. However, if you need to access multiple data sources, perform data slicing, or require more runtime flexibility, JDBCX, Trino, or JDBC-Bridge may be necessary.

ruslanen commented 1 month ago

Big thanks for your research!