Open ruslanen opened 1 month ago
Hi @ruslanen, thanks for sharing the test results.
Just tested against a live table with 20 columns and over 500K rows. The resulting table in ClickHouse occupies approximately 125MB of disk space.
-- 1. CSV with column definitions
-- 19s, 578,735 rows
-- 18s, 578,736 rows
-- 18s, 578,736 rows
CREATE OR REPLACE TABLE ttt_pr engine=MergeTree ORDER BY tuple() AS
SELECT *
FROM url('http://<bridge server>/query?f=csv&q=%7B%7B+db.my-postgres%3A+SELECT+%2A+FROM+my_table+%7D%7D', CSVWithNames, '<column definitions>')
-- 2. CSV without column definitions - ClickHouse will send the same query twice to the bridge server
-- 24s, 578,738 rows
-- 26s, 578,739 rows
-- 25s, 578,743 rows
CREATE OR REPLACE TABLE ttt_pr engine=MergeTree ORDER BY tuple() AS
SELECT *
FROM url('http://<bridge server>/query?f=csv&q=%7B%7B+db.my-postgres%3A+SELECT+%2A+FROM+my_table+%7D%7D', CSVWithNames)
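The `q=` parameter in the URLs above is just the JDBCX query block percent-encoded. A quick sketch using Python's standard library (with the exact query string from the examples) shows how it is produced:

```python
from urllib.parse import quote_plus, unquote_plus

# The JDBCX query block embedded in the url() table function calls above
jdbcx_query = "{{ db.my-postgres: SELECT * FROM my_table }}"

# Percent-encode it for use as the q= parameter (spaces become '+')
encoded = quote_plus(jdbcx_query)
print(encoded)
# %7B%7B+db.my-postgres%3A+SELECT+%2A+FROM+my_table+%7D%7D

# Decoding restores the original block
assert unquote_plus(encoded) == jdbcx_query
```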
-- 3. ArrowStream with column definitions
-- 16s, 578,744 rows
-- 16s, 578,747 rows
-- 14s, 578,749 rows
CREATE OR REPLACE TABLE ttt_pr engine=MergeTree ORDER BY tuple() AS
SELECT *
FROM url('http://<bridge server>/query?f=arrows&q=%7B%7B+db.my-postgres%3A+SELECT+%2A+FROM+my_table+%7D%7D', ArrowStream, '<column definitions>')
-- 4. RowBinary
-- 16s, 578,766 rows
-- 16s, 578,766 rows
-- 17s, 578,766 rows
CREATE OR REPLACE TABLE ttt_pr engine=MergeTree ORDER BY tuple() AS
SELECT *
FROM jdbc('my-postgres', 'SELECT * FROM my_table')
-- 5. ArrowStream (similar to Q3)
-- 14s, 578,769 rows
-- 15s, 578,770 rows
-- 14s, 578,772 rows
CREATE OR REPLACE TABLE ttt_pr engine=MergeTree ORDER BY tuple() AS
SELECT *
FROM {{ bridge(mode=a,format=arrows): {{ db.my-postgres: SELECT * FROM my_table \}} }}
-- 6. ArrowStream(compressed using zstd)
-- 15s, 578,775 rows
-- 15s, 578,778 rows
-- 15s, 578,780 rows
CREATE OR REPLACE TABLE ttt_pr engine=MergeTree ORDER BY tuple() AS
SELECT *
FROM {{ bridge(mode=a,format=arrows,compression=zstd): {{ db.my-postgres: SELECT * FROM my_table \}} }}
-- 7. CSV (similar to Q1)
-- 18s, 578,780 rows
-- 19s, 578,780 rows
-- 18s, 578,780 rows
CREATE OR REPLACE TABLE ttt_pr engine=MergeTree ORDER BY tuple() AS
SELECT *
FROM {{ table.db.my-postgres: SELECT * FROM my_table }}
Basically, to achieve better performance:
1) specify column definitions when using the url table function; otherwise ClickHouse may issue the same query multiple times (once extra for schema inference), which is why Q2 is slower than Q1
2) prefer a binary data format, especially ArrowStream, if you don't have complex data types like arrays
3) consider compression for cross-datacenter queries
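To illustrate point 1, the `<column definitions>` placeholder in the earlier queries is just a ClickHouse structure string. A minimal sketch below uses made-up columns (`id`, `name`, `created_at` are assumptions, not the real schema of `my_table`):

```sql
-- Structure string supplied explicitly, so ClickHouse skips schema inference
-- and sends the query to the bridge server only once.
-- Column names and types here are hypothetical; use the real schema.
CREATE OR REPLACE TABLE ttt_pr ENGINE = MergeTree ORDER BY tuple() AS
SELECT *
FROM url('http://<bridge server>/query?f=csv&q=%7B%7B+db.my-postgres%3A+SELECT+%2A+FROM+my_table+%7D%7D',
         CSVWithNames,
         'id UInt32, name String, created_at DateTime')
```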
I'm not aware of the current status of clickhouse-jdbc-bridge support, as it ultimately depends on the company and community. I switched to JDBCX a while ago for several reasons.
Regarding your specific question: when it comes to dumping a table from PostgreSQL to ClickHouse, it's generally more efficient to use the native postgresql table function for optimal performance. However, if you need to access multiple data sources, perform data slicing, or require more runtime flexibility, JDBCX, Trino, or JDBC-Bridge may be necessary.
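For the pure PostgreSQL-to-ClickHouse dump case, a minimal sketch with ClickHouse's built-in postgresql table function would look like this (host, database, and credentials are placeholders):

```sql
-- Connection details are placeholders; fill in your own.
CREATE OR REPLACE TABLE ttt_pr ENGINE = MergeTree ORDER BY tuple() AS
SELECT *
FROM postgresql('<host>:5432', '<database>', 'my_table', '<user>', '<password>')
```

This bypasses the bridge entirely, so there is no intermediate serialization format to tune.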
Big thanks for your research!
Hi @zhicwu, I've made a simple comparison between different ways of creating a table from a remote PostgreSQL source:
The table contains ~170 columns, 100K rows, ~180 MB of data. jdbcx is slower than the others. Is it possible to tune jdbcx? Maybe by changing the parsing format from CSV to something else? And another question: what will happen with clickhouse-jdbc-bridge? It is out of support. Will jdbcx replace it? Thank you!