hatarist / clickhouse-cli

A third-party client for the Clickhouse DBMS server.

[Feature Request] parallel bulk file import #11

Open · inkrement opened this issue 7 years ago

inkrement commented 7 years ago

I often have to import "huge" files, and it is recommended to import such files in parallel. This can be "easily" done with tools like GNU parallel, but in most cases it is not so easy to do right (for example, GNU parallel's --pipe option should be avoided because it is slow, and one should use pigz instead of gzip, since gzip does not use multiple cores).
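For reference, a rough sketch of that manual approach, assuming newline-delimited records and a hypothetical table named data; --pipepart is GNU parallel's faster alternative to --pipe:

$ pigz -dc data.csv.gz > data.csv
$ parallel --pipepart -a data.csv --block 100M -j 4 \
    "clickhouse-cli -q 'INSERT INTO data FORMAT CSV'"

--pipepart splits data.csv into roughly 100 MB blocks at line boundaries and pipes each block to the stdin of its own clickhouse-cli instance, four at a time.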

So I would really like to see a new option to simply import large files, with the program automatically using multiple connections or even decompressing files on the fly. This would also benefit platforms that support Python but not GNU parallel.

hatarist commented 7 years ago

> even uncompresses files on the fly

There's better! ClickHouse supports gzip and zlib/deflate compression for incoming data, so I added .gz file support: the client will send the gzipped data as-is.

$ clickhouse-cli -q 'INSERT INTO data FORMAT CSV' ~/data.csv.gz

I haven't tested (yet) how well it behaves with remote servers, though. I suppose it should send the data over the network faster.

About parallel support: I suppose for now you could just run multiple client instances. I don't see any reason to implement it right now, because internally I use requests, which doesn't play well with async/threaded code. I'm planning to migrate to aiohttp soon.
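A sketch of that workaround (the file names and the table are examples): split the input, then run several instances in the background:

$ split -l 1000000 data.csv part_
$ for f in part_*; do clickhouse-cli -q 'INSERT INTO data FORMAT CSV' "$f" & done; wait

Each part_* chunk gets its own client process and connection; wait blocks until all inserts have finished.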

inkrement commented 7 years ago

Ah, so is it possible to send the whole file directly to the server, with the server parsing/importing it directly? Or is it just another way of expressing "cat file.csv | clickhouse-cli ..."?

hatarist commented 7 years ago

Nope, it's not possible to send the file directly as, well, a file (as in multipart/form-data), since the server would treat that as a file to be uploaded into a temporary table.

clickhouse-cli reads the whole file regardless of how it was passed, whether via stdin (cat file.tsv | clickhouse-cli -q 'insert into table format tsv') or as an argument (clickhouse-cli -q 'insert into table format tsv' file.tsv), and sends its content to the server.

The server then parses it directly; clickhouse-cli doesn't parse anything, it just reads the file content and passes it along. As of 0.2.2 it can also read .gz archives: it reads the archive, doesn't unpack it, and sends it to the server with a note (a Content-Encoding: gzip header, actually) that the data is compressed. The server then unpacks, parses, and imports it.
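In HTTP terms, what the client does is roughly equivalent to the following (a sketch against ClickHouse's HTTP interface, assuming the default endpoint on localhost:8123; this is not clickhouse-cli's actual code):

$ curl -sS 'http://localhost:8123/?query=INSERT%20INTO%20data%20FORMAT%20CSV' \
    -H 'Content-Encoding: gzip' --data-binary @data.csv.gz

--data-binary @data.csv.gz sends the archive bytes untouched, and the Content-Encoding: gzip header is what tells the server to decompress the body before parsing it.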

inkrement commented 7 years ago

ah ok, I got it now.