corneliusroemer opened this issue 1 year ago
For now you can use parquet-fromcsv from arrow-rs to convert CSV/TSV files to Parquet. Support for compressed input files (`uncompressed`, `snappy`, `gzip`, `brotli`, `lz4`, `zstd`) was added only relatively recently: https://github.com/apache/arrow-rs/issues/3721
parquet-fromcsv is also useful in general to read compressed CSV/TSV files and convert them to Parquet, so you can use pl.scan_parquet on very big files.
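Once converted, the Parquet file can be queried lazily with Polars. A minimal sketch (the file name and column names are placeholders matching the example schema further down):

import polars as pl

# Lazily scan the converted Parquet file; only the columns and rows
# the query needs are actually read.
lf = pl.scan_parquet("fragments.parquet")

# Hypothetical query over the placeholder columns.
result = (
    lf.filter(pl.col("column_2") > 0)
      .group_by("column_1")
      .agg(pl.col("column_3").sum())
      .collect()
)
print(result)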
# Clone Rust arrow repo.
git clone https://github.com/apache/arrow-rs
# Build arrow CLI tools.
cargo build --release --features=cli
# Current list of Arrow CLI tools:
❯ file target/release/*|grep ELF
target/release/arrow-file-to-stream: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/arrow-json-integration-test: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/arrow-stream-to-file: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/flight-test-integration-client: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/flight-test-integration-server: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/gen: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/libparquet_derive.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, with debug_info, not stripped, too many notes (256)
target/release/parquet-concat: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-fromcsv: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-index: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-layout: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-read: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-rewrite: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-rowcount: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-schema: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-show-bloom-filter: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
# Use parquet-fromcsv to convert a CSV/TSV file to Parquet, which you can then use with Polars.
$ target/release/parquet-fromcsv --help
Binary to convert csv to Parquet
Usage: parquet-fromcsv [OPTIONS] --schema <SCHEMA> --input-file <INPUT_FILE> --output-file <OUTPUT_FILE>
Options:
  -s, --schema <SCHEMA>
          message schema for output Parquet
  -i, --input-file <INPUT_FILE>
          input CSV file
  -o, --output-file <OUTPUT_FILE>
          output Parquet file
  -f, --input-format <INPUT_FORMAT>
          input file format
          [default: csv]
          [possible values: csv, tsv]
  -b, --batch-size <BATCH_SIZE>
          batch size
          [env: PARQUET_FROM_CSV_BATCHSIZE=]
          [default: 1000]
  -h, --has-header
          has header
  -d, --delimiter <DELIMITER>
          field delimiter
          default value: when input_format==CSV: ',' when input_format==TSV: 'TAB'
  -r, --record-terminator <RECORD_TERMINATOR>
          record terminator
          [possible values: lf, crlf, cr]
  -e, --escape-char <ESCAPE_CHAR>
          escape character
  -q, --quote-char <QUOTE_CHAR>
          quote character
  -D, --double-quote <DOUBLE_QUOTE>
          double quote
          [possible values: true, false]
  -C, --csv-compression <CSV_COMPRESSION>
          compression mode of csv
          [default: UNCOMPRESSED]
  -c, --parquet-compression <PARQUET_COMPRESSION>
          compression mode of parquet
          [default: SNAPPY]
  -w, --writer-version <WRITER_VERSION>
          writer version
  -m, --max-row-group-size <MAX_ROW_GROUP_SIZE>
          max row group size
      --enable-bloom-filter <ENABLE_BLOOM_FILTER>
          whether to enable bloom filter writing
          [possible values: true, false]
      --help
          display usage help
  -V, --version
          Print version
# Example: reading a gzipped TSV file and converting it to Parquet.
$ parquet-fromcsv \
--schema fragments.schema \
--input-format tsv \
--csv-compression gzip \
--parquet-compression zstd \
--input-file fragments.raw.tsv.gz \
--output-file fragments.parquet
# To get the schema file:
# - Get a small uncompressed part of the compressed CSV/TSV file.
# - Read that part with Polars and write it to Parquet.
# - Extract the schema from the Parquet file with parquet-schema.
$ zcat fragments.raw.tsv.gz | head -n 10000 > /tmp/fragments.head10000.tsv
pl.read_csv("/tmp/fragments.head10000.tsv", separator="\t", has_header=False).write_parquet("/tmp/fragments.head1000.parquet")
$ parquet-schema /tmp/fragments.head10000.parquet | grep -A 1000000 '^message' > fragments.schema
$ cat fragments.schema
message root {
  OPTIONAL BYTE_ARRAY column_1 (STRING);
  OPTIONAL INT64 column_2;
  OPTIONAL INT64 column_3;
  OPTIONAL BYTE_ARRAY column_4 (STRING);
  OPTIONAL INT64 column_5;
}
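As an alternative to the zcat | head step, the sample can also be taken directly from Python. A minimal sketch, assuming the gzipped TSV from the example above and Python's standard gzip module:

import gzip
import polars as pl

# Read only the first 10000 rows straight from the gzipped TSV,
# then write the small Parquet file used for schema extraction.
with gzip.open("fragments.raw.tsv.gz", "rb") as f:
    df = pl.read_csv(f, separator="\t", has_header=False, n_rows=10000)

df.write_parquet("/tmp/fragments.head10000.parquet")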
Very useful, thanks!
Can we use this on a batch of files?
PS: Note that there's a typo in the number of zeros: it should consistently be 1000 or 10000.
Can we use this on a batch of files?
With GNU parallel it is quite easy to run a similar command with different arguments.
# Make a list of all files you want to process (`ls -1` or `cat file_list`)
# Run 4 conversions at the same time (`-j 4`), where `{}` is one input line
# (a `.tsv.gz` file here) and `{..}` is the same minus the `.tsv.gz` extension
# (with `--plus`, `{..}` strips the extension twice).
ls -1 *.tsv.gz \
| parallel -j 4 --plus \
parquet-fromcsv \
--schema fragments.schema \
--input-format tsv \
--csv-compression gzip \
--parquet-compression zstd \
--input-file {} \
--output-file {..}.parquet
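If GNU parallel is not available, the same fan-out can be sketched in Python; the worker count and paths mirror the example above and are illustrative:

import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def convert(tsv_gz: Path) -> None:
    # Strip the double ".tsv.gz" extension, mirroring parallel's {..}.
    out = tsv_gz.with_suffix("").with_suffix(".parquet")
    subprocess.run(
        [
            "parquet-fromcsv",
            "--schema", "fragments.schema",
            "--input-format", "tsv",
            "--csv-compression", "gzip",
            "--parquet-compression", "zstd",
            "--input-file", str(tsv_gz),
            "--output-file", str(out),
        ],
        check=True,
    )

# Run 4 conversions at a time, like `parallel -j 4`; threads suffice
# here because the real work happens in the child processes.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(convert, Path(".").glob("*.tsv.gz")))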
PS: Note that there's a typo in the number of zeros: it should consistently be 1000 or 10000.
Fixed
Problem description
I wish I could read zstd-compressed CSV files using read_csv, and ideally also scan_csv. See https://stackoverflow.com/questions/76417610/how-to-read-csv-a-zstd-compressed-file-using-python-polars/76420788#76420788
Possibly related: #9266
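Until such support lands, one workaround on the Python side is to decompress into memory and hand the bytes to read_csv. A minimal sketch, assuming the third-party zstandard package and a placeholder file name:

import io
import polars as pl
import zstandard

# Decompress the whole .csv.zst into memory, then parse the bytes.
# "data.csv.zst" is a placeholder path, not from the original thread.
with open("data.csv.zst", "rb") as f:
    raw = zstandard.ZstdDecompressor().stream_reader(f).read()

df = pl.read_csv(io.BytesIO(raw))
print(df.head())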