Open hugues31 opened 5 months ago
Add one or more options that let the user specify a strategy for splitting the dataset across multiple files.
For example, it would be great to have:
info:
  output_name: test
  output_format: parquet
  rows: 2_000_000
  files: 5
Each file would then contain approximately 2M / 5 = 400k rows.
We could have parameters like:
- files: described above
- target_size: split when the file exceeds a certain size threshold (for example, to test the optimal HDFS block size)
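
To make the two proposed strategies concrete, here is a minimal Python sketch of how the row counts per file could be computed. It is only an illustration under stated assumptions, not the project's implementation: the function names `rows_per_file` and `rows_per_file_by_size` are hypothetical, and so is the `bytes_per_row` estimate, which a real writer would have to measure or approximate. The size-based variant also caps each file at the threshold rather than splitting just after it is exceeded.

```python
def rows_per_file(total_rows: int, files: int) -> list[int]:
    """Split total_rows as evenly as possible across `files` files."""
    if files <= 0:
        raise ValueError("files must be a positive integer")
    base, extra = divmod(total_rows, files)
    # The first `extra` files receive one additional row each.
    return [base + 1 if i < extra else base for i in range(files)]


def rows_per_file_by_size(total_rows: int, bytes_per_row: int, target_size: int) -> list[int]:
    """Start a new file once the estimated size would pass target_size bytes.

    bytes_per_row is a hypothetical estimate of the serialized row size.
    """
    if bytes_per_row <= 0 or target_size <= 0:
        raise ValueError("bytes_per_row and target_size must be positive")
    # Every file holds at least one row, even if a single row exceeds the target.
    max_rows = max(1, target_size // bytes_per_row)
    full_files, remainder = divmod(total_rows, max_rows)
    return [max_rows] * full_files + ([remainder] if remainder else [])
```

With the example from this issue, `rows_per_file(2_000_000, 5)` gives five files of 400,000 rows each. When the total does not divide evenly, the leftover rows are spread over the first files, so no file differs from another by more than one row.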