ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

Help Required: What is the best way of writing 10000 records at once? #86

Open NaveenShanmugam opened 5 years ago

mytototo commented 5 years ago

I'm trying to accomplish a similar thing. Does anyone have an idea?

Thanks.

asmuth commented 5 years ago

I think the best strategy would be to create a ParquetWriter and then repeatedly call appendRow on it.

The writer has a setting called rowGroupSize that controls how many rows are buffered in memory before a disk flush is performed. See https://github.com/ironSource/parquetjs#buffering--row-group-size and https://github.com/ironSource/parquetjs/blob/master/lib/writer.js#L96

The best value for the row group size depends on your input data, so it may be best to choose it experimentally. Too small a row group size reduces compression efficiency and increases the file size, while a larger value increases the writer's peak memory usage. I would start with something like 8192 and see how it goes.
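
To make that concrete, here is a minimal sketch of the appendRow loop. The schema and field names are hypothetical; `openFile`, `setRowGroupSize`, `appendRow`, and `close` are the writer methods described in the README:

```js
const parquet = require('parquetjs');

// Hypothetical schema; substitute your own columns.
const schema = new parquet.ParquetSchema({
  id:    { type: 'INT64' },
  name:  { type: 'UTF8' },
  price: { type: 'DOUBLE' }
});

async function writeRecords(records) {
  const writer = await parquet.ParquetWriter.openFile(schema, 'out.parquet');
  writer.setRowGroupSize(8192); // rows buffered in memory before each flush

  for (const record of records) {
    await writer.appendRow(record); // flushes a row group whenever the buffer fills
  }

  await writer.close(); // writes the final row group and the file footer
}
```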

dani-pisca commented 3 years ago

However, the flush seems to be performed by the ParquetWriter but not by the underlying WriteStream. The Parquet file on local disk stays at 1 KB until the stream closes (I think this is due to Node.js write stream behaviour). Has anybody found a workaround?
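
One possible workaround, assuming multiple output files are acceptable: close the writer every N rows so each finished chunk is fully flushed to disk, then rotate to a new file. This is a generic file-rotation sketch, not something from the library docs:

```js
const parquet = require('parquetjs');

// Hypothetical rotation: closing the writer every `chunkSize` rows forces
// the data and footer of each finished part onto disk.
async function writeInChunks(schema, records, chunkSize = 10000) {
  let writer = null;
  let part = 0;

  for (let i = 0; i < records.length; i++) {
    if (i % chunkSize === 0) {
      if (writer) await writer.close();  // flush the completed part to disk
      writer = await parquet.ParquetWriter.openFile(schema, `out-${part++}.parquet`);
    }
    await writer.appendRow(records[i]);
  }

  if (writer) await writer.close();
}
```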