skale-me / node-parquet

NodeJS module to access Apache Parquet format files
Apache License 2.0

Is it possible to stream an upload directly to AWS S3? #61

Open rkbsoftsolutions opened 4 years ago

rkbsoftsolutions commented 4 years ago

I have a very large amount of NoSQL data. I want to read it as a stream, pass the schema and the stream, and upload the result to S3 as a parquet file. Because of the volume, the data can't be stored locally, so I don't want to buffer the file in memory or on disk. Please advise.

mazki555 commented 3 years ago

just for info, if someone reads this issue:

You can't do this with parquet. It's not specific to this library; it's a consequence of the format's structure, which requires you to know everything as you build the parquet file (in memory). It also depends on the parquet writer, and I don't know of any parquet writer that can build the file in chunks.

Parquet compresses data as you write to the library, but you need extra memory for reading your records and writing them to parquet.

Parquet's compression ratio is very high, so if you write a 512 MB parquet file it means you are processing a lot of data!

In theory, you only need enough memory for the parquet file itself plus your buffered records, but any parquet writer has some memory overhead of its own, so you need to take this into account.
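To make the memory cost concrete, here is a minimal sketch using this library's writer (API as shown in the repo's README; the schema and rows here are hypothetical). Note that every record is materialized in memory before `write()` is called:

```js
const parquet = require('node-parquet');

// Hypothetical two-column schema.
const schema = {
  name:  { type: 'byte_array' },
  price: { type: 'double' },
};

// All rows are buffered in memory up front; this, plus the writer's own
// overhead, is the memory cost discussed above.
const rows = [
  ['apples', 2.6],
  ['oranges', 2.7],
];

const writer = new parquet.ParquetWriter('/tmp/fruits.parquet', schema, { compression: 'snappy' });
writer.write(rows);
writer.close();
```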

In general, writing parquet files needs a lot of memory, but the benefits of using parquet are substantial (read performance, reduced network traffic, small file sizes).

If you are really tight on memory / budget, you should consider moving to csv+gzip / avro / json-lines, which you can stream in chunks with a very low memory footprint.
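For the record, a minimal sketch of that low-memory alternative in Node (json-lines piped through gzip); the record source is a stand-in for a real NoSQL cursor. Memory stays bounded because each record is serialized and flushed as it arrives:

```js
const fs = require('fs');
const zlib = require('zlib');
const { Readable, Transform } = require('stream');

// Stand-in for a real NoSQL cursor: any object-mode Readable works here.
const records = Readable.from([{ id: 1 }, { id: 2 }]);

// Serialize one record per line (newline-delimited JSON).
const toJsonLines = new Transform({
  writableObjectMode: true,
  transform(record, _enc, cb) {
    cb(null, JSON.stringify(record) + '\n');
  },
});

records
  .pipe(toJsonLines)
  .pipe(zlib.createGzip())
  .pipe(fs.createWriteStream('data.jsonl.gz'));
```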

SimonJang commented 3 years ago

I don't think it's possible to stream to S3 unless you know the exact file size when calling the S3 API, which is something you don't really know when mutating the source data in your stream.

mazki555 commented 3 years ago

@SimonJang are you sure? It says here there is no problem uploading streams to S3; you don't need to know the file size ahead of time.
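For what it's worth, the AWS SDK for JavaScript (v2) managed uploader accepts a `Body` stream of unknown length and performs a multipart upload under the hood. A minimal sketch (bucket and key are hypothetical):

```js
const AWS = require('aws-sdk');
const { PassThrough, Readable } = require('stream');

const s3 = new AWS.S3();
const body = new PassThrough();

// No ContentLength is declared; s3.upload() chunks the stream into parts.
const done = s3.upload({
  Bucket: 'my-bucket',     // hypothetical
  Key: 'data.jsonl.gz',    // hypothetical
  Body: body,
}).promise();

// Pipe any byte stream into `body`, e.g. the gzip pipeline sketched above.
Readable.from(['hello ', 'world\n']).pipe(body);

done.then(() => console.log('upload complete'));
```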

julien-c commented 3 years ago

What about (streaming) reading over HTTP? Is this supported?