rkbsoftsolutions opened this issue 4 years ago
Just for info, if someone reads this issue:
You can't do this with Parquet. It's not a limitation of this library; it's the structure of the format, which requires you to know everything as you build the Parquet file (in memory). It also depends on the Parquet writer, and I don't know of any Parquet writer that can build the file in chunks.
Parquet compresses data as you write to the library, but you need extra memory for reading your records and writing them to Parquet.
Parquet's compression ratio is very high, so if you write a 512 MB Parquet file it means you are going to process a lot of data!
In theory you only need enough memory for the Parquet file size plus your buffered records, but any Parquet writer has some memory overhead, so you need to take this into account.
In general, writing Parquet files needs a lot of memory, but the list of benefits of using Parquet is long (read performance, reduced network traffic, small file sizes).
If you are really tight on memory or budget, you should consider moving to csv+gzip / avro / csv / json-lines, which you can stream in chunks with a very low memory footprint.
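To make that alternative concrete, here is a rough sketch (nothing to do with this library) of streaming newline-delimited JSON through gzip into an S3 multipart upload with the v3 AWS SDK's @aws-sdk/lib-storage. The bucket, key, and record source are placeholders; back-pressure keeps only a small window of data in memory at any time.

```ts
// Sketch: stream json-lines + gzip into S3 with a roughly constant memory footprint.
// Bucket, key, and the record source are placeholders.
import { createGzip } from "zlib";
import { Readable } from "stream";
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";

async function uploadJsonLinesGzip(records: AsyncIterable<object>): Promise<void> {
  // Serialize each record as one JSON line.
  const lines = Readable.from(
    (async function* () {
      for await (const record of records) {
        yield JSON.stringify(record) + "\n";
      }
    })()
  );

  const upload = new Upload({
    client: new S3Client({}),
    params: {
      Bucket: "my-bucket",
      Key: "export/data.jsonl.gz",
      Body: lines.pipe(createGzip()), // gzip output is consumed as it is produced
    },
  });

  await upload.done();
}
```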
I don't think it's possible to stream to S3 unless you know the exact file size when calling the S3 API, which is something you don't really know when mutating the source data in your stream.
@SimonJang are you sure? It says here that there is no problem uploading streams to S3; you don't need to know the file size ahead of time...
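For what it's worth, @aws-sdk/lib-storage's Upload performs a multipart upload under the hood, so the total object size never has to be declared up front. A rough sketch (bucket and key are placeholders):

```ts
// Sketch: upload a stream of unknown length to S3 via multipart upload.
import { PassThrough } from "stream";
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";

async function uploadUnknownLengthStream(): Promise<void> {
  const body = new PassThrough(); // total length is unknown at this point

  const upload = new Upload({
    client: new S3Client({}),
    params: { Bucket: "my-bucket", Key: "unbounded-stream.bin", Body: body },
  });

  const done = upload.done(); // start consuming the stream right away

  // A producer can keep writing as data becomes available; parts are
  // flushed once they reach the part-size threshold (5 MB by default).
  body.write(Buffer.from("chunk 1\n"));
  body.write(Buffer.from("chunk 2\n"));
  body.end();

  await done;
}
```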
What about (streaming) reading over HTTP? Is this supported?
I have a very large amount of NoSQL data. I want to read the data as a stream, just pass the schema and the stream, and upload it to S3 as a Parquet file. Because of the amount of data I can't store it locally, so I don't want to hold the file in memory or write it to disk. Please advise.
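In case it helps, here is a rough sketch of that pipeline assuming a parquetjs-style API (ParquetSchema, ParquetWriter.openStream, appendRow); the schema fields, bucket, key, and record source are placeholders, and this library's API may differ. Note the caveat from the comments above: the writer still buffers each row group in memory before flushing it, so this is not a constant-memory pipeline the way csv+gzip is.

```ts
// Sketch only: pipe a parquetjs-style writer into an S3 multipart upload.
// The schema, bucket, key, and record source are placeholders.
import { PassThrough } from "stream";
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import parquet from "parquetjs"; // assumption: a parquetjs-style writer is available

async function streamToS3AsParquet(records: AsyncIterable<Record<string, unknown>>): Promise<void> {
  const schema = new parquet.ParquetSchema({
    id: { type: "UTF8" },
    value: { type: "DOUBLE" },
  });

  // Parquet bytes go into a PassThrough, which S3 consumes as a multipart upload.
  const body = new PassThrough();
  const upload = new Upload({
    client: new S3Client({}),
    params: { Bucket: "my-bucket", Key: "export/data.parquet", Body: body },
  });
  const uploadDone = upload.done(); // start consuming the stream right away

  const writer = await parquet.ParquetWriter.openStream(schema, body);
  for await (const record of records) {
    await writer.appendRow(record); // rows accumulate until a row group is flushed
  }
  await writer.close(); // writes the footer and ends the output stream
  await uploadDone;
}
```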