ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

Example code for a simple node stream + buffer #88

Open tobinbc opened 5 years ago

tobinbc commented 5 years ago

I couldn't see anything obvious in the readme about writing to streams, so I thought this would be useful for others too :)

kessler commented 4 years ago

@tobinbc Thank you very much for the example. We're working on a new version of parquetjs that will support streams out of the box. That rewrite will take several months, though, given how little time we have for this project at the moment, so I will gladly add your contribution to the docs in the meantime. Could you please sign the CLA here: https://github.com/ironSource/opensource-contributor-license-agreement

Thanks

dobesv commented 4 years ago

Is there any way I could help with that rewrite? Is it in a branch, and is there a process for contributing to it?

kessler commented 4 years ago

@dobesv we can certainly discuss it, can you email me?

dgendill commented 2 years ago

For those wanting to read parquet files from somewhere other than the file system, I've found that this fork provides a good example of extending the ParquetEnvelopeReader to read from different sources, namely from a Buffer, from S3, or from a URL.

https://github.com/LibertyDSNP/parquetjs/blob/v1.2.0/lib/reader.ts#L378

That code has deviated slightly from the original ParquetEnvelopeReader, which can be found here:

https://github.com/ironSource/parquetjs/blob/v0.8.0/lib/reader.js#L191

But the big idea is mostly the same. If you provide a read function, a close function, and the file size, you can create your own custom ParquetEnvelopeReader.

/*
readFn:   (offset: number, length: number) => Promise<Buffer>
closeFn:  () => void
fileSize: number
*/
const myReader = new ParquetEnvelopeReader(readFn, closeFn, fileStat.size);
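As a rough illustration of the three arguments above, here is a minimal sketch that serves reads from an in-memory Buffer. The helper name `makeBufferSource` is hypothetical and not part of either library, and the constructor call in the trailing comment assumes the `(readFn, closeFn, fileSize)` signature described above, so check it against the version you have installed.

```javascript
// Hypothetical helper (not part of parquetjs): builds a read function,
// a close function, and a file size on top of an in-memory Buffer.
function makeBufferSource(buffer) {
  // Resolves with `length` bytes starting at `offset`.
  const readFn = async (offset, length) => {
    if (offset < 0 || length < 0 || offset + length > buffer.length) {
      throw new RangeError(
        `read of ${length} bytes at offset ${offset} is outside the ${buffer.length}-byte buffer`
      );
    }
    // subarray returns a view over the same memory, so nothing is copied.
    return buffer.subarray(offset, offset + length);
  };

  // Nothing to release for an in-memory buffer.
  const closeFn = () => {};

  return { readFn, closeFn, fileSize: buffer.length };
}

// Assumed usage, given the constructor signature quoted in this thread:
//   const { readFn, closeFn, fileSize } = makeBufferSource(parquetBytes);
//   const envelopeReader = new ParquetEnvelopeReader(readFn, closeFn, fileSize);
```

The same shape should carry over to other backends: only `readFn` needs to change, for example to issue a ranged request instead of slicing a buffer.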

I have yet to implement this myself, but it seems reasonable that this could be extended to support a generic Node.js Readable stream, such as the one provided by BlobDownloadResponseParsed.readableStreamBody in @azure/storage-blob.
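One simple (if memory-hungry) way to approach the stream case is to drain the stream into a Buffer first and then serve reads from that Buffer. Parquet keeps its metadata in a footer at the end of the file, so a reader needs random access and cannot work from a forward-only stream directly. The sketch below is untested against either library; `streamToBuffer` and `readFnFromStream` are hypothetical names, and the result is meant for the `(readFn, closeFn, fileSize)` constructor described above.

```javascript
const { Readable } = require('stream');

// Hypothetical helper: drains any Node.js Readable into a single Buffer.
async function streamToBuffer(readable) {
  const chunks = [];
  for await (const chunk of readable) {
    // Streams may emit strings or Uint8Arrays; normalise them to Buffers.
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}

// Hypothetical helper: buffers the whole stream, then exposes random-access reads.
async function readFnFromStream(readable) {
  const buffer = await streamToBuffer(readable);
  return {
    readFn: async (offset, length) => buffer.subarray(offset, offset + length),
    closeFn: () => {},
    fileSize: buffer.length,
  };
}

// Example with a stand-in stream; a real one might come from a cloud SDK.
async function demo() {
  const stream = Readable.from([Buffer.from('PAR1'), Buffer.from('....'), Buffer.from('PAR1')]);
  const source = await readFnFromStream(stream);
  const tail = await source.readFn(source.fileSize - 4, 4);
  return tail.toString(); // the last four bytes of the buffered stream
}
```

For large files this holds the whole object in memory, so a backend that supports ranged reads is likely a better fit when one is available.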

Being able to use a generic ReadableStream would also resolve the following issue and open up the possibility of interfacing with other cloud services: https://github.com/ironSource/parquetjs/issues/110