ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License
347 stars 175 forks source link

ParquetFileWriter doesn't exist, and how to write to stream #22

Closed deitch closed 6 years ago

deitch commented 6 years ago

According to the docs, you start with:

// create new ParquetWriter that writes to 'fruits.parquet`
var writer = new parquet.ParquetFileWriter(schema, 'fruits.parquet');

Yet, no such parquet.ParquetFileWrite exists:

> p = require('parquetjs')
{ ParquetEnvelopeReader: [Function: ParquetEnvelopeReader],
  ParquetReader: [Function: ParquetReader],
  ParquetEnvelopeWriter: [Function: ParquetEnvelopeWriter],
  ParquetWriter: [Function: ParquetWriter],
  ParquetSchema: [Function: ParquetSchema],
  ParquetShredder: { shredRecord: [Function], materializeRecords: [Function] } }
> p.ParquetFileWriter
undefined

As it is, I came across it looking for a way to get it to write directly to a stream, rather than to a file.

asmuth commented 6 years ago

Yes we need to update the README! The class is just called 'ParquetWriter' now, you can see an example using it here: https://github.com/ironSource/parquetjs/blob/master/examples/writer.js#L16

To write directly to a stream with the current version, you have to first construct a ParquetEnvelopeWriter for your stream (see https://github.com/ironSource/parquetjs/pull/19/files#diff-a3be12e47bf4eaa0075c39ab068184b8R155), and then create a ParquetWriter for the ParquetEnvelopeWriter, similar to the way the openFile method does it https://github.com/ironSource/parquetjs/blob/master/lib/writer.js#L42.

Still since this is already the second time the issue came up, I will also add a convenience method to ParquetWriter to open a new writer for a stream.

deitch commented 6 years ago

I will also add a convenience method to ParquetWriter to open a new writer for a stream

Thanks; then the file writer is just a special convenience case. I look forward to it.

While I am at it, not an issue per se, but the README says pure JavaScript implementation, but it does look like it has a dependency on native compiled libs:

$ npm i parquetjs

> ws@0.4.32 install /private/tmp/node_modules/ws
> (node-gyp rebuild 2> builderror.log) || (exit 0)

  CXX(target) Release/obj.target/bufferutil/src/bufferutil.o

> lzo@0.4.1 install /private/tmp/node_modules/lzo
> node-gyp rebuild
asmuth commented 6 years ago

I added the ParquetWriter::openStream method to master (https://github.com/ironSource/parquetjs/blob/master/lib/writer.js#L52). We'll release another version to NPM later after some testing.

Regarding the "pure JavaScript" thing I suppose it's a matter of definition. What we tried to communicate with that statement is that the parquetjs library does not somehow wrap the Apache Java or C++ implementation of parquet, but is a from-scratch implementation in JavaScript.

The lzo dependency specifically is problematic for some reasons but is entirely optional; hopefully one day LZO will be added to the node.js core api (like zlib). Until then we could add a more explicit option to run without lzo support for folks that want use parquetjs without node.js/FFI (but it should also work now as long as you don't explicitly request LZO compression).

vweevers commented 6 years ago

Passing by with a suggestion: "pure Node.js implementation".

asmuth commented 6 years ago

Sounds good to me!

deitch commented 6 years ago

I added the ParquetWriter::openStream method to master (https://github.com/ironSource/parquetjs/blob/master/lib/writer.js#L52). We'll release another version to NPM later after some testing.

Thanks. I also saw you updated the README to do ParquetWriter.openFile().

  1. Does it have a target version?
  2. Do you mind updating the README to include ParquetWriter.openStream()?

I suppose it's a matter of definition

Yes, and whole parquetjs is pure JS, it has a dependency that is not. Fortunately, it is reasonable.

Still and all, I like this. Thanks again.