skale-me / node-parquet

NodeJS module to access apache parquet format files
Apache License 2.0
57 stars 11 forks source link
node-parquet nodejs parquet skale-engine

Node-parquet

Build Status

Parquet is a columnar storage format available to any project in the Hadoop ecosystem. This nodejs module provides native bindings to the parquet functions from parquet-cpp.

A pure javascript parquet format driver (still in development) is also provided.

Build, install

The native c++ module has the following dependencies which must be installed before attempting to build:

Note that you need also python2 and c++/make toolchain for building NodeJS native addons.

The standard way of building and installing, provided that above depencies are met, is simply to run:

npm install

From 0.2.4 version, a command line tool called parquet is provided. It can be installed globally by running npm install -g. Note that if you install node-parquet this way, you can still use it as a dependency module in your local projects by linking (npm link node-parquet) which avoids the cost of recompiling the complete parquet-cpp library and its dependencies.

Otherwise, for developers to build directly from a github clone:

git clone https://github.com/mvertes/node-parquet.git
cd node-parquet
git submodule update --init --recursive
npm install [-g]

After install, the parquet-cpp build directory build_deps can be removed by running npm run clean, recovering all disk space taken for building parquet-cpp and its dependencies.

Usage

Command line tool

A command line tool parquet is provided. It's quite minimalist right now and needs to be improved:

Usage: parquet [options] <command> [<args>]

Command line tool to manipulate parquet files

Commands:
  cat file       Print file content on standard output
  head file      Print the first lines of file
  info file      Print file metadata

Options:
  -h,--help      Print this help text
  -V,--version   Print version and exits

Reading

The following example shows how to read a parquet file:

var parquet = require('node-parquet');

var reader = new parquet.ParquetReader('my_file.parquet');
console.log(reader.info());
console.log(reader.rows();
reader.close();

Writing

The following example shows how to write a parquet file:

var parquet = require('node-parquet');

var schema = {
  small_int: {type: 'int32', optional: true},
  big_int: {type: 'int64'},
  my_boolean: {type: 'bool'},
  name: {type: 'byte_array', optional: true},
};

var data = [
  [ 1, 23234, true, 'hello world'],
  [  , 1234, false, ],
];

var writer = new parquet.ParquetWriter('my_file.parquet', schema);
writer.write(data);
writer.close();

API reference

The API is not yet considered stable nor complete.

To use this module, one must require('node-parquet')

Class: parquet.ParquetReader

ParquetReader object performs read operations on a file in parquet format.

new parquet.ParquetReader(filename)

Construct a new parquet reader object.

Example:

const parquet = require('node-parquet');
const reader = new parquet.ParquetReader('./parquet_cpp_example.parquet');

reader.close()

Close the reader object.

reader.info()

Return an Object containing parquet file metadata. The object looks like:

{
  version: 0,
  createdBy: 'Apache parquet-cpp',
  rowGroups: 1,
  columns: 8,
  rows: 500
}

reader.read(column_number)

This is a low level function, it should not be used directly.

Read and return the next element in the column indicated by column_number.

In the case of a non-nested column, a basic value (Boolean, Number, String or Buffer) is returned, otherwise, an array of 3 elemnents is returned, where a[0] is the parquet definition level, a[1] the parquet repetition level, and a[2] the basic value. Definition and repetition levels are useful to reconstruct rows of composite, possibly sparse records with nested columns.

reader.rows([nb_rows])

Return an Array of rows, where each row is itself an Array of column elements.

Class: parquet.ParquetWriter

ParquetWriter object implements write operation on a parquet file.

new parquet.ParquetWriter(filename, schema, [compression])

Construct a new parquet writer object.

For example, considering the following object: { name: "foo", content: [ 1, 2, 3] }, its descriptor schema is:

const schema = {
  name: { type: "string" },
  content: {
    type: "group",
    repeated: "true",
    schema: { i0: { type: "int32" } }
  }
};

writer.close()

Close a file opened for writing. Calling this method explicitely before exiting is mandatory to ensure that memory content is correctly written in the file.

writer.writeSync(rows)

Write the content of rows in the file opened by the writer. Data from rows will be dispatched into the separate parquet columns according to the schema specified in the contructor.

For example, considering the above nested schema, a write operation could be:

writer.write([
  [ "foo", [ 1, 2, 3] ],
  [ "bar", [ 100, 400, 600, 2 ] ]
]);

Caveats and limitations

Plan is to improve this over time. Contributions are welcome.

License

Apache-2.0