ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

Modernizing this project #112

Open hayes opened 3 years ago

hayes commented 3 years ago

Hey, I am interested in using and contributing to this project. Specifically there are a couple of things I would really like to add for my own purposes:

  1. Support for enums
  2. Proper support for >53 bit integers (using BigInt)
  3. Migrate to LogicalTypes from ConvertedTypes to enable other new features in the future.
  4. TypeScript support
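
To make item 2 concrete, here is a minimal sketch of why 64-bit integers need BigInt in JavaScript. It assumes Node.js 12 or later (for the `Buffer` BigInt methods) and relies on Parquet's PLAIN encoding storing INT64 values as 8 little-endian bytes. The helper `readInt64AsNumber` is a hypothetical name used only for illustration; it is not part of parquetjs or any other library.

```javascript
// A Number is an IEEE 754 double, so integers are exact only up to 2^53 - 1.
const maxSafe = Number.MAX_SAFE_INTEGER;

// Above that limit, distinct integers collapse onto the same Number.
const lossy = 2 ** 53 + 1;           // the true value cannot be represented
const collapsed = lossy === 2 ** 53; // the "+ 1" was silently lost

// Round-tripping an INT64 through its 8-byte little-endian layout with BigInt
// keeps every bit.
const buf = Buffer.alloc(8);
buf.writeBigInt64LE(9007199254740993n, 0);
const exact = buf.readBigInt64LE(0);

// Hypothetical helper (an assumption, not an existing API): return a plain
// Number when that is lossless, and fail loudly instead of rounding otherwise.
function readInt64AsNumber(buffer, offset = 0) {
  const value = buffer.readBigInt64LE(offset);
  if (
    value > BigInt(Number.MAX_SAFE_INTEGER) ||
    value < BigInt(Number.MIN_SAFE_INTEGER)
  ) {
    throw new RangeError(`INT64 value ${value} does not fit in a Number exactly`);
  }
  return Number(value);
}
```

Returning BigInt everywhere is the simpler contract; a helper like this one is only one possible way to stay compatible with callers that expect Numbers, and switching the return type would likely be a breaking change either way.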

I would be happy to contribute a lot of this myself, but some of this would take significant amounts of work, and I don't want to put a lot of time into something if this isn't compatible with your vision of this project. If I were to spend a significant amount of time adding some of these features, I would also love to add a few developer tools to make things a little easier along the way including:

I am mostly opening this issue to get an idea of what you have in mind for this repo, how open you are to outside contributions, and how responsive/receptive you would be to these types of changes! I don't want to step on any toes, and I know some of what I am proposing would constitute breaking changes and would need careful consideration.

dobesv commented 3 years ago

There's a decent fork of this with TypeScript support:

https://github.com/kbajalc/parquets

Fixes a few bugs compared to this repo, I believe.

It seems to be unmaintained, though, like this project.

I think this other one, parquetjs-lite, is somewhat actively maintained: https://github.com/ZJONSSON/parquetjs

But it doesn't have TypeScript support.

Sad times for parquet in JavaScript!

hayes commented 3 years ago

I ended up starting from scratch. I think my requirements are probably outside the scope of what would be easy to achieve through pull requests to this (or any other existing) project. It's going to be a while before I have enough time to polish it enough for other people to use, but the WIP is here (undocumented, not currently working, and going to change significantly): https://github.com/hayes/node-parquet

Right now it doesn't even build because I am in the process of re-organizing all the code, but I made pretty good progress on the reader side of things before I started re-structuring. I was able to parse several of the examples from https://github.com/apache/parquet-testing/tree/master/data, and added support for most of the main encodings and compressions. Hopefully in a month or so I'll have something that can read and write reliably, and can start working on the stuff I actually want, like filtering, pluggable file systems for efficient reads/writes from S3, and other more advanced features.
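
To illustrate the pluggable file system idea, here is a rough sketch of what such a contract could look like and why it helps with remote storage. It is not the API of node-parquet or parquetjs: the `size()`/`read(offset, length)` contract and the names `memorySource` and `locateFooter` are assumptions made up for this example. The only format fact it relies on is that a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`.

```javascript
// Hypothetical plug-in contract: any object exposing
//   size(): Promise<number>
//   read(offset, length): Promise<Buffer>
// An S3-backed source could implement read() with ranged GET requests.

// In-memory implementation, so the sketch runs without any I/O.
function memorySource(buffer) {
  return {
    size: async () => buffer.length,
    read: async (offset, length) => buffer.subarray(offset, offset + length),
  };
}

const MAGIC = 'PAR1';

// Find the footer metadata by reading only the last 8 bytes of the file:
// a 4-byte little-endian metadata length, then the trailing magic.
async function locateFooter(source) {
  const size = await source.size();
  // Leading magic (4) + footer length (4) + trailing magic (4).
  if (size < 12) throw new Error('too small to be a Parquet file');

  const tail = await source.read(size - 8, 8);
  if (tail.toString('latin1', 4, 8) !== MAGIC) {
    throw new Error('missing trailing PAR1 magic');
  }

  const length = tail.readUInt32LE(0);
  const offset = size - 8 - length;
  if (offset < 4) throw new Error('footer length exceeds file size');
  return { offset, length };
}

// Usage with a fake file: magic, 10 bytes of stand-in metadata, length, magic.
const footerLength = Buffer.alloc(4);
footerLength.writeUInt32LE(10, 0);
const fakeFile = Buffer.concat([
  Buffer.from(MAGIC, 'latin1'),
  Buffer.alloc(10),
  footerLength,
  Buffer.from(MAGIC, 'latin1'),
]);
const footerPromise = locateFooter(memorySource(fakeFile));
```

With a contract like this, a reader only ever asks for byte ranges, so the footer and the needed column chunks can be fetched from remote storage without downloading the whole file.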