jcrobak / parquet-python

python implementation of the parquet columnar file format.
Apache License 2.0

Consider major rewrite #46

Closed jcrobak closed 7 years ago

jcrobak commented 8 years ago

@martindurant has put together a shiny new implementation that improves performance, adds interop with dataframe libraries, and adds write support. See https://github.com/martindurant/parquet-python/pull/3

The major changes are new interfaces and dependencies on several new packages (numpy, pandas, numba, dask). I'd love feedback from folks using parquet-python on how invasive those changes would be...especially given the historic problems installing some of those libraries.

Please let me know what you think. Some folks that have contributed and may have an opinion include @SergeNov @turicas @spaztic1215 but anyone is welcome to chime in!

spaztic1215 commented 8 years ago

The interoperability with dataframes and the potential efficiencies from dask's task scheduling are exciting, but unfortunately for Hue we would have to omit the dependencies on numba and possibly dask due to licensing constraints. It would be nice to make these optional dependencies if these changes are pulled into parquet-python.

mrocklin commented 8 years ago

My guess is that Dask is optional but that Numba is not.

mrocklin commented 8 years ago

Out of curiosity, why is Numba a problem but not NumPy or Pandas (which have the same license)? Is there a constraint other than the choice of license at play here?

spaztic1215 commented 8 years ago

BSD and MIT licenses are generally fine, but we'd have to check on some of numba's dependencies for Python 2 (Hue still has to support Python 2.6 for now) like funcsigs.

martindurant commented 8 years ago

I have made no particular effort yet to make my code compatible with python 2 while I am still developing core functionality, but I don't suppose it should be too onerous.

turicas commented 8 years ago

Interoperability with dataframes and other structures is very important, but I think it should not be mandatory, since there are many use cases where installing all those libraries would be overkill. For example: what if I just want to extract data from a parquet file and convert it to a CSV? If the entire architecture is well documented and modular, I think we could have some extra features available when these libraries are installed, but the bare minimum to read/write parquet files should work without them.
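
As an illustration of the optional-dependency layout suggested above, here is a minimal sketch. It is not taken from parquet-python or from the proposed rewrite: `optional_import`, `rows_to_csv`, and `rows_to_dataframe` are hypothetical names, and the dict rows stand in for records that some reader has already decoded from a parquet file. The core CSV path uses only the standard library, while the dataframe helper works only when pandas happens to be installed.

```python
import csv
import importlib
import io


def optional_import(name):
    """Return the named module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Assumption for this sketch: pandas is an optional extra, not a hard requirement.
pd = optional_import("pandas")


def rows_to_csv(rows, fieldnames):
    """Core path: serialize dict rows to CSV text with the standard library only."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def rows_to_dataframe(rows):
    """Extra feature: available only when pandas is installed."""
    if pd is None:
        raise ImportError("pandas is required for rows_to_dataframe()")
    return pd.DataFrame(rows)
```

With this shape, a minimal install can still convert records to CSV, and callers who want dataframes get a clear error telling them which package is missing.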

aloneguid commented 7 years ago

Please don't rewrite. This is the only library written with understandable code, unlike parquet-mr and parquet-cpp, which are easier to rewrite than to read.

martindurant commented 7 years ago

@aloneguid, I agree that this library is nicely written and blissfully few lines of code. I have attempted to make my version, which was forked from here, respect this style, and I believe (although this is subjective) that the result is very hackable.

To everyone: we have announced beta status here, and the GitHub repo is now here, with docs on RTD.