holzschu / Carnets

Carnets is a stand-alone Jupyter notebook server and client. Edit your notebooks on the go, even where there is no network.
https://holzschu.github.io/Carnets_Jupyter/
BSD 3-Clause "New" or "Revised" License

Reading parquet with pandas #218

Open iamkucuk opened 2 years ago

iamkucuk commented 2 years ago

Hello.

Pandas requires either the pyarrow or the fastparquet engine to read parquet files. Installing fastparquet with "!pip install fastparquet" fails, while installing pyarrow succeeds, and "!pip list" shows the package as installed. However, pandas still cannot use pyarrow and is unable to read parquet files.

Any ideas?

holzschu commented 2 years ago

Hi, I'm reading the docs. I'm not sure why the pyarrow install succeeds, since the package seems to include dynamic libraries.

I think that parquet-python is an earlier version of fastparquet that is pure Python: https://github.com/jcrobak/parquet-python

From the documentation, fastparquet was forked from parquet-python, with the aim of providing a faster implementation (Cython, multi-threaded, etc).

mikebrodt commented 1 year ago

I received an error when trying to install pyarrow, which makes sense. However, I was able to get fastparquet to install, at least according to pip, yet neither pandas nor dask can find it. I receive an error when specifying it as the engine: ImportError: Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.
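One way to dig further, as a minimal sketch: pandas tends to report an engine that fails to import as a missing optional dependency, which can hide the original error. Importing the engines directly may show what actually went wrong. The helper name probe_engine is hypothetical and is not part of pandas or Carnets.

```python
import importlib


def probe_engine(name):
    """Try to import a module by name and return (ok, detail).

    probe_engine is a hypothetical helper written for this example.
    """
    try:
        module = importlib.import_module(name)
    except Exception as exc:  # ImportError, OSError from a failed library load, etc.
        return False, f"{type(exc).__name__}: {exc}"
    return True, getattr(module, "__version__", "unknown version")


# Check both parquet engines that pandas can use.
for engine in ("pyarrow", "fastparquet"):
    ok, detail = probe_engine(engine)
    print(f"{engine}: {'OK' if ok else 'FAILED'} ({detail})")
```

If an engine shows up in "!pip list" but fails here, the detail string should carry the underlying exception rather than the generic "Missing optional dependency" message.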

holzschu commented 1 year ago

Hi, it's the same issue: fastparquet has installed precompiled dynamic libraries (because pip cannot tell the difference between OSX on Arm architecture and iOS on Arm architecture). iOS cannot load these dynamic libraries because of its strict security rules.

The best way to load parquet files is probably to go back to an earlier package that does not have dynamic libraries, such as parquet-python.
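To check whether an installed package ships dynamic libraries, something like the following sketch might help. It assumes Python 3.8 or later (for importlib.metadata) and that the package was installed by pip with a file record. The function name compiled_files and the suffix list are illustrative choices, not an official check.

```python
from importlib import metadata

# File suffixes that usually indicate compiled code (an assumed, non-exhaustive list).
BINARY_SUFFIXES = (".so", ".dylib", ".pyd", ".dll")


def compiled_files(dist_name):
    """Return the compiled files recorded for an installed distribution.

    Returns None if the distribution is not installed, and an empty list
    if no recorded file has a binary suffix (likely pure Python).
    """
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    files = dist.files or []  # dist.files can be None when no record exists
    return [str(f) for f in files if str(f).endswith(BINARY_SUFFIXES)]


for name in ("fastparquet", "pyarrow", "parquet"):
    print(name, compiled_files(name))
```

An empty list suggests the package is pure Python, although its dependencies would need the same check.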

mikebrodt commented 1 year ago

Thanks for the response. I have tried to install parquet-python, but the install fails with an error where Cython tries and fails to assign an int as a double. I can open an issue with that project, but I wanted to check with you first to see whether that was expected.

holzschu commented 1 year ago

I see the problem: parquet-python is pure Python, but it depends on thriftpy2, which itself has one Cython file. Since Carnets has no compiler for Cython code, it cannot install thriftpy2, and thus the install of parquet-python fails. I'm not certain how to fix that one. It might be possible to disable the compilation of the extension by editing thriftpy2/setup.py, but would it work afterwards? I don't know.