glue-viz / glue

Linked Data Visualizations Across Multiple Files
http://glueviz.org

Load parquet files #2416

Open Gabriel-p opened 1 year ago

Gabriel-p commented 1 year ago

Is your feature request related to a problem? Please describe it: Parquet files written by pandas cannot be loaded in glue.

Describe the solution you'd like: Add support for loading Parquet files.

Carifio24 commented 1 year ago

I agree that the ability to read Parquet files would be nice. It's probably worth investigating whether using something like pyarrow directly offers any performance gain over pandas.read_parquet. In the meantime, if you're interested in a very minimal example of a Parquet data loader, you can add the snippet below (which requires pyarrow) to your glue config file. It should let you load at least basic Parquet files:

from glue.config import data_factory
from glue.core.data_factories.helpers import has_extension
from glue.core.data_factories.pandas import panda_process

from pandas import read_parquet


# Register a loader that glue offers for files ending in ".parquet".
@data_factory(label="Parquet file", identifier=has_extension("parquet"))
def pandas_read_parquet(path, engine="pyarrow", **kwargs):
    # Extra keyword arguments are accepted but not forwarded to pandas.
    df = read_parquet(path, engine=engine)
    # Convert the DataFrame into a glue Data object.
    return panda_process(df)

Gabriel-p commented 1 year ago

Thank you! It worked perfectly. I just removed the engine specification, since my files open just fine with whatever pandas does by default.
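
For context on "whatever pandas does by default": when `engine` is left out, pandas.read_parquet uses `engine="auto"`, which tries pyarrow first and falls back to fastparquet if pyarrow is unavailable. A small sketch for checking which of the two is installed in the environment glue runs in (the helper name `available_parquet_engines` is hypothetical, and it only checks that the package can be found, not that its version is one pandas accepts):

```python
from importlib.util import find_spec


def available_parquet_engines():
    """List installed Parquet engines in the order pandas' "auto" mode tries them."""
    candidates = ("pyarrow", "fastparquet")
    return [name for name in candidates if find_spec(name) is not None]


if __name__ == "__main__":
    engines = available_parquet_engines()
    if engines:
        print("engine='auto' would try:", engines[0])
    else:
        print("No Parquet engine found; install pyarrow or fastparquet.")
```

If the list comes back empty, the loader will fail with or without the `engine` argument until one of the two packages is installed.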