rapidsai / cuspatial

CUDA-accelerated GIS and spatiotemporal algorithms
https://docs.rapids.ai/api/cuspatial/stable/
Apache License 2.0

[FEA] Write data to parquet #630

Open dylanrstewart opened 2 years ago

dylanrstewart commented 2 years ago

Is your feature request related to a problem? Please describe. I wish I could use cuSpatial to write data to GeoParquet. I have several GBs of census data with various features stored in GeoParquet format. I would like to be able to load and process the data entirely on the GPU, then store the computed statistics together with their geometries as GeoParquet.

Describe the solution you'd like cugdf.to_parquet()

Describe alternatives you've considered Dask-GeoPandas (CPU)

Additional context One of my complaints with the Dask pyarrow implementation is that reading and writing parquet doesn't automatically keep the _metadata file up to date. I would like that to be the default, if possible.

thomcom commented 2 years ago

cudf has a parquet writer, but it doesn't yet support the ListColumn type that GeoArrow is based on. We might consider moving this request over to cudf to see what they think:

host_dataframe = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
gpu_dataframe = cuspatial.from_geopandas(host_dataframe)
continents_dataframe = gpu_dataframe.sort_values("continent")
continents_dataframe.to_parquet("parquet_file")
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [8], in <cell line: 4>()
      2 gpu_dataframe = cuspatial.from_geopandas(host_dataframe)
      3 continents_dataframe = gpu_dataframe.sort_values("continent")
----> 4 continents_dataframe.to_parquet("parquet_file")

File ~/cudf/python/cudf/cudf/core/dataframe.py:5990, in DataFrame.to_parquet(self, path, *args, **kwargs)
   5987 """{docstring}"""
   5988 from cudf.io import parquet
-> 5990 return parquet.to_parquet(self, path, *args, **kwargs)

File ~/compose/etc/conda/cuda_11.6/envs/notebooks/lib/python3.8/contextlib.py:75, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     72 @wraps(func)
     73 def inner(*args, **kwds):
     74     with self._recreate_cm():
---> 75         return func(*args, **kwds)

File ~/cudf/python/cudf/cudf/io/parquet.py:630, in to_parquet(df, path, engine, compression, index, partition_cols, partition_file_name, partition_offsets, statistics, metadata_file_path, int96_timestamps, row_group_size_bytes, row_group_size_rows, *args, **kwargs)
    628 for col in df._column_names:
    629     if partition_cols is None or col not in partition_cols:
--> 630         if df[col].dtype.name == "category":
    631             raise ValueError(
    632                 "'category' column dtypes are currently not "
    633                 + "supported by the gpu accelerated parquet writer"
    634             )
    636 if partition_cols:

AttributeError: 'str' object has no attribute 'name'
github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.