Open dylanrstewart opened 2 years ago
cudf has a parquet writer, but it doesn't yet support the ListColumn type that GeoArrow is based on. We might consider moving this request over to cudf and seeing what they think:
```python
host_dataframe = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
gpu_dataframe = cuspatial.from_geopandas(host_dataframe)
continents_dataframe = gpu_dataframe.sort_values("continent")
continents_dataframe.to_parquet("parquet_file")
```
```
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [8], in <cell line: 4>()
2 gpu_dataframe = cuspatial.from_geopandas(host_dataframe)
3 continents_dataframe = gpu_dataframe.sort_values("continent")
----> 4 continents_dataframe.to_parquet("parquet_file")
File ~/cudf/python/cudf/cudf/core/dataframe.py:5990, in DataFrame.to_parquet(self, path, *args, **kwargs)
5987 """{docstring}"""
5988 from cudf.io import parquet
-> 5990 return parquet.to_parquet(self, path, *args, **kwargs)
File ~/compose/etc/conda/cuda_11.6/envs/notebooks/lib/python3.8/contextlib.py:75, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
72 @wraps(func)
73 def inner(*args, **kwds):
74 with self._recreate_cm():
---> 75 return func(*args, **kwds)
File ~/cudf/python/cudf/cudf/io/parquet.py:630, in to_parquet(df, path, engine, compression, index, partition_cols, partition_file_name, partition_offsets, statistics, metadata_file_path, int96_timestamps, row_group_size_bytes, row_group_size_rows, *args, **kwargs)
628 for col in df._column_names:
629 if partition_cols is None or col not in partition_cols:
--> 630 if df[col].dtype.name == "category":
631 raise ValueError(
632 "'category' column dtypes are currently not "
633 + "supported by the gpu accelerated parquet writer"
634 )
636 if partition_cols:
AttributeError: 'str' object has no attribute 'name'
```
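The AttributeError itself is narrower than the missing feature: the category check at `parquet.py:630` assumes every column's dtype is a numpy-style dtype object with a `.name` attribute, while the geometry column reports its dtype as a plain string. A standalone sketch of that failure mode (the class names here are hypothetical stand-ins, not cudf's):

```python
import numpy as np

class NumericColumn:
    dtype = np.dtype("float64")  # real dtype object, has a .name attribute

class GeoColumn:
    dtype = "geometry"           # plain str; accessing .name raises AttributeError

def dtype_name(col):
    # Defensive variant of the check: fall back to str() when dtype
    # is not a dtype-like object carrying a .name attribute.
    d = col.dtype
    return d.name if hasattr(d, "name") else str(d)

print(dtype_name(NumericColumn()))  # float64
print(dtype_name(GeoColumn()))      # geometry
```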
Is your feature request related to a problem? Please describe.
I wish I could use cuSpatial to write data to GeoParquet. I have several GBs of census data with various features stored in GeoParquet format. I would like to be able to load and process the data without relying on the CPU, then store statistics with geometries as GeoParquet.
Describe the solution you'd like
cugdf.to_parquet()
Describe alternatives you've considered
Dask-GeoPandas (CPU)
Additional context
One of my complaints with the Dask/pyarrow implementation is that reading and writing parquet doesn't automatically keep the `_metadata` file information up to date. I would like that to be the default, if possible.