Open severo opened 7 months ago
Thank you, @severo ! I would be more than happy to help in any way I can. I am not familiar with this repo's codebase, but I would be eager to contribute. :)
For the preview in the Datasets Hub, I think it makes sense to just display the geospatial column as text. If there were a dataset loader, though, I think it should be able to support the geospatial components. GeoPandas is probably the most user-friendly interface for that. I'm not sure if it's currently relevant in the context of GeoParquet, but I think the pyogrio driver is faster than fiona.
But the whole GDAL dependency thing can be a real pain. If anything, it would need to be an optional dependency. Maybe it would be best if the loader tries importing the relevant geospatial libraries and, in the event of an ImportError, falls back to text for the geometry column.
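A minimal sketch of that optional-import fallback, assuming the geometry arrives as WKB bytes. The function name geometry_to_text and the hex fallback are illustrative choices, not anything datasets does today; shapely stands in for whichever geospatial library the loader would pick.

```python
import struct


def geometry_to_text(wkb: bytes) -> str:
    """Render a WKB geometry as text, with or without geospatial libraries."""
    try:
        # Optional dependency: only needed for the nicer WKT rendering.
        from shapely import wkb as shapely_wkb
    except ImportError:
        # Fallback: plain hex text, so the column is still displayable.
        return wkb.hex()
    return shapely_wkb.loads(wkb).wkt


# Hand-built little-endian WKB for a 2D point at (1.0, 2.0):
# byte order flag 1, geometry type 1 (Point), then x and y as doubles.
point_wkb = struct.pack("<BIdd", 1, 1, 1.0, 2.0)
print(geometry_to_text(point_wkb))
```

With shapely installed this prints WKT; without it, the same call prints the hex string, so the viewer never fails on a missing dependency.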
Please let me know if I can be of assistance, and thanks again for creating this Issue. :)
I'm also running into this issue when trying to show GeoParquet files in the Dataset Viewer. I tried to implement a custom reader for GeoParquet in https://huggingface.co/datasets/weiji14/clay_vector_embeddings/discussions/1, but it seems that HuggingFace has disabled the dataset viewer for datasets with custom loading scripts, according to https://discuss.huggingface.co/t/dataset-repo-requires-arbitrary-python-code-execution/59346 :frowning_face:
I'm now wondering if there's a way to simply map files with GeoParquet extensions (.gpq, .geoparquet, etc) to the Parquet reader. Maybe we could allowlist these GeoParquet file extensions at https://github.com/huggingface/datasets/blame/0caf91285116ec910f409e82cc6e1f4cff7496e3/src/datasets/packaged_modules/__init__.py#L30-L51? Having the table columns show up would be a quick win.
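Roughly, the allowlisting idea could look like the sketch below. The table EXTENSION_TO_READER and the helper reader_for are hypothetical stand-ins written for illustration; they are not the actual contents of packaged_modules/__init__.py.

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-reader table (illustration only).
EXTENSION_TO_READER = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
}

# GeoParquet files are valid Parquet files, so route them to the same reader.
GEOPARQUET_EXTENSIONS = (".geoparquet", ".gpq")
EXTENSION_TO_READER.update({ext: "parquet" for ext in GEOPARQUET_EXTENSIONS})


def reader_for(filename):
    """Return the reader name for a file, or None if the extension is unknown."""
    suffix = PurePosixPath(filename).suffix.lower()
    return EXTENSION_TO_READER.get(suffix)


print(reader_for("data/embeddings.gpq"))
```

The geometry column would still come through as raw binary with this approach; only the file discovery changes.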
Longer term though, it would certainly be nice if the WKB geometry columns could be displayed in a nicer form. GeoPandas' read_parquet function is supposedly faster than pyogrio.read_dataframe according to https://github.com/geopandas/geopandas/discussions/2724#discussioncomment-4606048, but there's also pyogrio.raw.read_arrow now that can read into a pyarrow.Table directly.
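Before any WKB column can be rendered nicely, a reader has to know which columns hold geometries. GeoParquet records this as JSON under the "geo" key of the Parquet file metadata, so a sketch along these lines could find them without any geospatial dependency. The helper name geometry_columns is made up for illustration, and the input is assumed to be a bytes-to-bytes mapping such as the one pyarrow exposes as Schema.metadata.

```python
import json


def geometry_columns(schema_metadata):
    """Map geometry column names to their encoding, per the "geo" metadata.

    Returns an empty dict when the metadata is None or has no b"geo" key,
    i.e. when the file is plain Parquet rather than GeoParquet.
    """
    raw = (schema_metadata or {}).get(b"geo")
    if raw is None:
        return {}
    geo = json.loads(raw)
    return {
        name: column.get("encoding")
        for name, column in geo.get("columns", {}).items()
    }


# Hand-written metadata imitating what a GeoParquet file would carry.
example_metadata = {
    b"geo": json.dumps(
        {
            "version": "1.0.0",
            "primary_column": "geometry",
            "columns": {
                "geometry": {"encoding": "WKB", "geometry_types": ["Polygon"]}
            },
        }
    ).encode("utf-8")
}
print(geometry_columns(example_metadata))
```

A viewer could use the returned names to decide which binary columns to decode and which to leave alone.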
Update: It looks like renaming the GeoParquet file to have a file extension of *.parquet works (see https://huggingface.co/datasets/weiji14/clay_vector_embeddings). HuggingFace's default parquet reader is able to read the GeoParquet file, though the geometry column is of an unknown type.
I've opened a quick PR at #6508 to allow files with a *.geoparquet or *.gpq extension to be read using the default Parquet reader. Let's see how that goes :smile:
@joshuasundance-swca, @weiji14, if I'm understanding this correctly, the code below wouldn't be recommended due to dependency headaches? If that's the case, what solution would there be to see the geometry features for .gpq files on the Hugging Face Hub?
Code for dataset_loader.py:

    import geopandas as gpd
    # ... (other imports remain the same)

    class ClayVectorEmbeddings(datasets.ArrowBasedBuilder):
        # ... (other parts of the class remain the same)

        def _info(self):
            # Read the GeoParquet file to get the schema for the 'geometry' feature
            gdf = gpd.read_file("path/to/your/geoparquet/file.gpq")  # Replace with your file path
            geometry_schema = str(gdf.geometry.dtype)
            return datasets.DatasetInfo(
                # This is the description that will appear on the datasets page.
                description="Clay Vector Embeddings in GeoParquet format.",
                # This defines the different columns of the dataset and their types
                features=datasets.Features(
                    {
                        "source_url": datasets.Value(dtype="string"),
                        "date": datasets.Value(dtype="date32"),
                        "embeddings": datasets.Value("string"),
                        "geometry": datasets.Value(dtype=geometry_schema),  # Use the schema read by GeoPandas
                        # ... (other features)
                    }
                ),
            )

        # ... (rest of the script remains the same)
Hi @mehrdad-es, I'm not sure if HuggingFace would be keen to add geopandas to HuggingFace Hub (maybe a question for @severo?). Having a geometry viewer would be an even bigger task, and if you're thinking of a map-viewer, it might involve some redesign of the website UI. Some of my colleagues are working on streamlining GeoParquet visualization from cloud-hosted instances like HuggingFace (see e.g. https://github.com/developmentseed/lonboard/issues/314), and we could definitely come up with something if there's interest.
I've created https://github.com/huggingface/datasets-server/issues/2416 to discuss the possibility of supporting (vectorial) geospatial columns in the dataset viewer, or in the converted parquet files.
At the same time, it would be super interesting to see what is already possible to do with a Hugging Face dataset that hosts geospatial data.
> Some of my colleagues are working on streamlining GeoParquet visualization from cloud-hosted instances like HuggingFace (see e.g. https://github.com/developmentseed/lonboard/issues/314), and we could definitely come up with something if there's interest.
It would be awesome to show this inside a Space.
Feature request
Support the GeoParquet format
Motivation
GeoParquet (https://geoparquet.org/) is a common format for sharing vectorial geospatial data on the cloud, along with "traditional" data columns.
It would be nice to be able to load this format with datasets, and more generally, in the Datasets Hub (see https://huggingface.co/datasets/joshuasundance/govgis_nov2023-slim-spatial/discussions/1).
Your contribution
I would be happy to help work on a PR (but I don't think I can do one on my own).
Also, we have to define what we want to support: