huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
18.75k stars 2.59k forks

Support GeoParquet #6438

Open severo opened 7 months ago

severo commented 7 months ago

Feature request

Support the GeoParquet format

Motivation

GeoParquet (https://geoparquet.org/) is a common format for sharing vector geospatial data on the cloud, along with "traditional" data columns.

It would be nice to be able to load this format with datasets, and more generally, in the Datasets Hub (see https://huggingface.co/datasets/joshuasundance/govgis_nov2023-slim-spatial/discussions/1).

Your contribution

I would be happy to help work on a PR (but I don't think I can do one on my own).

Also, we have to define what we want to support:

joshuasundance-swca commented 7 months ago

Thank you, @severo ! I would be more than happy to help in any way I can. I am not familiar with this repo's codebase, but I would be eager to contribute. :)

For the preview in Datasets Hub, I think it makes sense to just display the geospatial column as text. If there were a dataset loader, though, I think it should be able to support the geospatial components. Geopandas is probably the most user-friendly interface for that. I'm not sure if it's currently relevant in the context of geoparquet, but I think the pyogrio driver is faster than fiona.

But the whole gdal dependency thing can be a real pain. If anything, it would need to be an optional dependency. Maybe it would be best if the loader tries importing relevant geospatial libraries, and in the event of an ImportError, falls back to text for the geometry column.

Please let me know if I can be of assistance, and thanks again for creating this Issue. :)
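
The optional-dependency fallback suggested above could look roughly like the sketch below. It is only an illustration, not code from the datasets library: the helper name geometry_to_text is hypothetical, shapely stands in for "relevant geospatial libraries", and the hex string is just one possible text fallback.

```python
import struct


def geometry_to_text(wkb_bytes: bytes) -> str:
    """Render a WKB geometry as text, using shapely only if it is installed."""
    try:
        from shapely import wkb  # optional geospatial dependency
    except ImportError:
        # Fallback: no geospatial library available, show the raw bytes as hex text
        return wkb_bytes.hex()
    # shapely is available: decode the WKB and return its WKT representation
    return wkb.loads(wkb_bytes).wkt


# A little-endian WKB point at (1.0, 2.0), built by hand for the example:
# byte order flag (1), geometry type (1 = Point), then x and y as doubles
point_wkb = struct.pack("<BIdd", 1, 1, 1.0, 2.0)
print(geometry_to_text(point_wkb))
```

Either branch returns a string, so a caller (for example a preview) would not need to know whether the geospatial stack is installed.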

weiji14 commented 6 months ago

I just ran into this same issue when trying to show GeoParquet files in the Dataset Viewer. I tried to implement a custom reader for GeoParquet in https://huggingface.co/datasets/weiji14/clay_vector_embeddings/discussions/1, but it seems that Hugging Face has disabled the dataset viewer for datasets with custom loading scripts, according to https://discuss.huggingface.co/t/dataset-repo-requires-arbitrary-python-code-execution/59346 :frowning_face:


I'm now wondering whether there's a way to simply map files with GeoParquet extensions (.gpq, .geoparquet, etc.) to the Parquet reader. Maybe we could allowlist these GeoParquet file extensions at https://github.com/huggingface/datasets/blame/0caf91285116ec910f409e82cc6e1f4cff7496e3/src/datasets/packaged_modules/__init__.py#L30-L51? Having the table columns show up would be a quick win.
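
To make the allowlist idea concrete, here is a minimal, self-contained sketch of an extension-to-reader lookup. The table EXTENSION_TO_READER and the function reader_for are hypothetical names made up for this example; they are not the actual contents of the linked packaged_modules/__init__.py.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical extension-to-reader table (assumption for illustration only)
EXTENSION_TO_READER = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    # GeoParquet files are valid Parquet files, so route them to the same reader
    ".geoparquet": "parquet",
    ".gpq": "parquet",
}


def reader_for(filename: str) -> Optional[str]:
    """Return the reader name for a file, or None if the extension is unknown."""
    suffix = PurePosixPath(filename).suffix.lower()
    return EXTENSION_TO_READER.get(suffix)


print(reader_for("data/embeddings.gpq"))
print(reader_for("notes.txt"))
```

With a mapping like this, the geometry column would still come through as an opaque column, but the other columns would load as usual.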

Longer term though, it would certainly be nice if the WKB geometry columns could be displayed in a nicer form. Geopandas' read_parquet function is supposedly faster than pyogrio.read_dataframe according to https://github.com/geopandas/geopandas/discussions/2724#discussioncomment-4606048, but there's also pyogrio.raw.read_arrow now that can read into a pyarrow.Table directly.

weiji14 commented 6 months ago

Update: It looks like renaming the GeoParquet file to have a file extension of *.parquet works (see https://huggingface.co/datasets/weiji14/clay_vector_embeddings). HuggingFace's default parquet reader is able to read the GeoParquet file, though the geometry column is of an unknown type:

image
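
For context on why a plain Parquet reader sees the geometry as an opaque column: GeoParquet describes its geometry columns in a JSON document stored under the "geo" key of the Parquet file metadata. The sketch below shows one way to read that information. The function name geometry_columns and the sample metadata are assumptions made up for the example; with pyarrow, the input dictionary would typically come from the file schema's metadata.

```python
import json


def geometry_columns(file_metadata: dict) -> dict:
    """Map each geometry column name to the encoding declared in the "geo" metadata."""
    raw = file_metadata.get(b"geo")
    if raw is None:
        # Not a GeoParquet file, or the metadata was dropped along the way
        return {}
    geo = json.loads(raw)
    return {
        name: column.get("encoding")
        for name, column in geo.get("columns", {}).items()
    }


# Hand-written sample metadata, shaped like a GeoParquet "geo" entry.
# With pyarrow, one could pass `pyarrow.parquet.read_schema(path).metadata or {}`.
sample_metadata = {
    b"geo": json.dumps(
        {
            "version": "1.0.0",
            "primary_column": "geometry",
            "columns": {
                "geometry": {"encoding": "WKB", "geometry_types": ["Point"]}
            },
        }
    ).encode("utf-8")
}

print(geometry_columns(sample_metadata))
```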

I've opened a quick PR at #6508 to allow files with a *.geoparquet or *.gpq extension to be read using the default Parquet reader. Let's see how that goes :smile:

murdadesmaeeli commented 6 months ago

@joshuasundance-swca, @weiji14, if I'm understanding this correctly, the code below wouldn't be recommended due to dependency headaches? If that's the case, what solution would there be to see the geometry features for .gpq files on the Hugging Face Hub?

code for dataset_loader.py

import datasets
import geopandas as gpd
# ... (other imports remain the same)

class ClayVectorEmbeddings(datasets.ArrowBasedBuilder):
    # ... (other parts of the class remain the same)

    def _info(self):
        # Read the GeoParquet file to get the schema for the 'geometry' feature.
        # GeoParquet files are read with read_parquet rather than read_file.
        gdf = gpd.read_parquet("path/to/your/geoparquet/file.gpq")  # Replace with your file path
        geometry_schema = str(gdf.geometry.dtype)  # GeoPandas reports this as "geometry"

        return datasets.DatasetInfo(
            # This is the description that will appear on the datasets page.
            description="Clay Vector Embeddings in GeoParquet format.",
            # This defines the different columns of the dataset and their types
            features=datasets.Features(
                {
                    "source_url": datasets.Value(dtype="string"),
                    "date": datasets.Value(dtype="date32"),
                    "embeddings": datasets.Value("string"),
                    # Use the schema read by GeoPandas.
                    # NOTE: "geometry" is not an Arrow type name, so this Value may be
                    # rejected; the WKB column itself is stored as binary.
                    "geometry": datasets.Value(dtype=geometry_schema),
                    # ... (other features)
                }
            ),
        )

# ... (rest of the script remains the same)

weiji14 commented 5 months ago

Hi @mehrdad-es, I'm not sure if HuggingFace would be keen to add geopandas to HuggingFace Hub (maybe a question for @severo?). Having a geometry viewer would be an even bigger task, and if you're thinking of a map-viewer, it might involve some redesign of the website UI. Some of my colleagues are working on streamlining GeoParquet visualization from cloud-hosted instances like HuggingFace (see e.g. https://github.com/developmentseed/lonboard/issues/314), and we could definitely come up with something if there's interest.

severo commented 5 months ago

I've created https://github.com/huggingface/datasets-server/issues/2416 to discuss the possibility of supporting (vector) geospatial columns in the dataset viewer, or in the converted parquet files.

At the same time, it would be super interesting to see what is already possible to do with a Hugging Face dataset that hosts geospatial data.

> Some of my colleagues are working on streamlining GeoParquet visualization from cloud-hosted instances like HuggingFace (see e.g. https://github.com/developmentseed/lonboard/issues/314), and we could definitely come up with something if there's interest.

It would be awesome to show this inside a Space.