huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Support vectorial geospatial columns #2416

Open · severo opened 8 months ago

severo commented 8 months ago

Requires https://github.com/huggingface/datasets/issues/6438 to support GeoParquet. We could support more formats.

Possibly requires geopandas as a dependency.
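
For the record, a minimal sketch of what geopandas would bring, assuming a local GeoParquet file (the path `data.parquet` is hypothetical): it reads the GeoParquet `geo` metadata and decodes the WKB geometry column, which a plain Parquet reader would leave as raw binary.

```python
import geopandas as gpd

# geopandas.read_parquet understands the GeoParquet "geo" metadata and
# decodes the WKB-encoded geometry column into shapely geometries.
gdf = gpd.read_parquet("data.parquet")  # hypothetical local file
print(gdf.geometry.head())
print(gdf.crs)  # coordinate reference system parsed from the metadata
```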

severo commented 8 months ago

At least, https://github.com/huggingface/datasets-server/issues/2428 will "Read GeoParquet files using the parquet reader" (https://github.com/huggingface/datasets/pull/6508).
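
To illustrate (a sketch, with a hypothetical local file `data.parquet`): GeoParquet files are valid Parquet, so the generic parquet loader can read them directly, with the geometry column surfacing as WKB-encoded binary rather than decoded geometries.

```python
from datasets import load_dataset

# The generic parquet loader reads GeoParquet like any other Parquet
# file; the geometry column arrives as WKB-encoded binary.
ds = load_dataset("parquet", data_files={"train": "data.parquet"})
print(ds["train"].features)  # geometry shows up as a binary Value
```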

weiji14 commented 7 months ago

Thanks @severo for opening this! As I understand it, an update is needed on the server to pull in https://github.com/huggingface/datasets/pull/6508 so that GeoParquet datasets like https://huggingface.co/datasets/joshuasundance/govgis_nov2023-slim-spatial show up on the Dataset Viewer. Is that right?

severo commented 7 months ago

It does :)

[Screenshot 2024-02-12 at 23:06:03: the dataset viewer rendering the dataset]

Note that we only have the first 100 rows of this dataset, because we ran into two other issues!

1. Size of the row groups in the GeoParquet files:

   ```
   worker.job_runners.config.parquet_and_info.TooBigRowGroupsError: Parquet file has too big row groups. First row group has 950423110 which exceeds the limit of 300000000
   ```

2. Issue with the features:

   ```
   datasets.table.CastError: Couldn't cast
   id: string
   name: string
   type: string
   description: string
   url: string
   metadata_text: string
   embeddings: list<element: double>
     child 0, element: double
   geometry: binary
   -- schema metadata --
   pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 1178
   geo: '{"primary_column": "geometry", "columns": {"geometry": {"encoding":' + 1306
   to
   {'id': Value(dtype='string', id=None), 'name': Value(dtype='string', id=None), 'type': Value(dtype='string', id=None), 'description': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'metadata_text': Value(dtype='string', id=None), 'geometry': Value(dtype='binary', id=None)}
   because column names don't match
   ```

weiji14 commented 7 months ago

Awesome, this is a big step forward!

> size of the row groups in the GeoParquet files:

To be honest, 950423110 does seem like a bit much for a single row group, but a row group shouldn't be too small either. DuckDB has some nice guidance about this here: https://duckdb.org/docs/guides/performance/file_formats#the-effect-of-row-group-sizes
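
If the file gets rewritten upstream, a minimal sketch of reducing the row group size with pyarrow (file names are hypothetical, and `row_group_size` counts rows, so the value should be picked so each group's byte size lands under the viewer's limit):

```python
import pyarrow.parquet as pq

# Read the original GeoParquet file and rewrite it with smaller row
# groups. Schema metadata, including the GeoParquet "geo" key, is
# carried over by pyarrow's read/write round trip.
table = pq.read_table("original.parquet")  # hypothetical file name
pq.write_table(table, "resized.parquet", row_group_size=10_000)
```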

> issue with the features:

Hmm, which field is the CastError on? Is it something in the schema metadata? The log seems truncated, so I can't quite tell.

severo commented 7 months ago

The full traceback:

```
Traceback (most recent call last):
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1169, in compute_config_parquet_and_info_response
    fill_builder_info(builder, hf_endpoint=hf_endpoint, hf_token=hf_token, validate=validate)
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 619, in fill_builder_info
    parquet_files_and_sizes: list[tuple[pq.ParquetFile, int]] = thread_map(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "/src/services/worker/.venv/lib/python3.9/site-packages/tqdm/std.py", line 1166, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
    yield fs.pop().result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 556, in retry_and_validate_get_parquet_file_and_size
    validate(pf)
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 588, in validate
    raise TooBigRowGroupsError(
worker.job_runners.config.parquet_and_info.TooBigRowGroupsError: Parquet file has too big row groups. First row group has 950423110 which exceeds the limit of 300000000

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1973, in _prepare_split_single
    for _, table in generator:
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 712, in wrapped
    for item in generator(*args, **kwargs):
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/parquet/parquet.py", line 94, in _generate_tables
    yield f"{file_idx}_{batch_idx}", self._cast_table(pa_table)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/parquet/parquet.py", line 74, in _cast_table
    pa_table = table_cast(pa_table, self.info.features.arrow_schema)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2240, in table_cast
    return cast_table_to_schema(table, schema)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2194, in cast_table_to_schema
    raise CastError(
datasets.table.CastError: Couldn't cast
id: string
name: string
type: string
description: string
url: string
metadata_text: string
embeddings: list<element: double>
  child 0, element: double
geometry: binary
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 1178
geo: '{"primary_column": "geometry", "columns": {"geometry": {"encoding":' + 1306
to
{'id': Value(dtype='string', id=None), 'name': Value(dtype='string', id=None), 'type': Value(dtype='string', id=None), 'description': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'metadata_text': Value(dtype='string', id=None), 'geometry': Value(dtype='binary', id=None)}
because column names don't match

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/src/services/worker/src/worker/job_manager.py", line 158, in process
    job_result = self.job_runner.compute()
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1288, in compute
    compute_config_parquet_and_info_response(
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1178, in compute_config_parquet_and_info_response
    parquet_operations, partial = stream_convert_to_parquet(
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 802, in stream_convert_to_parquet
    builder._prepare_split(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1860, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 2016, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
```

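Comparing the two schemas in the error, the culprit seems to be the `embeddings` column: it exists in the Parquet file but not in the features the builder casts to, hence "column names don't match". A quick way to inspect this locally (a sketch; the file path is hypothetical):

```python
import pyarrow.parquet as pq

# Read only the file's schema, without loading any data.
schema = pq.read_schema("data.parquet")
print(schema)                 # includes `embeddings: list<element: double>`
print(list(schema.metadata))  # raw b"pandas" and b"geo" metadata keys
```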
severo commented 6 months ago

Related (raster rather than vectorial: GeoTIFF): https://github.com/huggingface/datasets/issues/6740

severo commented 3 months ago

geopandas has reached 1.0.0

https://github.com/geopandas/geopandas/releases/tag/v1.0.0