huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Heavy jobs for split-descriptive-statistics failing #2679

Closed by AndreaFrancis 7 months ago

AndreaFrancis commented 7 months ago

Many heavy jobs of type split-descriptive-statistics are failing:

INFO: 2024-04-08 15:34:20,432 - root - [split-descriptive-statistics] compute JobManager(job_id=66140a2ce7e9f14f2970867f dataset=eurecom-ds/scoresdeve_activations_shapes3d job_info={'job_id': '66140a2ce7e9f14f2970867f', 'type': 'split-descriptive-statistics', 'params': {'dataset': 'eurecom-ds/scoresdeve_activations_shapes3d', 'revision': 'ce3531e8306b8309dd0017691c814cd97d84fac2', 'config': 't_0.6_down_blocks.4.attentions.1.to_v', 'split': 'test'}, 'priority': <Priority.LOW: 'low'>, 'difficulty': 100}
INFO: 2024-04-08 15:34:20,433 - root - compute 'split-descriptive-statistics' for dataset='eurecom-ds/scoresdeve_activations_shapes3d' config='t_0.6_down_blocks.4.attentions.1.to_v' split='test'
INFO: 2024-04-08 15:34:20,449 - root - Downloading remote parquet files to a local directory /tmp/stats-cache/70987971222717-split-descriptive-statistics-eurecom-ds-scoresdev-5a53a11c.
INFO: 2024-04-08 15:34:20,449 - root - Sleep during 1 seconds to preventively mitigate rate limiting.
thread '<unnamed>' panicked at crates/polars-core/src/datatypes/field.rs:176:19:
Arrow datatype Extension("datasets.features.features.Array2DExtensionType", LargeList(Field { name: "item", data_type: LargeList(Field { name: "item", data_type: Float32, is_nullable: true, metadata: {} }), is_nullable: true, metadata: {} }), Some("[[16, 256], \"float32\"]")) not supported by Polars. You probably need to activate that data-type feature.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
ERROR: 2024-04-08 15:34:22,098 - root - quit due to an uncaught error: Arrow datatype Extension("datasets.features.features.Array2DExtensionType", LargeList(Field { name: "item", data_type: LargeList(Field { name: "item", data_type: Float32, is_nullable: true, metadata: {} }), is_nullable: true, metadata: {} }), Some("[[16, 256], \"float32\"]")) not supported by Polars. You probably need to activate that data-type feature.
Traceback (most recent call last):
  File "/src/services/worker/src/worker/loop.py", line 98, in run
    if self.has_resources() and self.process_next_job():
  File "/src/services/worker/src/worker/loop.py", line 129, in process_next_job
    job_result = job_manager.run_job()
  File "/src/services/worker/src/worker/job_manager.py", line 103, in run_job
    job_result: JobResult = self.process()
  File "/src/services/worker/src/worker/job_manager.py", line 125, in process
    job_result = self.job_runner.compute()
  File "/src/services/worker/src/worker/job_runners/split/descriptive_statistics.py", line 900, in compute
    compute_descriptive_statistics_response(
  File "/src/services/worker/src/worker/job_runners/split/descriptive_statistics.py", line 780, in compute_descriptive_statistics_response
    local_parquet_split_glob, columns=[pl.scan_parquet(local_parquet_split_glob).columns[0]]
  File "/src/services/worker/.venv/lib/python3.9/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/polars/io/parquet/functions.py", line 311, in scan_parquet
    return pl.LazyFrame._scan_parquet(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/polars/lazyframe/frame.py", line 466, in _scan_parquet
    self._ldf = PyLazyFrame.new_from_parquet(
pyo3_runtime.PanicException: Arrow datatype Extension("datasets.features.features.Array2DExtensionType", LargeList(Field { name: "item", data_type: LargeList(Field { name: "item", data_type: Float32, is_nullable: true, metadata: {} }), is_nullable: true, metadata: {} }), Some("[[16, 256], \"float32\"]")) not supported by Polars. You probably need to activate that data-type feature.
Traceback (most recent call last):
  File "/src/services/worker/src/worker/start_worker_loop.py", line 75, in <module>
    loop.run()
  File "/src/services/worker/src/worker/loop.py", line 98, in run
    if self.has_resources() and self.process_next_job():
  File "/src/services/worker/src/worker/loop.py", line 129, in process_next_job
    job_result = job_manager.run_job()
  File "/src/services/worker/src/worker/job_manager.py", line 103, in run_job
    job_result: JobResult = self.process()
  File "/src/services/worker/src/worker/job_manager.py", line 125, in process
    job_result = self.job_runner.compute()
  File "/src/services/worker/src/worker/job_runners/split/descriptive_statistics.py", line 900, in compute
    compute_descriptive_statistics_response(
  File "/src/services/worker/src/worker/job_runners/split/descriptive_statistics.py", line 780, in compute_descriptive_statistics_response
    local_parquet_split_glob, columns=[pl.scan_parquet(local_parquet_split_glob).columns[0]]
  File "/src/services/worker/.venv/lib/python3.9/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/polars/io/parquet/functions.py", line 311, in scan_parquet
    return pl.LazyFrame._scan_parquet(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/polars/lazyframe/frame.py", line 466, in _scan_parquet
    self._ldf = PyLazyFrame.new_from_parquet(
pyo3_runtime.PanicException: Arrow datatype Extension("datasets.features.features.Array2DExtensionType", LargeList(Field { name: "item", data_type: LargeList(Field { name: "item", data_type: Float32, is_nullable: true, metadata: {} }), is_nullable: true, metadata: {} }), Some("[[16, 256], \"float32\"]")) not supported by Polars. You probably need to activate that data-type feature.
ERROR: 2024-04-08 15:34:23,394 - root - Worker crashed (exit code 1) when running job_id=66140a2ce7e9f14f2970867f
Traceback (most recent call last):
  File "/src/services/worker/src/worker/main.py", line 76, in <module>
    worker_executor.start()
  File "/src/services/worker/src/worker/executor.py", line 129, in start
    loop.run_until_complete(
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/src/services/worker/src/worker/executor.py", line 37, in every
    out = func(*args, **kwargs)
  File "/src/services/worker/src/worker/executor.py", line 199, in is_worker_alive
    worker_loop_executor.stop()  # raises an error if the worker returned unexpected exit code
  File "/src/services/worker/.venv/lib/python3.9/site-packages/mirakuru/base.py", line 375, in stop
    raise ProcessFinishedWithError(self, exit_code)
mirakuru.exceptions.ProcessFinishedWithError: The process invoked by the <mirakuru.output.OutputExecutor: "/src/services/worker/.venv/bin/python /src/services/worker/src/worker/start_worker_loop.py --print-worker-state-path" 0x7f11331a2670> executor has exited with a non-zero code: 1.

Maybe related to https://github.com/huggingface/dataset-viewer/pull/2670? cc @polinaeterna and @albertvillanova for visibility.

polinaeterna commented 7 months ago

It seems to be unrelated to the pyarrow version update. The problem is that Polars cannot recognize the custom Arrow extension data types defined by the datasets library. I will work on it.

polinaeterna commented 7 months ago

I'm actually not sure it's even possible to register an extension data type in Polars :((

albertvillanova commented 7 months ago

The pyarrow update was just a patch release (from 14.0.1 to 14.0.2), so in principle no breaking changes are expected.
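
For readers following the traceback: the failure surfaces as pyo3_runtime.PanicException, which PyO3 derives from BaseException rather than Exception, so a plain `except Exception` would not catch it. The following is a minimal sketch, not the worker's actual code, of how a job wrapper could turn such a panic into a recorded job error instead of letting the process exit. `FakePanicException`, `run_job_safely`, `unsupported_extension`, and the shape of the returned dict are all hypothetical names introduced for illustration, and the error message is a shortened copy of the one in the log.

```python
import re
from typing import Callable, Optional


# Stand-in for pyo3_runtime.PanicException (assumption: used here only so the
# sketch runs without Polars). Like the real class, it derives from
# BaseException, so `except Exception` does not catch it.
class FakePanicException(BaseException):
    pass


# Hypothetical pattern: extracts the extension type name from the Polars message.
EXTENSION_NAME = re.compile(r'Arrow datatype Extension\("([^"]+)"')


def unsupported_extension(message: str) -> Optional[str]:
    """Return the Arrow extension type name mentioned in the message, if any."""
    match = EXTENSION_NAME.search(message)
    return match.group(1) if match else None


def run_job_safely(compute: Callable[[], dict]) -> dict:
    """Run a job and report panics as errors instead of crashing the caller."""
    try:
        return {"status": "ok", "content": compute()}
    except (KeyboardInterrupt, SystemExit):
        # Let genuine shutdown requests propagate.
        raise
    except BaseException as err:
        return {
            "status": "error",
            "error": type(err).__name__,
            "unsupported_extension": unsupported_extension(str(err)),
        }


def failing_compute() -> dict:
    # Shortened version of the message from the log above.
    raise FakePanicException(
        'Arrow datatype Extension("datasets.features.features.Array2DExtensionType", '
        "LargeList(...)) not supported by Polars."
    )


result = run_job_safely(failing_compute)
print(result)
```

Catching BaseException is broad, which is why the sketch re-raises KeyboardInterrupt and SystemExit first. Whether this is appropriate for the real worker depends on how it is expected to handle Rust panics.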