huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

DuckDB parsing errors if a column name has quotes #1985

Open lhoestq opened 1 year ago

lhoestq commented 1 year ago

Not super important, but this affects the split-descriptive-statistics and split-duckdb-index jobs.

For example, this dataset has a fairly long column name that contains quotes, and it raises this error:

INFO: 2023-10-16 12:27:43,231 - root - Compute descriptive statistics for dataset='lunaluan/chatbox3_history', config='default', split='train'
INFO: 2023-10-16 12:27:43,233 - root - Downloading remote parquet files to a local directory /storage/stats-cache/81004481536938-split-descriptive-statistics-lunaluan-chatbox3_hi-0d92723b.
Downloading 0000.parquet:   0%|          | 0.00/6.75k [00:00<?, ?B/s]
Downloading 0000.parquet: 100%|██████████| 6.75k/6.75k [00:00<00:00, 5.67MB/s]
INFO: 2023-10-16 12:27:43,912 - root - Loading data into in-memory table.
ERROR: 2023-10-16 12:27:44,068 - root - Parser Error: syntax error at or near "detail"
LINE 2: ... over his Professor's face. Mention "in detail description" how the professor ...
                                                   ^
Traceback (most recent call last):
  File "/src/services/worker/src/worker/job_manager.py", line 168, in process
    job_result = self.job_runner.compute()
  File "/src/services/worker/src/worker/job_runners/split/descriptive_statistics.py", line 591, in compute
    compute_descriptive_statistics_response(
  File "/src/services/worker/src/worker/job_runners/split/descriptive_statistics.py", line 485, in compute_descriptive_statistics_response
    con.sql(
duckdb.ParserException: Parser Error: syntax error at or near "detail"
LINE 2: ... over his Professor's face. Mention "in detail description" how the professor ...
                                                   ^
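
The parser error is what one would expect when a column name is placed inside double quotes in the SQL text without escaping the double quotes it already contains. A minimal sketch of the usual remedy, assuming standard SQL identifier quoting (doubling each embedded double quote, which DuckDB accepts): the helper name `quote_identifier`, the shortened column name, and the table name `data` are all hypothetical and not taken from the worker code.

```python
def quote_identifier(name: str) -> str:
    """Wrap a column name in double quotes, doubling any embedded double quote."""
    return '"' + name.replace('"', '""') + '"'


# Shortened stand-in for the problematic column name (hypothetical example).
column = 'Mention "in detail description" how'

# The table name `data` is an assumption made for illustration only.
query = f"SELECT {quote_identifier(column)} FROM data"
print(query)
```

With this escaping, the embedded quotes stay part of the identifier instead of ending it early.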

albertvillanova commented 1 year ago

Thanks for pointing this out, @lhoestq.

On the one hand, should we support column names with quotes?

On the other hand, this specific dataset just contains a CSV without a header row: the column name is actually text content.

albertvillanova commented 1 year ago

I agree that we could at least catch this error, as we already do in /filter, and raise a specific error.
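
A rough sketch of what catching the parser error and raising a specific one could look like. The names `ColumnNameParsingError` and `run_with_specific_error` are hypothetical, and the exception type to catch is passed in as a parameter so the sketch does not depend on any particular library being installed:

```python
class ColumnNameParsingError(Exception):
    """Hypothetical error raised when the SQL query cannot be parsed."""


def run_with_specific_error(run_query, parser_error_type):
    """Call run_query() and translate parser errors into a specific error."""
    try:
        return run_query()
    except parser_error_type as err:
        # Keep the original exception as the cause so the details are not lost.
        raise ColumnNameParsingError(
            "The SQL query could not be parsed; a column name may contain "
            "unsupported characters."
        ) from err
```

With DuckDB, one would presumably pass `duckdb.ParserException` (the type shown in the traceback) as `parser_error_type`.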

lhoestq commented 1 year ago

No need to spend time on this, imo. I mostly created this issue to save time the next time we see a similar problem. But yeah, a clearer error message would be better.

polinaeterna commented 1 year ago

I think we should actually disallow column names with unusual characters. I'd do something similar to the validation in the /filter feature (disallowing ";", "--", r"/\*", r"\*/"). I wanted to work on this.
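
A minimal sketch of such a check, assuming the four patterns listed above are treated as regular expressions. The names `validate_column_names` and `UnsupportedColumnNameError` are hypothetical, and adding the double quote to the forbidden list is an extra assumption, made only because it is the character that triggered this issue:

```python
import re

# Patterns quoted in the comment above, treated as regular expressions.
FORBIDDEN_PATTERNS = [";", "--", r"/\*", r"\*/"]
# Assumption: the double quote is added because it caused the error in this issue.
EXTRA_FORBIDDEN_PATTERNS = ['"']

_FORBIDDEN_RE = re.compile("|".join(FORBIDDEN_PATTERNS + EXTRA_FORBIDDEN_PATTERNS))


class UnsupportedColumnNameError(ValueError):
    """Hypothetical error for column names with disallowed character sequences."""


def validate_column_names(column_names):
    """Raise UnsupportedColumnNameError for the first name with a forbidden pattern."""
    for name in column_names:
        match = _FORBIDDEN_RE.search(name)
        if match:
            raise UnsupportedColumnNameError(
                f"Column name {name!r} contains unsupported sequence {match.group()!r}."
            )
```

Running this before building any SQL would let the job fail early with a readable message instead of a parser error.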