huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Dataset Viewer issue for claritylab/UTCD #1176

Closed StefanHeng closed 1 year ago

StefanHeng commented 1 year ago

Link

https://huggingface.co/datasets/claritylab/UTCD

Description

The dataset viewer is not working for dataset claritylab/UTCD.

Error details:

Error code:   TooManyColumnsError

I'm having trouble getting the dataset viewer to work.

I did a bit of research, and it looks like the only way to get the dataset viewer to work is to call push_to_hub so that there are .parquet versions of my dataset in the ref/convert/parquet branch.

However, I already have a dataset loading script with JSON files in place. I think I can both 1) keep the existing dataset loading script and 2) add the Parquet version of my dataset to the branch, since squad_v2 appears to be an example of that.

So I created a branch and pushed my dataset again via Python:

from datasets import load_dataset
from huggingface_hub import create_branch

# Create the branch, then push the 'in-domain' config to it
create_branch('claritylab/utcd', repo_type='dataset', branch='ref/convert/parquet')

dataset = load_dataset('claritylab/utcd', name='in-domain')
dataset.push_to_hub('claritylab/utcd', branch='ref/convert/parquet')

The code ran without errors and produced the output below:

Downloading readme: 100%|██████████| 8.40k/8.40k [00:00<00:00, 1.41MB/s]
Found cached dataset utcd (/Users/stefanhg/.cache/huggingface/datasets/claritylab___utcd/aspect-normalized-in-domain/0.0.1/fe244a6f1dd95dfe9df993724e1b1ddb699c1900c2edb11a3380c7a2f6b78beb)
100%|██████████| 3/3 [00:00<00:00, 191.17it/s]
Pushing split train to the Hub.
Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format:   0%|          | 0/116 [00:00<?, ?ba/s]
Creating parquet from Arrow format:  35%|███▌      | 41/116 [00:00<00:00, 376.94ba/s]
Creating parquet from Arrow format: 100%|██████████| 116/116 [00:00<00:00, 316.08ba/s]

Upload 1 LFS files:   0%|          | 0/1 [00:00<?, ?it/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:11<00:00, 11.18s/it]
Pushing dataset shards to the dataset hub: 100%|██████████| 1/1 [00:12<00:00, 12.06s/it]
Pushing split validation to the Hub.
Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format: 100%|██████████| 13/13 [00:00<00:00, 469.34ba/s]

Upload 1 LFS files:   0%|          | 0/1 [00:00<?, ?it/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:01<00:00,  1.42s/it]
Pushing dataset shards to the dataset hub: 100%|██████████| 1/1 [00:01<00:00,  1.85s/it]
Pushing split test to the Hub.
Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format:   0%|          | 0/169 [00:00<?, ?ba/s]
Creating parquet from Arrow format:  27%|██▋       | 45/169 [00:00<00:00, 449.16ba/s]
Creating parquet from Arrow format:  53%|█████▎    | 90/169 [00:00<00:00, 429.39ba/s]
Creating parquet from Arrow format: 100%|██████████| 169/169 [00:00<00:00, 530.78ba/s]

Upload 1 LFS files:   0%|          | 0/1 [00:00<?, ?it/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:13<00:00, 13.75s/it]
Pushing dataset shards to the dataset hub: 100%|██████████| 1/1 [00:14<00:00, 14.59s/it]

But I don't see any difference on my dataset repository page: no branch named ref/convert/parquet is available, and thus there is nothing in the branch.

Please help. Thank you!

albertvillanova commented 1 year ago

Thanks for reporting, @StefanHeng.

Please note that the ref/convert/parquet branch is an internal branch that our datasets server creates automatically from the content in your main branch. You should not try to modify it, because our datasets server will overwrite it.
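
Since the converted branch is generated by the server, one read-only way to check it is to ask the dataset viewer API which Parquet files it produced, rather than pushing to the branch. The following is a minimal sketch, not an official client. It assumes the public API is served at https://datasets-server.huggingface.co, that it exposes a /parquet endpoint taking a dataset query parameter, and that the JSON response carries a parquet_files list. The helper names parquet_listing_url and list_parquet_files are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL of the public dataset viewer API.
API_BASE = "https://datasets-server.huggingface.co"


def parquet_listing_url(dataset: str) -> str:
    """Build the URL of the /parquet endpoint for a dataset (hypothetical helper)."""
    return f"{API_BASE}/parquet?{urlencode({'dataset': dataset})}"


def list_parquet_files(dataset: str) -> list:
    """Fetch the auto-converted Parquet file entries (requires network access)."""
    with urlopen(parquet_listing_url(dataset)) as response:
        payload = json.load(response)
    # Assumption: the response stores the entries under "parquet_files".
    return payload.get("parquet_files", [])
```

Calling list_parquet_files('claritylab/UTCD') would list the converted files only if the server-side conversion succeeded. While the viewer reports an error such as TooManyColumnsError, the request may fail with an HTTP error instead.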

albertvillanova commented 1 year ago

@StefanHeng I have created an issue in the Community tab of your dataset: https://huggingface.co/datasets/claritylab/UTCD/discussions/1 Let's continue the discussion there!