huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Convert_to_parquet fails for datasets with multiple configs #7067

Closed. HuangZhen02 closed this issue 3 months ago

HuangZhen02 commented 4 months ago

I am using the datasets-cli convert_to_parquet command to avoid the data viewer issues caused by loading scripts. If the dataset has multiple configs, the command only converts the first config successfully. When it starts converting the second config, it throws an error:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/dl/bin/datasets-cli", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/envs/dl/lib/python3.10/site-packages/datasets/commands/datasets_cli.py", line 41, in main
    service.run()
  File "/opt/anaconda3/envs/dl/lib/python3.10/site-packages/datasets/commands/convert_to_parquet.py", line 83, in run
    dataset.push_to_hub(
  File "/opt/anaconda3/envs/dl/lib/python3.10/site-packages/datasets/dataset_dict.py", line 1713, in push_to_hub
    api.create_branch(repo_id, branch=revision, token=token, repo_type="dataset", exist_ok=True)
  File "/opt/anaconda3/envs/dl/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/anaconda3/envs/dl/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 5503, in create_branch
    hf_raise_for_status(response)
  File "/opt/anaconda3/envs/dl/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 358, in hf_raise_for_status
    raise BadRequestError(message, response=response) from e
huggingface_hub.utils._errors.BadRequestError:  (Request ID: Root=1-669fc665-7c2e80d75f4337496ee95402;731fcdc7-0950-4eec-99cf-ce047b8d003f)

Bad request:
Invalid reference for a branch: refs/pr/1

HuangZhen02 commented 4 months ago

Many users have run into the same issue, and it has caused inconvenience. See this forum thread:

https://discuss.huggingface.co/t/convert-to-parquet-fails-for-datasets-with-multiple-configs/86733

albertvillanova commented 4 months ago

Thanks for reporting.

I will make the code more robust.

albertvillanova commented 4 months ago

I have opened an issue in the huggingface-hub repo:

I am opening a PR to avoid calling create_branch if the branch already exists.
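
For readers following along, here is a minimal sketch of what such a guard could look like. It is not necessarily what the PR does. The helper name ensure_branch is hypothetical, and it assumes that api is a huggingface_hub HfApi instance (or anything exposing list_repo_refs and create_branch with the same keyword arguments). It skips create_branch both for pull request refs such as refs/pr/1, which the traceback above shows the Hub rejecting as branch names, and for branches that are already listed.

```python
def ensure_branch(api, repo_id, revision, token=None):
    """Create `revision` as a branch only when that is needed and valid.

    Hypothetical helper for illustration. Returns True if create_branch
    was called, False if the call was skipped.
    """
    if revision is None:
        # No explicit revision: push to the default branch, nothing to create.
        return False
    if revision.startswith("refs/pr/"):
        # A pull request ref is not a branch name; the Hub answers
        # "Invalid reference for a branch" if we try to create it.
        return False
    refs = api.list_repo_refs(repo_id, repo_type="dataset", token=token)
    if any(branch.name == revision for branch in refs.branches):
        # The branch already exists, so there is nothing to do.
        return False
    api.create_branch(repo_id, branch=revision, token=token, repo_type="dataset")
    return True


# Possible usage (needs network access and a valid token, so it is left
# as a comment; "user/my_dataset" is a placeholder repo id):
#
#   from huggingface_hub import HfApi
#   ensure_branch(HfApi(), "user/my_dataset", "refs/pr/1")
```

Checking the existing refs costs one extra request per push, but it keeps the decision explicit instead of relying on exist_ok=True alone.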