huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.02k stars 2.63k forks source link

consumer-finance-complaints dataset not loading #5260

Open adiprasad opened 1 year ago

adiprasad commented 1 year ago

Describe the bug

Error during dataset loading

Steps to reproduce the bug

>>> import datasets
>>> cf_raw = datasets.load_dataset("consumer-finance-complaints")
Downloading builder script: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.42k/8.42k [00:00<00:00, 3.33MB/s]
Downloading metadata: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.60k/5.60k [00:00<00:00, 2.90MB/s]
Downloading readme: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16.6k/16.6k [00:00<00:00, 510kB/s]
Downloading and preparing dataset consumer-finance-complaints/default to /root/.cache/huggingface/datasets/consumer-finance-complaints/default/0.0.0/30e483d37fb4b25bb98cad1bfd2dc48f6ed6d1f3371eb4568c625a61d1a79b69...
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 511M/511M [00:04<00:00, 103MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/skunk-pod-storage-lee-2emartie-40ibm-2ecom-pvc/anaconda3/envs/datasets/lib/python3.8/site-packages/datasets/load.py", line 1741, in load_dataset
    builder_instance.download_and_prepare(
  File "/skunk-pod-storage-lee-2emartie-40ibm-2ecom-pvc/anaconda3/envs/datasets/lib/python3.8/site-packages/datasets/builder.py", line 822, in download_and_prepare
    self._download_and_prepare(
  File "/skunk-pod-storage-lee-2emartie-40ibm-2ecom-pvc/anaconda3/envs/datasets/lib/python3.8/site-packages/datasets/builder.py", line 1555, in _download_and_prepare
    super()._download_and_prepare(
  File "/skunk-pod-storage-lee-2emartie-40ibm-2ecom-pvc/anaconda3/envs/datasets/lib/python3.8/site-packages/datasets/builder.py", line 931, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/skunk-pod-storage-lee-2emartie-40ibm-2ecom-pvc/anaconda3/envs/datasets/lib/python3.8/site-packages/datasets/utils/info_utils.py", line 74, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=1605177353, num_examples=2455765, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=2043641693, num_examples=3079747, shard_lengths=[721000, 656000, 788000, 846000, 68747], dataset_name='consumer-finance-complaints')}]

Expected behavior

dataset should load

Environment info

datasets.version '2.7.0'

Python 3.8.10

"Ubuntu 20.04.4 LTS"

albertvillanova commented 1 year ago

Thanks for reporting, @adiprasad.

We are having a look at it.

albertvillanova commented 1 year ago

I have opened an issue in that dataset Community tab on the Hub: https://huggingface.co/datasets/consumer-finance-complaints/discussions/1

Please note that in the meantime, you can load the dataset by passing ignore_verifications=True:

>>> ds = load_dataset("consumer-finance-complaints", ignore_verifications=True)
>>> ds
DatasetDict({
    train: Dataset({
        features: ['Date Received', 'Product', 'Sub Product', 'Issue', 'Sub Issue', 'Complaint Text', 'Company Public Response', 'Company', 'State', 'Zip Code', 'Tags', 'Consumer Consent Provided', 'Submitted via', 'Date Sent To Company', 'Company Response To Consumer', 'Timely Response', 'Consumer Disputed', 'Complaint ID'],
        num_rows: 3079747
    })
})
albertvillanova commented 1 year ago

PR fixing this issue: https://huggingface.co/datasets/consumer-finance-complaints/discussions/2