huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Datasets does not load HuggingFace Repository properly #6410

Open MikeDoes opened 8 months ago

MikeDoes commented 8 months ago

Describe the bug

Dear Datasets team,

We have just published a dataset on Hugging Face: https://huggingface.co/ai4privacy

However, when trying to read it with the datasets library, we get an error. As we understand it, jsonl files are supported, so could you please clarify how we can resolve the issue? We would be more than happy to adapt the structure of the repository or its metadata to make it easier to load:

from datasets import load_dataset
dataset = load_dataset("ai4privacy/pii-masking-200k")
Downloading readme: 100%
11.8k/11.8k [00:00<00:00, 512kB/s]
Downloading data files: 100%
1/1 [00:11<00:00, 11.16s/it]
Downloading data: 100%
64.3M/64.3M [00:02<00:00, 32.9MB/s]
Downloading data: 100%
113M/113M [00:03<00:00, 35.0MB/s]
Downloading data: 100%
97.7M/97.7M [00:02<00:00, 46.1MB/s]
Downloading data: 100%
90.8M/90.8M [00:02<00:00, 44.9MB/s]
Downloading data: 100%
7.63k/7.63k [00:00<00:00, 41.0kB/s]
Downloading data: 100%
1.03k/1.03k [00:00<00:00, 9.44kB/s]
Extracting data files: 100%
1/1 [00:00<00:00, 29.26it/s]
Generating train split:
209261/0 [00:05<00:00, 41201.25 examples/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1939                         )
-> 1940                     writer.write_table(table)
   1941                     num_examples_progress_update += len(table)

8 frames
/usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py in write_table(self, pa_table, writer_batch_size)
    571         pa_table = pa_table.combine_chunks()
--> 572         pa_table = table_cast(pa_table, self._schema)
    573         if self.embed_local_files:

/usr/local/lib/python3.10/dist-packages/datasets/table.py in table_cast(table, schema)
   2327     if table.schema != schema:
-> 2328         return cast_table_to_schema(table, schema)
   2329     elif table.schema.metadata != schema.metadata:

/usr/local/lib/python3.10/dist-packages/datasets/table.py in cast_table_to_schema(table, schema)
   2285     if sorted(table.column_names) != sorted(features):
-> 2286         raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
   2287     arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]

ValueError: Couldn't cast
JOBTYPE: int64
PHONEIMEI: int64
ACCOUNTNAME: int64
VEHICLEVIN: int64
GENDER: int64
CURRENCYCODE: int64
CREDITCARDISSUER: int64
JOBTITLE: int64
SEX: int64
CURRENCYSYMBOL: int64
IP: int64
EYECOLOR: int64
MASKEDNUMBER: int64
SECONDARYADDRESS: int64
JOBAREA: int64
ACCOUNTNUMBER: int64
language: string
BITCOINADDRESS: int64
MAC: int64
SSN: int64
EMAIL: int64
ETHEREUMADDRESS: int64
DOB: int64
VEHICLEVRM: int64
IPV6: int64
AMOUNT: int64
URL: int64
PHONENUMBER: int64
PIN: int64
TIME: int64
CREDITCARDNUMBER: int64
FIRSTNAME: int64
IBAN: int64
BIC: int64
COUNTY: int64
STATE: int64
LASTNAME: int64
ZIPCODE: int64
HEIGHT: int64
ORDINALDIRECTION: int64
MIDDLENAME: int64
STREET: int64
USERNAME: int64
CURRENCY: int64
PREFIX: int64
USERAGENT: int64
CURRENCYNAME: int64
LITECOINADDRESS: int64
CREDITCARDCVV: int64
AGE: int64
CITY: int64
PASSWORD: int64
BUILDINGNUMBER: int64
IPV4: int64
NEARBYGPSCOORDINATE: int64
DATE: int64
COMPANYNAME: int64
to
{'masked_text': Value(dtype='string', id=None), 'unmasked_text': Value(dtype='string', id=None), 'privacy_mask': Value(dtype='string', id=None), 'span_labels': Value(dtype='string', id=None), 'bio_labels': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'tokenised_text': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
because column names don't match

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
<ipython-input-2-f1c6811e9c83> in <cell line: 3>()
      1 from datasets import load_dataset
      2 
----> 3 dataset = load_dataset("ai4privacy/pii-masking-200k")

/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   2151 
   2152     # Download and prepare data
-> 2153     builder_instance.download_and_prepare(
   2154         download_config=download_config,
   2155         download_mode=download_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    952                         if num_proc is not None:
    953                             prepare_split_kwargs["num_proc"] = num_proc
--> 954                         self._download_and_prepare(
    955                             dl_manager=dl_manager,
    956                             verification_mode=verification_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
   1047             try:
   1048                 # Prepare split will record examples associated to the split
-> 1049                 self._prepare_split(split_generator, **prepare_split_kwargs)
   1050             except OSError as e:
   1051                 raise OSError(

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
   1811             job_id = 0
   1812             with pbar:
-> 1813                 for job_id, done, content in self._prepare_split_single(
   1814                     gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1815                 ):

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1956             if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1957                 e = e.__context__
-> 1958             raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1959 
   1960         yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

Thank you, and have a great day ahead!

Steps to reproduce the bug

Open a Google Colab notebook.

Run the command: !pip3 install datasets

Run the code:
from datasets import load_dataset
dataset = load_dataset("ai4privacy/pii-masking-200k")

Expected behavior

The dataset downloads successfully from Hugging Face into the notebook so that we can start working with it.

Environment info

mariosasko commented 8 months ago

Hi! You can avoid the error by requesting only the jsonl files: dataset = load_dataset("ai4privacy/pii-masking-200k", data_files=["*.jsonl"])
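For reference, here is that workaround as a runnable snippet. The glob pattern is taken from the comment above; the assumption that all matched files end up in a single train split is ours (that is the default when data_files is passed as a plain list):

from datasets import load_dataset

# Restrict loading to the jsonl data files so the incompatible json
# metadata files in the repository are skipped.
dataset = load_dataset("ai4privacy/pii-masking-200k", data_files=["*.jsonl"])

# With data_files given as a list, all matched files land in one "train" split.
print(dataset["train"][0])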

Our data file inference does not filter out (incompatible) json files because json and jsonl use the same builder. Still, I think the inference should differentiate these extensions because it's safe to assume that loading them together will lead to an error. WDYT @lhoestq?

lhoestq commented 8 months ago

Raising an error if there is a mix of json and jsonl in the builder makes sense, yea.
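As an illustration only, here is a minimal sketch of the kind of guard being discussed. The function name, placement, and error message are hypothetical, not the actual datasets implementation:

import os

def check_no_mixed_json_extensions(data_files):
    # Collect the extensions of the resolved data files.
    extensions = {os.path.splitext(f)[1].lower() for f in data_files}
    # json and jsonl share a builder, but mixing them in one split almost
    # always fails later at schema-cast time, so fail fast with a clear hint.
    if {".json", ".jsonl"} <= extensions:
        raise ValueError(
            "Found both .json and .jsonl data files; pass data_files "
            '(e.g. data_files=["*.jsonl"]) to select a single format.'
        )

check_no_mixed_json_extensions(["train.jsonl", "stats.json"])  # raises ValueError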