huggingface / datasets


pyarrow.lib.ArrowInvalid: Unable to merge: Field <field> has incompatible types #5692

Open · cyanic-selkie opened 1 year ago

cyanic-selkie commented 1 year ago

Describe the bug

When loading the wikianc-en dataset, which I created using this code, I get the following error:

Traceback (most recent call last):
  File "/home/sven/code/rector/answer-detection/train.py", line 106, in <module>
    (dataset, weights) = get_dataset(args.dataset, tokenizer, labels, args.padding)
  File "/home/sven/code/rector/answer-detection/dataset.py", line 106, in get_dataset
    dataset = load_dataset("cyanic-selkie/wikianc-en")
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/load.py", line 1794, in load_dataset
    ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/builder.py", line 1106, in as_dataset
    datasets = map_nested(
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 443, in map_nested
    mapped = [
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 444, in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 346, in _single_map_nested
    return function(data_struct)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/builder.py", line 1136, in _build_single_dataset
    ds = self._as_dataset(
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/builder.py", line 1207, in _as_dataset
    dataset_kwargs = ArrowReader(cache_dir, self.info).read(
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/arrow_reader.py", line 239, in read
    return self.read_files(files=files, original_instructions=instructions, in_memory=in_memory)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/arrow_reader.py", line 260, in read_files
    pa_table = self._read_files(files, in_memory=in_memory)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/arrow_reader.py", line 203, in _read_files
    pa_table = concat_tables(pa_tables) if len(pa_tables) != 1 else pa_tables[0]
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1808, in concat_tables
    return ConcatenationTable.from_tables(tables, axis=axis)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1514, in from_tables
    return cls.from_blocks(blocks)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1427, in from_blocks
    table = cls._concat_blocks(blocks, axis=0)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1373, in _concat_blocks
    return pa.concat_tables(pa_tables, promote=True)
  File "pyarrow/table.pxi", line 5224, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unable to merge: Field paragraph_anchors has incompatible types: list<: struct<start: uint32 not null, end: uint32 not null, qid: uint32, pageid: uint32, title: string not null> not null> vs list<item: struct<start: uint32, end: uint32, qid: uint32, pageid: uint32, title: string>>

This only happens when I load the train split, indicating that the size of the dataset is the deciding factor.
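Note that the two types in the error differ only in the list item name and in the not null markers, which points at cached shards written with inconsistent schemas rather than at the data itself. A diagnostic sketch, not from the thread, to compare the shards' schemas; the cache location and the cyanic-selkie___wikianc-en directory name are assumptions, adjust them for your machine:

import glob
import os

import pyarrow as pa

# Assumed cache layout; the "namespace___name" directory pattern may differ.
cache_glob = os.path.expanduser(
    "~/.cache/huggingface/datasets/cyanic-selkie___wikianc-en/**/*.arrow"
)
for path in glob.glob(cache_glob, recursive=True):
    # datasets writes cached shards in the Arrow IPC streaming format.
    with pa.memory_map(path) as source:
        schema = pa.ipc.open_stream(source).schema
    print(path)
    print("   paragraph_anchors:", schema.field("paragraph_anchors").type)

If one shard reports not null fields while another does not, that mismatch is exactly what the pa.concat_tables(..., promote=True) call in the traceback refuses to merge, which would also fit the report further down that upgrading pyarrow resolves it.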

Steps to reproduce the bug

from datasets import load_dataset

dataset = load_dataset("cyanic-selkie/wikianc-en", split="train")

Expected behavior

The dataset should load normally without any errors.

Environment info

mariosasko commented 1 year ago

Hi! The link pointing to the code that generated the dataset is broken. Can you please fix it to make debugging easier?

cyanic-selkie commented 1 year ago

> Hi! The link pointing to the code that generated the dataset is broken. Can you please fix it to make debugging easier?

Sorry about that, it's fixed now.

MingsYang commented 1 year ago

@cyanic-selkie Could you explain how you fixed it? I ran into the same error when loading other datasets. Could it be due to the library versions in the environment?

cyanic-selkie commented 1 year ago

@MingsYang I never fixed it. If you're referring to my comment above, I only meant I fixed the link to my code.

Anyway, I managed to work around the issue by using streaming when loading the dataset.
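For reference, a minimal sketch of that workaround; streaming=True returns an IterableDataset that reads the data files directly, so the concatenation of cached Arrow shards that raises ArrowInvalid never runs:

from datasets import load_dataset

# Streaming skips the Arrow cache, and with it the failing schema merge.
dataset = load_dataset("cyanic-selkie/wikianc-en", split="train", streaming=True)

for example in dataset:
    print(example["paragraph_anchors"])
    break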

MingsYang commented 1 year ago

@cyanic-selkie Hmm, I see. I just tried a newer Python environment, and it shows no errors anymore.

ThyrixYang commented 8 months ago

Upgrading pyarrow to the latest version solved this problem in my case.
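In case it helps others, a quick way to check which version you are on; the exact minimum version that fixes the merge is not pinned down in this thread:

import pyarrow

# If this prints an old version, upgrade with:
#   python -m pip install --upgrade pyarrow
print(pyarrow.__version__)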