huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Offset overflow when slicing a big dataset with an array of indices in Pyarrow >= 1.0.0 #615

Closed lhoestq closed 3 years ago

lhoestq commented 4 years ago

How to reproduce:

from datasets import load_dataset

wiki = load_dataset("wikipedia", "20200501.en", split="train")
wiki[[0]]

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-13-381aedc9811b> in <module>
----> 1 wikipedia[[0]]

~/Desktop/hf/nlp/src/datasets/arrow_dataset.py in __getitem__(self, key)
   1069             format_columns=self._format_columns,
   1070             output_all_columns=self._output_all_columns,
-> 1071             format_kwargs=self._format_kwargs,
   1072         )
   1073 

~/Desktop/hf/nlp/src/datasets/arrow_dataset.py in _getitem(self, key, format_type, format_columns, output_all_columns, format_kwargs)
   1037                 )
   1038             else:
-> 1039                 data_subset = self._data.take(indices_array)
   1040 
   1041             if format_type is not None:

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.take()

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/compute.py in take(data, indices, boundscheck)
    266     """
    267     options = TakeOptions(boundscheck)
--> 268     return call_function('take', [data, indices], options)
    269 
    270 

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/_compute.pyx in pyarrow._compute.call_function()

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/_compute.pyx in pyarrow._compute.Function.call()

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: offset overflow while concatenating arrays

It seems to work fine with small datasets or with pyarrow 0.17.1.
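
For context, here is a minimal pure-pyarrow sketch of what looks like the same failure mode (the sizes are illustrative and it needs several GB of RAM): the default string type stores 32-bit offsets, so flattening chunks whose combined character data exceeds ~2 GB overflows, while large_string (64-bit offsets) does not.

import pyarrow as pa

# ~1 GB of string data per chunk: 1M strings of 1024 bytes each.
chunk = pa.array(["x" * 1024] * 1_000_000)
chunked = pa.chunked_array([chunk] * 3)  # ~3 GB total; fine while it stays chunked

try:
    # Flattening needs a single int32 offsets buffer, which cannot address > 2 GB.
    chunked.combine_chunks()
except pa.ArrowInvalid as e:
    print(e)  # offset overflow while concatenating arrays

# Casting to large_string switches to 64-bit offsets, so the same flatten succeeds.
combined = chunked.cast(pa.large_string()).combine_chunks()
print(len(combined))  # 3000000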

joeddav commented 4 years ago

Related: https://issues.apache.org/jira/browse/ARROW-9773

It's definitely a size thing. I took a smaller dataset with 87000 rows and did:

import pyarrow as pa  # dset below is the 87,000-row dataset mentioned above

for i in range(10, 1000, 20):
    table = pa.concat_tables([dset._data] * i)
    table.take([0])

and it broke at around i=300.

Also when _indices is not None, this breaks indexing by slice. E.g. dset.shuffle()[:1] breaks.

Luckily, so far I haven't seen _indices.column(0).take break, which means it doesn't affect select or anything like that where the speed really matters; it's just _getitem. So I'm currently working around it by falling back to the arrow v0 method in _getitem:

# if PYARROW_V0:
data_subset = pa.concat_tables(
    self._data.slice(indices_array[i].as_py(), 1) for i in range(len(indices_array))
)
# else:
#     data_subset = self._data.take(indices_array)
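
A self-contained sketch of the same slice-and-concat idea on a plain pyarrow Table, for anyone reading along (the helper name is just illustrative, not part of datasets):

import pyarrow as pa

def take_by_slicing(table: pa.Table, indices) -> pa.Table:
    # Take rows one at a time as zero-copy slices and concatenate the slices;
    # the result stays chunked, so no large string buffers get concatenated.
    return pa.concat_tables(table.slice(int(i), 1) for i in indices)

t = pa.table({"text": ["a", "b", "c"]})
print(take_by_slicing(t, [2, 0]).to_pydict())  # {'text': ['c', 'a']}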

lhoestq commented 3 years ago

Let me know if you meet other offset overflow issues @joeddav

Cppowboy commented 2 years ago

Will this problem be solved in a newer version?

lhoestq commented 2 years ago

This specific issue has been fixed in https://github.com/huggingface/datasets/pull/645

If you still have this error, could you open a new issue and explain how to reproduce the error ?

bestpredicts commented 1 year ago

Same error here in version 2.1.0.

nishanthcgit commented 1 year ago

Facing the same issue. Steps to reproduce below (the dataset is a few GB, so maybe try it in Colab). Datasets version: 2.11.0

import datasets
import re

ds = datasets.load_dataset('nishanthc/dnd_map_dataset_v0.1', split = 'train')

def get_text_caption(example):
    regex_pattern = r'\s\d+x\d+|,\sLQ|,\sgrid|\.\w+$'
    example['text_caption'] = re.sub(regex_pattern, '', example['picture_text'])
    return example

ds = ds.map(get_text_caption)

I am trying to apply a regex to remove certain patterns from a text column. Not sure why this error is showing up.

afogarty85 commented 1 year ago

Got this error on a very large dataset (900M rows, 35 cols) while performing a similar batched map operation.

lhoestq commented 1 year ago

There is a solution that has been proposed here: https://github.com/huggingface/datasets/issues/5783
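
In the meantime, here is a hedged sketch of one mitigation that is often suggested for this error during map (an assumption on my part, not necessarily what the linked issue proposes): lower writer_batch_size so each Arrow table the writer builds and combines stays well under the 32-bit offset limit of the default string/list types. The dataset name below is just a placeholder.

from datasets import load_dataset

ds = load_dataset("imdb", split="train")  # placeholder dataset for illustration

def process(batch):
    batch["text"] = [t.lower() for t in batch["text"]]
    return batch

# writer_batch_size (default 1000) controls how many rows go into each Arrow
# table that gets written and combined; smaller values keep every combined
# column comfortably below ~2 GB.
ds = ds.map(process, batched=True, batch_size=1000, writer_batch_size=100)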

aihao2000 commented 1 year ago

@lhoestq I ran into this problem with load_dataset. What should I do?

lhoestq commented 12 months ago

What version of datasets are you using? Feel free to open a new issue with some details (e.g. what dataset you loaded, what code you ran, etc.)

aihao2000 commented 12 months ago

@lhoestq It's been solved, thanks.

jpiabrantes commented 5 months ago

I am facing this problem.

Here's my code:

import torch  # model, tokenizer and tokenized_datasets are assumed to be defined earlier (not shown)

model.eval()
model.to('cuda')
block_size = tokenizer.model_max_length
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could add padding if the model supported it
    # instead of this drop. You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    with torch.no_grad():
        input_ids = torch.tensor(result["input_ids"]).to('cuda')
        attention_mask = torch.tensor(result["attention_mask"]).to('cuda')
        r = model.forward(input_ids=input_ids, attention_mask=attention_mask)
    result["labels"] = r.logits.cpu().numpy().tolist()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=1,
)

This works for a few iterations and then gives the error:

Traceback (most recent call last):
  File "/home/jpiabrantes/rosetta/tmp.py", line 45, in <module>
    lm_datasets = tokenized_datasets.map(
  File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/datasets/dataset_dict.py", line 868, in map
    {
  File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/datasets/dataset_dict.py", line 869, in <dictcomp>
    k: dataset.map(
  File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3105, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3501, in _map_single
    writer.write_batch(batch)
  File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/datasets/arrow_writer.py", line 571, in write_batch
    self.write_table(pa_table, writer_batch_size)
  File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/datasets/arrow_writer.py", line 583, in write_table
    pa_table = pa_table.combine_chunks()
  File "pyarrow/table.pxi", line 3638, in pyarrow.lib.Table.combine_chunks
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

lhoestq commented 5 months ago

Hi! What version of pyarrow are you using? Also, what are the lengths of your texts?

jpiabrantes commented 5 months ago

@lhoestq pyarrow version: 15.0.2

The texts are 1024 tokens long.
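
For what it's worth, a back-of-envelope sketch of why the labels column from group_texts above could hit the limit (the vocabulary size is an assumption; it is not stated in this thread): storing full logits means roughly rows x block_size x vocab_size list elements per written batch, which blows past the 2**31 - 1 element limit of Arrow's default 32-bit list offsets.

block_size = 1024          # tokens per example, as stated above
vocab_size = 32_000        # assumed; depends on the tokenizer
writer_batch_size = 1_000  # datasets' default number of rows per written table

elements = writer_batch_size * block_size * vocab_size
print(f"{elements:,} float elements per written batch")  # 32,768,000,000
print(elements > 2**31 - 1)  # True

If that is what is happening, a smaller writer_batch_size (or not storing the full logits) should keep each written table under the limit.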

dirtycomputer commented 4 months ago

import pandas as pd
from datasets import Dataset,Image

# Read the CSV file
df = pd.read_csv("MedMQ-2k/metadata.csv")
# Create a Hugging Face Dataset
dataset = Dataset.from_pandas(df)

dataset = dataset.map(lambda example: {"image": example["file_name"]}, batched=True)
# Convert the file_name column to Image type
dataset = dataset.cast_column("image", Image())

# Upload to Hugging Face Hub (make sure authentication is set up)
dataset.push_to_hub("MedMLLM-attack/3MAD-24K", num_shards=16)

Same problem here.

dirtycomputer commented 4 months ago

Problem solved by using dataset splits, but I don't know what the difference is between a "split" and a "subset".