Related: https://issues.apache.org/jira/browse/ARROW-9773
It's definitely a size thing. I took a smaller dataset with 87000 rows and did:

```python
for i in range(10, 1000, 20):
    table = pa.concat_tables([dset._data] * i)
    table.take([0])
```

and it broke at around i=300.
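For anyone hitting this later: the overflow comes from Arrow's plain `string`/`binary` types, whose 32-bit offsets cap a single array at roughly 2 GiB of data. Here is a minimal sketch with synthetic data (independent of the dataset above; the numbers are chosen only to cross the 2 GiB line) that reproduces the same error on recent pyarrow:

```python
import pyarrow as pa

# ~1 KB of text per row, 100k rows -> ~100 MB of string data in one chunk.
chunk = pa.table({"text": pa.array(["x" * 1024] * 100_000)})

# concat_tables is zero-copy: the column simply ends up with 25 chunks.
big = pa.concat_tables([chunk] * 25)

# Flattening needs one string array holding ~2.5 GB of character data,
# which 32-bit offsets cannot index.
big.combine_chunks()  # ArrowInvalid: offset overflow while concatenating arrays
```

Casting the column to `large_string` (64-bit offsets) makes the same flattening succeed.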
Also when `_indices` is not None, this breaks indexing by slice. E.g. `dset.shuffle()[:1]` breaks. Luckily so far I haven't seen `_indices.column(0).take` break, which means it doesn't break `select` or anything like that, which is where the speed really matters; it's just `_getitem`. So I'm currently working around it by just doing the arrow v0 method in `_getitem`:
```python
# if PYARROW_V0:
data_subset = pa.concat_tables(
    self._data.slice(indices_array[i].as_py(), 1) for i in range(len(indices_array))
)
# else:
#     data_subset = self._data.take(indices_array)
```
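(For context, as far as I understand the Arrow internals: `pa.concat_tables` only stacks the per-row slices into a chunked table without flattening them, whereas `take` on a chunked string column first concatenates the chunks into one array, and a single `string` array can't hold more than ~2 GiB of character data with its 32-bit offsets, hence the overflow.)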
Let me know if you run into other offset overflow issues @joeddav
Will this problem be solved in a newer version?
This specific issue has been fixed in https://github.com/huggingface/datasets/pull/645
If you still have this error, could you open a new issue and explain how to reproduce it?
same error here in version 2.1.0
Facing the same issue. Steps to reproduce below (the dataset is a few GB, so maybe try it in Colab). Datasets version: 2.11.0
```python
import datasets
import re

ds = datasets.load_dataset('nishanthc/dnd_map_dataset_v0.1', split='train')

def get_text_caption(example):
    regex_pattern = r'\s\d+x\d+|,\sLQ|,\sgrid|\.\w+$'
    example['text_caption'] = re.sub(regex_pattern, '', example['picture_text'])
    return example

ds = ds.map(get_text_caption)
```
I am trying to apply a regex to remove certain patterns from a text column. Not sure why this error is showing up.
Got this error on a very large data set (900m rows, 35 cols) performing a similar batch map operation.
There is a solution that has been proposed here: https://github.com/huggingface/datasets/issues/5783
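For readers hitting this in `map`, two workarounds that come up often (hedged: these are general mitigations, not necessarily the exact fix proposed in that issue) are writing smaller Arrow batches and switching string columns to 64-bit offsets. A sketch, assuming a dataset `ds` with a plain-string column named `"text"` and a batched function `my_function` (both hypothetical names):

```python
from datasets import Features, Value

# 1) Write smaller Arrow batches so no single chunk concatenation has to hold
#    more than ~2 GiB of string/binary data.
ds = ds.map(my_function, batched=True, writer_batch_size=100)

# 2) Or cast plain `string` columns to `large_string`, which uses 64-bit offsets.
features = Features({**ds.features, "text": Value("large_string")})
ds = ds.cast(features)
```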
@lhoestq I ran into this problem with `load_dataset`. What should I do?
What version of `datasets` are you using? Feel free to open a new issue with some details (e.g. what dataset you loaded, what code you ran, etc.)
@lhoestq It's been solved, thanks
I am facing this problem.
Here's my code:
```python
model.eval()
model.to('cuda')
block_size = tokenizer.model_max_length

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could add padding if the model supported it
    # instead of this drop, you can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    with torch.no_grad():
        input_ids = torch.tensor(result["input_ids"]).to('cuda')
        attention_mask = torch.tensor(result["attention_mask"]).to('cuda')
        r = model.forward(input_ids=input_ids, attention_mask=attention_mask)
    result["labels"] = r.logits.cpu().numpy().tolist()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=1,
)
```
This works for a few iterations and then gives the error:
```
Traceback (most recent call last):
  File "/home/jpiabrantes/rosetta/tmp.py", line 45, in <module>
    lm_datasets = tokenized_datasets.map(
  File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/datasets/dataset_dict.py", line 868, in map
    {
  File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/datasets/dataset_dict.py", line 869, in <dictcomp>
    k: dataset.map(
  File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3105, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3501, in _map_single
    writer.write_batch(batch)
  File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/datasets/arrow_writer.py", line 571, in write_batch
    self.write_table(pa_table, writer_batch_size)
  File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/datasets/arrow_writer.py", line 583, in write_table
    pa_table = pa_table.combine_chunks()
  File "pyarrow/table.pxi", line 3638, in pyarrow.lib.Table.combine_chunks
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
```
Hi! What version of `pyarrow` are you using? Also, what are the lengths of your texts?
@lhoestq pyarrow version: 15.0.2
The texts are 1024 tokens long.
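For what it's worth, a hedged guess (it depends on the model, which isn't shown here): with 1024-token blocks and a vocabulary in the tens of thousands, the `labels` column holds roughly 1024 × vocab_size floats per example, so a batch of 1000 examples amounts to tens of billions of values, far beyond the ~2^31 elements that 32-bit Arrow offsets can index. A sketch of one way to stay under that limit; the batch sizes are illustrative, not tuned:

```python
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=8,          # fewer examples per forward pass / per produced batch
    writer_batch_size=8,   # write small Arrow batches so no chunk combination overflows
    num_proc=1,
)
```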
```python
import pandas as pd
from datasets import Dataset, Image

# Read the CSV file
df = pd.read_csv("MedMQ-2k/metadata.csv")

# Create a Hugging Face Dataset
dataset = Dataset.from_pandas(df)
dataset = dataset.map(lambda example: {"image": example["file_name"]}, batched=True)

# Convert the file_name column to Image type
dataset = dataset.cast_column("image", Image())

# Upload to Hugging Face Hub (make sure authentication is set up)
dataset.push_to_hub("MedMLLM-attack/3MAD-24K", num_shards=16)
```
same problem here
Problem solved by using a dataset split, but I don't know what the difference is between "split" and "subset".
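Regarding the image snippet above: a hedged guess (not a confirmed diagnosis) is that `push_to_hub` embeds the image bytes into the shards, and 24K images spread over only 16 shards can put more than ~2 GiB of binary data into a single shard, which 32-bit binary offsets can't index. Asking for smaller shards keeps each shard's binary column under that limit; the value below is illustrative:

```python
# Hedged sketch: replace num_shards=16 with a max shard size so each shard
# holds well under ~2 GiB of embedded image bytes (500MB is illustrative).
dataset.push_to_hub("MedMLLM-attack/3MAD-24K", max_shard_size="500MB")
```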
How to reproduce:
It seems to work fine with small datasets or with pyarrow 0.17.1