huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

IndexError: invalid index of a 0-dim tensor. Use `tensor.item()` in Python or `tensor.item<T>()` in C++ to convert a 0-dim tensor to a number #5539

Closed aalbersk closed 1 year ago

aalbersk commented 1 year ago

Describe the bug

When a dataset contains a 0-dim tensor, formatting.py raises the following error and fails.

Traceback (most recent call last):
  File "<path>/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 501, in format_row
    return _unnest(formatted_batch)
  File "<path>/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 137, in _unnest
    return {key: array[0] for key, array in py_dict.items()}
  File "<path>/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 137, in <dictcomp>
    return {key: array[0] for key, array in py_dict.items()}
IndexError: invalid index of a 0-dim tensor. Use `tensor.item()` in Python or `tensor.item<T>()` in C++ to convert a 0-dim tensor to a number

Steps to reproduce the bug

Load any dataset and add a transform method that returns a 0-dim tensor, or create/find a dataset containing a 0-dim tensor. E.g.

from datasets import load_dataset
import torch

dataset = load_dataset("lambdalabs/pokemon-blip-captions", split='train')
def t(batch):
    # returns a 0-dim (scalar) tensor instead of a sequence, triggering the IndexError
    return {"test": torch.tensor(1)}

dataset.set_transform(t)
d_0 = dataset[0]

Expected behavior

Extractor will correctly get a row from the dataset, even if it contains 0-dim tensor.

Environment info

datasets==2.8.0, but it looks like it also applies to the main branch (as of 16th February)

mariosasko commented 1 year ago

Hi! The set_transform does not apply a custom formatting transform to a single example but to the entire batch, so the fixed version of your transform would look as follows:

from datasets import load_dataset
import torch

dataset = load_dataset("lambdalabs/pokemon-blip-captions", split='train')
def t(batch):
    # return one value per example in the batch, not a single scalar;
    # len(batch[next(iter(batch))]) is the length of the first column, i.e. the batch size
    return {"test": torch.tensor([1] * len(batch[next(iter(batch))]))}

dataset.set_transform(t)
d_0 = dataset[0]
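To see the contract without torch, here is a minimal pure-Python sketch (names are illustrative, not part of the library) of what set_transform expects: the transform receives a dict of columns, each a sequence of batch-size length, and must return a dict of equally sized sequences so a single row can be extracted by index.

```python
# Minimal sketch of the batch contract: the transform receives a dict of
# columns and must return a dict of equal-length sequences, never bare scalars.
def transform(batch):
    batch_size = len(next(iter(batch.values())))  # length of any column
    return {"test": [1] * batch_size}

# Formatting a single row then takes element 0 of every returned column,
# which is what _unnest does in the traceback above:
batch = {"image": ["img0", "img1"], "text": ["a", "b"]}
row = {key: column[0] for key, column in transform(batch).items()}
# row == {"test": 1}
```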

Still, the formatter's error message should mention that a dict of sequences is expected as the returned value (not just a dict) to make debugging easier.

Plutone11011 commented 1 year ago

I can take this

mariosasko commented 1 year ago

Fixed in #5553

aalbersk commented 1 year ago

> Hi! The set_transform does not apply a custom formatting transform to a single example but to the entire batch, so the fixed version of your transform would look as follows: [...]

Ok, will change it according to the suggestion. Thanks for the reply!