huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Tensor type (e.g. from `return_tensors`) ignored in map #6688

Open · srossi93 opened 8 months ago

srossi93 commented 8 months ago

Describe the bug

I don't know if this is a bug or expected behavior, but the tensor type seems to be ignored after applying map. For example, mapping a transformers tokenizer over a dataset to tokenize text always returns lists and ignores the return_tensors argument.

If this is expected behavior (e.g., for caching/Arrow compatibility/etc.), it should be clearly documented. For example, the current documentation (see here) clearly states to "set return_tensors="np" when you tokenize your text" to get NumPy arrays.

Steps to reproduce the bug

# %%

import datasets
import numpy as np
import tensorflow as tf
import torch
from transformers import AutoTokenizer

# %%
# Load a 1% slice of CNN/DailyMail and a pretrained BERT tokenizer.
ds = datasets.load_dataset("cnn_dailymail", "1.0.0", split="train[:1%]")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# %%
# Calling the tokenizer directly: return_tensors is honored.
for return_tensors in [None, "np", "pt", "tf", "jax"]:
  print(f"********** no map, return_tensors={return_tensors} **********")
  _ds = tokenizer(ds["article"], return_tensors=return_tensors, truncation=True, padding=True)
  print('Type <input_ids>:', type(_ds["input_ids"]))

# %%
# Tokenizing through Dataset.map: return_tensors appears to be ignored.
for return_tensors in [None, "np", "pt", "tf", "jax"]:
  print(f"********** map, return_tensors={return_tensors} **********")
  _ds = ds.map(
    lambda examples: tokenizer(examples["article"], return_tensors=return_tensors, truncation=True, padding=True),
    batched=True,
    remove_columns=["article"],
  )

  print('Type <input_ids>:', type(_ds[0]["input_ids"]))

Expected behavior

Below is the output of the script above. I would expect the second half (with map) to match the first.

********** no map, return_tensors=None **********
Type <input_ids>: <class 'list'>
********** no map, return_tensors=np **********
Type <input_ids>: <class 'numpy.ndarray'>
********** no map, return_tensors=pt **********
Type <input_ids>: <class 'torch.Tensor'>
********** no map, return_tensors=tf **********
Type <input_ids>: <class 'tensorflow.python.framework.ops.EagerTensor'>
********** no map, return_tensors=jax **********
Type <input_ids>: <class 'jaxlib.xla_extension.ArrayImpl'>

********** map, return_tensors=None **********
Type <input_ids>: <class 'list'>
********** map, return_tensors=np **********
Type <input_ids>: <class 'list'>
********** map, return_tensors=pt **********
Type <input_ids>: <class 'list'>
********** map, return_tensors=tf **********
Type <input_ids>: <class 'list'>
********** map, return_tensors=jax **********
Type <input_ids>: <class 'list'>

Environment info

lhoestq commented 8 months ago

Hi, this is expected behavior since all the tensors are converted to Arrow data (the storage type behind a Dataset).

To get PyTorch tensors back, you can set the dataset format to "torch":

ds = ds.with_format("torch")
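
For example, a minimal sketch continuing the reproduction loop above (assuming _ds is the mapped dataset from the report; with_format also accepts "numpy", "tensorflow", and "jax"):

# Map stores plain Arrow data; formatting converts on access.
_ds = _ds.with_format("torch")
print(type(_ds[0]["input_ids"]))  # <class 'torch.Tensor'>
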
srossi93 commented 8 months ago

Thanks. Just one additional question: during the pipeline <framework> -> Arrow -> <framework>, does .with_format zero-copy the tensors, or does it make a deep copy? And is this behavior framework-dependent?

Thanks again.

lhoestq commented 8 months ago

We do zero-copy Arrow <-> NumPy <-> PyTorch conversions when the output dtype matches the original dtype, but for other frameworks it depends. For example, JAX doesn't allow zero-copy NumPy -> JAX at all, IIRC.
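
To illustrate the dtype condition with plain NumPy and PyTorch (a standalone sketch, not the datasets internals):

import numpy as np
import torch

arr = np.arange(6, dtype=np.int64)

# Same dtype: torch.from_numpy wraps the existing buffer, no copy.
t = torch.from_numpy(arr)
print(np.shares_memory(arr, t.numpy()))  # True

# Different dtype: the conversion has to materialize a new buffer.
t32 = torch.tensor(arr, dtype=torch.int32)
print(np.shares_memory(arr, t32.numpy()))  # False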

Currently, though, tokenized data are formatted with a copy, since the tokens are stored as int32 and returned as int64 torch tensors.
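
One way to sidestep that copy (a sketch, assuming the tokenized columns are plain integer sequences) is to cast them to int64 in Arrow before formatting, so the dtypes already match:

from datasets import Sequence, Value

# Hypothetical follow-up to the map example above: store the tokens as
# int64 so the NumPy -> PyTorch step can stay copy-free.
_ds = _ds.cast_column("input_ids", Sequence(Value("int64")))
_ds = _ds.cast_column("attention_mask", Sequence(Value("int64")))
_ds = _ds.with_format("torch")
print(_ds[0]["input_ids"].dtype)  # torch.int64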