srossi93 opened 8 months ago
Hi, this is expected behavior since all the tensors are converted to Arrow data (the storage type behind a Dataset).
To get PyTorch tensors back, you can set the dataset format to "torch":
ds = ds.with_format("torch")
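For instance, a minimal sketch (the column values here are just illustrative):

from datasets import Dataset

ds = Dataset.from_dict({"input_ids": [[101, 2023, 102], [101, 2003, 102]]})
print(type(ds[0]["input_ids"]))  # <class 'list'> -- Arrow-backed columns come back as Python lists by default

ds = ds.with_format("torch")
print(type(ds[0]["input_ids"]))  # <class 'torch.Tensor'> once the format is set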
Thanks. Just one additional question: during the pipeline <framework> -> Arrow -> <framework>, does .with_format zero-copy the tensors, or does it make a deep copy? And is this behavior framework-dependent? Thanks again.
We do zero-copy Arrow <-> NumPy <-> PyTorch when the output dtype matches the original dtype, but for other frameworks it depends. For example, JAX doesn't allow zero-copy NumPy -> JAX at all, IIRC.
Currently tokenized data are formatted using a copy though, since tokens are stored as int32 and returned as int64 torch tensors.
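A sketch of what this means in practice (not an official recommendation; the column and dtypes below are illustrative assumptions): if the stored dtype is cast to int64 before formatting, the output dtype matches the storage and the zero-copy path described above should apply.

from datasets import Dataset, Sequence, Value

ds = Dataset.from_dict({"input_ids": [[101, 2023, 102], [101, 2003, 102]]})
ds = ds.cast_column("input_ids", Sequence(Value("int32")))  # simulate int32 storage, as for tokenized data
print(ds.features["input_ids"])  # the column is now stored as 32-bit integers

# Casting the storage to int64 makes the stored dtype match the torch output dtype,
# which should avoid the extra copy mentioned above
ds = ds.cast_column("input_ids", Sequence(Value("int64")))
ds = ds.with_format("torch")
print(ds[0]["input_ids"].dtype)  # torch.int64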
Describe the bug
I don't know if it is a bug or expected behavior, but the tensor type seems to be ignored after applying map. For example, mapping over the dataset to tokenize text with a transformers tokenizer always returns lists and ignores the return_tensors argument. If this is expected behavior (e.g., for caching/Arrow compatibility/etc.), it should be clearly documented. For example, the current documentation (see here) clearly states to "set return_tensors="np" when you tokenize your text" to get NumPy arrays.
Steps to reproduce the bug
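The original snippet is not shown here; below is a minimal sketch of the kind of script that exhibits the behavior (the tokenizer checkpoint is an illustrative assumption):

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ds = Dataset.from_dict({"text": ["a first sentence", "a second sentence"]})

# First half: calling the tokenizer directly honors return_tensors="np"
out = tokenizer(ds["text"], return_tensors="np", padding=True)
print(type(out["input_ids"]))  # <class 'numpy.ndarray'>

# Second half: the same call inside .map() returns plain lists,
# because the mapped outputs are converted to Arrow for storage
ds = ds.map(lambda batch: tokenizer(batch["text"], return_tensors="np", padding=True), batched=True)
print(type(ds[0]["input_ids"]))  # <class 'list'>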
Expected behavior
See the output from the script above; I would expect the second half to be the same as the first (i.e., arrays/tensors rather than plain lists).
Environment info
- datasets version: 2.17.1
- huggingface_hub version: 0.20.3
- fsspec version: 2023.10.0