Why doesn't return_tensors='pt' work?

bw-wang19 commented 6 days ago

Describe the bug

I tried to add input_ids to a dataset with map(), and I passed return_tensors='pt', but the values I get back are of type list. Why?

Steps to reproduce the bug

Expected behavior

Sorry for the basic question, I'm new to this tool. I expected it to return tensors, since I passed return_tensors='pt'. When I tokenize a single sentence with tokenized_input = tokenizer(input, return_tensors='pt'), it does return tensors. Why doesn't it work in map()?

Environment info

transformers>=4.41.2,<=4.45.0
datasets>=2.16.0,<=2.21.0
accelerate>=0.30.1,<=0.34.2
peft>=0.11.1,<=0.12.0
trl>=0.8.6,<=0.9.6
gradio>=4.0.0
pandas>=2.0.0
scipy
einops
sentencepiece
tiktoken
protobuf
uvicorn
pydantic
fastapi
sse-starlette
matplotlib>=3.7.0
fire
packaging
pyyaml
numpy<2.0.0

lhoestq commented 3 days ago

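Hi! datasets uses Arrow as its storage backend, which is agnostic to deep learning frameworks like torch. So whatever your map() function returns is stored in Arrow, and by default it is read back as plain Python objects. If you want to get torch tensors back, you need to call dataset = dataset.with_format("torch")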

bw-wang19 commented 3 days ago

It does work! Thanks for your suggestion!

huggingface / datasets