huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Why doesn't return_tensors='pt' work? #7291

Open bw-wang19 opened 6 days ago

bw-wang19 commented 6 days ago

Describe the bug

I tried to add input_ids to a dataset with map(), passing return_tensors='pt' to the tokenizer, but the values I get back are of type list. Why?

Steps to reproduce the bug

[screenshot]

Expected behavior

Sorry for the basic question, I'm new to this tool. I expected a tensor back, since I passed return_tensors='pt'. When I tokenize a single sentence with tokenized_input = tokenizer(input, return_tensors='pt'), it does return tensors. Why doesn't it work in map()?

Environment info

transformers>=4.41.2,<=4.45.0
datasets>=2.16.0,<=2.21.0
accelerate>=0.30.1,<=0.34.2
peft>=0.11.1,<=0.12.0
trl>=0.8.6,<=0.9.6
gradio>=4.0.0
pandas>=2.0.0
scipy
einops
sentencepiece
tiktoken
protobuf
uvicorn
pydantic
fastapi
sse-starlette
matplotlib>=3.7.0
fire
packaging
pyyaml
numpy<2.0.0

lhoestq commented 3 days ago

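Hi! datasets uses Arrow as its storage backend, which is agnostic to deep learning frameworks like torch. Whatever map() returns is converted to Arrow data, so tensors come back as plain Python lists by default. If you want to get torch tensors back, you need to call dataset = dataset.with_format("torch").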

bw-wang19 commented 3 days ago

It does work! Thanks for your suggestion!