huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
18.82k stars 2.6k forks

map() not recognizing "text" #6287

Closed: EngineerKhan closed this issue 9 months ago

EngineerKhan commented 9 months ago

Describe the bug

The map() documentation reads: `ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True)`

I have been trying to reproduce it in my code as:

tokenizedDataset = dataset.map(lambda x: tokenizer(x['text']), batched=True)

But it doesn't work; it throws the error:

KeyError: 'text'

Can you please guide me on how to fix it?

Steps to reproduce the bug

  1. Load the dataset: `from datasets import load_dataset`, then `dataset = load_dataset("amazon_reviews_multi")`

  2. Then load the tokenizer: `from transformers import AutoTokenizer`, then `tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")`

  3. Run the line I quoted above (the one I have been trying)

Expected behavior

As shown in the documentation, it should run without error and apply the tokenization to the whole dataset.

Environment info

Python 3.10.2

mariosasko commented 9 months ago

There is no "text" column in amazon_reviews_multi, hence the KeyError. You can get the column names by running dataset.column_names.
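As a rough illustration of the point above, here is a minimal sketch in plain Python. It assumes that with `batched=True` the function given to `map()` receives a dict-like batch keyed by column name, which is why indexing a missing column raises `KeyError`. The helper `make_tokenize_fn`, the stand-in `whitespace_tokenizer`, and the column name `review_body` are all hypothetical and only there for the example; check `dataset.column_names` for the real names.

```python
# Hypothetical helper (not part of datasets): build a batched map function
# for a given text column, failing early with the available column names.
def make_tokenize_fn(tokenizer, column_names, text_column):
    column_names = list(column_names)
    if text_column not in column_names:
        raise KeyError(
            f"{text_column!r} not found; available columns: {column_names}"
        )

    def tokenize_batch(batch):
        # batch maps each column name to a list of values for that batch
        return tokenizer(batch[text_column])

    return tokenize_batch


# Stand-in for a real tokenizer so the sketch runs without downloads.
def whitespace_tokenizer(texts):
    return {"tokens": [t.split() for t in texts]}


# A toy batch shaped like what a batched map function would receive.
# "review_body" is an assumed column name, not confirmed by the thread.
batch = {"review_body": ["great product", "did not work"], "stars": [5, 1]}

tokenize = make_tokenize_fn(whitespace_tokenizer, batch.keys(), "review_body")
result = tokenize(batch)
print(result)

# Asking for a column that does not exist reproduces the reported error,
# but with the available names in the message.
try:
    make_tokenize_fn(whitespace_tokenizer, batch.keys(), "text")
except KeyError as err:
    print("KeyError:", err)
```

With a real dataset, the same idea would be `dataset.map(lambda x: tokenizer(x[text_column]), batched=True)` with `text_column` set to one of the listed names. Note that when `load_dataset` returns several splits, `dataset.column_names` is a dict of split name to column list, so `dataset["train"].column_names` gives the plain list.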