While working through the "Tokenizing the Whole Dataset" sub-section of Chapter 2, I am encountering an issue when calling the tokenize function.
The problem arises in chapter:
[ ] Introduction
[x] Text Classification
[ ] Transformer Anatomy
[ ] Multilingual Named Entity Recognition
[ ] Text Generation
[ ] Summarization
[ ] Question Answering
[ ] Making Transformers Efficient in Production
[ ] Dealing with Few to No Labels
[ ] Training Transformers from Scratch
[ ] Future Directions
Describe the bug
1) When calling the tokenize function on the emotions dataset (which has "text" and "label" columns), the resulting emotions_encoded dataset drops the "text" and "label" columns and only has the "input_ids" and "attention_mask" columns. The book shows that emotions_encoded keeps all four columns after calling the tokenize function: ['attention_mask', 'input_ids', 'label', 'text'].
2) After calling the tokenize function, the resulting emotions_encoded["train"] dataset has only 151 rows, which does not match the original emotions["train"] dataset of 16,000 rows.
To Reproduce
Steps to reproduce the behavior:
1. Define the tokenize function (tokenizer and emotions are created as in the book; see the setup sketch after these steps):

       def tokenize(batch):
           return tokenizer(batch["text"], padding=True, truncation=True)

2. Call the tokenize function:

       emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)

3. Print the column names:

       print(emotions_encoded["train"].column_names)

   Output: ['input_ids', 'attention_mask']

4. Print emotions_encoded:

       print(emotions_encoded)

   Output:

       DatasetDict({
           train: Dataset({
               features: ['input_ids', 'attention_mask'],
               num_rows: 151
           })
           validation: Dataset({
               features: ['input_ids', 'attention_mask'],
               num_rows: 144
           })
           test: Dataset({
               features: ['input_ids', 'attention_mask'],
               num_rows: 152
           })
       })
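For completeness, my setup before step 1 follows the Chapter 2 code. A minimal sketch (the distilbert-base-uncased checkpoint and the emotion dataset name are taken from the book rather than shown above, so treat them as assumptions here):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed setup from Chapter 2 (not part of the steps above):
emotions = load_dataset("emotion")                                    # DatasetDict with train/validation/test splits
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # tokenizer used in the book

def tokenize(batch):
    # Pad and truncate each batch of texts
    return tokenizer(batch["text"], padding=True, truncation=True)

# Tokenize every split in a single batch per split
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)
```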
Expected behavior: emotions_encoded["train"] should retain all 16,000 rows of the original emotions["train"] dataset (and all four columns), rather than only 151 rows with just "input_ids" and "attention_mask".
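A quick sanity check I would expect to pass (a sketch; the 16,000-row figure for emotions["train"] is the original split size mentioned above):

```python
# Sketch: compare the raw and tokenized train splits
print(emotions["train"].num_rows)              # 16000 in the original split
print(emotions_encoded["train"].num_rows)      # expected 16000, but I get 151
print(emotions_encoded["train"].column_names)  # expected ['attention_mask', 'input_ids', 'label', 'text'],
                                               # but I get ['input_ids', 'attention_mask']
```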