While working through the "Tokenizing the Whole Dataset" sub-section of Chapter 2, I am encountering an issue when calling the tokenize function.
The problem arises in chapter:
[ ] Introduction
[x] Text Classification
[ ] Transformer Anatomy
[ ] Multilingual Named Entity Recognition
[ ] Text Generation
[ ] Summarization
[ ] Question Answering
[ ] Making Transformers Efficient in Production
[ ] Dealing with Few to No Labels
[ ] Training Transformers from Scratch
[ ] Future Directions
Describe the bug
1) When calling the tokenize function on the emotions dataset (which has "text" and "label" columns), the resulting emotions_encoded dataset drops the "text" and "label" columns and only has the "input_ids" and "attention_mask" columns. The book shows that emotions_encoded keeps all four columns after calling the tokenize function: ['attention_mask', 'input_ids', 'label', 'text'].
2) After calling the tokenize function, the resulting emotions_encoded["train"] dataset has only 151 rows, which does not match the original emotions["train"] dataset of 16,000 rows.
To Reproduce
Steps to reproduce the behavior:
1. Define the tokenize function (tokenizer and emotions are created as in the book; see the setup sketch after these steps):

       def tokenize(batch):
           return tokenizer(batch["text"], padding=True, truncation=True)

2. Call the tokenize function:

       emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)

3. Print the column names:

       print(emotions_encoded["train"].column_names)

   Output: ['input_ids', 'attention_mask']

4. Print emotions_encoded:

       print(emotions_encoded)

   Output:

       DatasetDict({
           train: Dataset({
               features: ['input_ids', 'attention_mask'],
               num_rows: 151
           })
           validation: Dataset({
               features: ['input_ids', 'attention_mask'],
               num_rows: 144
           })
           test: Dataset({
               features: ['input_ids', 'attention_mask'],
               num_rows: 152
           })
       })
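For completeness, my setup before step 1 follows the Chapter 2 code. A minimal sketch (the distilbert-base-uncased checkpoint and the emotion dataset name are taken from the book rather than shown above, so treat them as assumptions here):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed setup from Chapter 2 (not part of the steps above):
emotions = load_dataset("emotion")                                    # DatasetDict with train/validation/test splits
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # tokenizer used in the book

def tokenize(batch):
    # Pad and truncate each batch of texts
    return tokenizer(batch["text"], padding=True, truncation=True)

# Tokenize every split in a single batch per split
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)
```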
Expected behavior: emotions_encoded["train"] should retain all 16,000 rows of the original emotions["train"] dataset (and all four columns), rather than only 151 rows with just "input_ids" and "attention_mask".
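A quick sanity check I would expect to pass (a sketch; the 16,000-row figure for emotions["train"] is the original split size mentioned above):

```python
# Sketch: compare the raw and tokenized train splits
print(emotions["train"].num_rows)              # 16000 in the original split
print(emotions_encoded["train"].num_rows)      # expected 16000, but I get 151
print(emotions_encoded["train"].column_names)  # expected ['attention_mask', 'input_ids', 'label', 'text'],
                                               # but I get ['input_ids', 'attention_mask']
```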