nlp-with-transformers / notebooks

Jupyter notebooks for the Natural Language Processing with Transformers book
https://transformersbook.com/
Apache License 2.0

Chapter 2: tokenizing the whole dataset drops the original columns and changes the row count #134

Open rshokeen opened 8 months ago

rshokeen commented 8 months ago

Information

I am encountering an issue when calling the tokenize function in the "Tokenizing the Whole Dataset" sub-section of Chapter 2.

The problem arises in chapter: Chapter 2, sub-section "Tokenizing the Whole Dataset"

Describe the bug

1) When the tokenize function is applied to the emotions dataset (which has "text" and "label" columns), the resulting emotions_encoded dataset drops the "text" and "label" columns and only has the "input_ids" and "attention_mask" columns. The book shows emotions_encoded with all 4 columns after the call: ['attention_mask', 'input_ids', 'label', 'text']

2) After the tokenize function is applied, the resulting emotions_encoded["train"] dataset has only 151 rows, which does not match the original emotions["train"] dataset with its 16,000 rows.
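Both symptoms can be checked in one pass. The following is a minimal sketch, not code from the book: `compare_splits` is a hypothetical helper, and it only assumes that each split exposes `column_names` and `num_rows` (as `datasets.Dataset` objects do). The stand-in objects at the bottom simply mirror the numbers reported in this issue so the snippet runs without downloading anything.

```python
from types import SimpleNamespace


def compare_splits(original, encoded):
    """Report dropped columns and row counts for every split.

    Hypothetical helper: both arguments are assumed to be mappings from
    split name to an object with `column_names` and `num_rows`.
    """
    report = {}
    for split, before in original.items():
        after = encoded[split]
        report[split] = {
            # Columns present before the map call but missing afterwards.
            "dropped_columns": sorted(
                set(before.column_names) - set(after.column_names)
            ),
            "rows_before": before.num_rows,
            "rows_after": after.num_rows,
        }
    return report


# Stand-in objects mirroring the values reported in this issue.
original = {
    "train": SimpleNamespace(column_names=["text", "label"], num_rows=16000)
}
encoded = {
    "train": SimpleNamespace(
        column_names=["input_ids", "attention_mask"], num_rows=151
    )
}

for split, info in compare_splits(original, encoded).items():
    print(split, info)
```

With the real objects, the call would be `compare_splits(emotions, emotions_encoded)`.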

To Reproduce

Steps to reproduce the behavior:

  1. Define the tokenize function:

    def tokenize(batch):
        return tokenizer(batch["text"], padding=True, truncation=True)

  2. Call the tokenize function:

    emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)

  3. Print the column names:

    print(emotions_encoded["train"].column_names)

    output: ['input_ids', 'attention_mask']

    Expected behavior: ['attention_mask', 'input_ids', 'label', 'text']

  4. Print emotions_encoded:

    print(emotions_encoded)

    output:

    DatasetDict({
        train: Dataset({
            features: ['input_ids', 'attention_mask'],
            num_rows: 151
        })
        validation: Dataset({
            features: ['input_ids', 'attention_mask'],
            num_rows: 144
        })
        test: Dataset({
            features: ['input_ids', 'attention_mask'],
            num_rows: 152
        })
    })

    Expected behavior: emotions_encoded["train"] should have 16,000 rows, the same as emotions["train"], but it has only 151.
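The steps above do not record which library versions were installed. As a hedged addition, here is a small sketch for collecting them with the standard library. The package list (`datasets`, `transformers`, `tokenizers`) is an assumption about what is relevant here, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("datasets", "transformers", "tokenizers")):
    """Return a dict of package name -> installed version string.

    The default package list is an assumption; packages that are not
    installed are reported as "not installed" instead of raising.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver}")
```

Pasting this output next to the reproduction steps makes it easier for others to compare their environment with the one used here.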