huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Tokenizing in the dataset and padding manually using tokenizer.pad in the collator #12307

Closed jandono closed 2 years ago

jandono commented 3 years ago

Environment info

Who can help

@LysandreJik

Information

Model I am using (Bert, XLNet ...): Bert

The problem arises when using:

The task I am working on is:

To reproduce

I am trying to avoid tokenizing in the collator in order to speed up data loading, which is why I want to tokenize everything in advance and then simply pad in the collator. I can't provide the entire code, but here are my Dataset and my Collator, which will hopefully be enough.

import pandas as pd
import torch

from typing import Any, Dict, List

from torch.utils.data import Dataset
from tqdm import tqdm
from transformers import BertTokenizer


class DatasetTokenized(Dataset):

    def __init__(self, data: pd.DataFrame, text_column: str,
                 label_columns: List[str], tokenizer_name: str):
        super(DatasetTokenized, self).__init__()

        self.data = data
        self.text_column = text_column
        self.label_columns = label_columns
        self.tokenizer = BertTokenizer.from_pretrained(tokenizer_name)
        self.tokenized_data = self.tokenize_data(data)

    def __len__(self) -> int:
        return len(self.tokenized_data)

    def __getitem__(self, index: int) -> Dict:
        return self.tokenized_data[index]

    def tokenize_data(self, data: pd.DataFrame):
        tokenized_data = []

        print('Tokenizing data:')
        for _, row in tqdm(data.iterrows(), total=len(data)):

            text = row[self.text_column]
            labels = row[self.label_columns]
            encoding = self.tokenizer(text,
                                      add_special_tokens=True,
                                      max_length=512,
                                      padding=False,
                                      truncation=True,
                                      return_attention_mask=True,
                                      return_tensors='pt')

            tokenized_data.append({
                'text': text,
                'encoding': encoding,
                'labels': torch.FloatTensor(labels)
            })

        return tokenized_data

class BertCollatorTokenized:

    def __init__(self, tokenizer_name: str):
        super(BertCollatorTokenized, self).__init__()

        self.tokenizer = BertTokenizer.from_pretrained(tokenizer_name)

    def __call__(self, batch: List[Any]):
        text, encodings, labels = zip(
            *[[sample['text'], sample['encoding'], sample['labels']]
              for sample in batch])

        encodings = list(encodings)
        encodings = self.tokenizer.pad(encodings,
                                       max_length=512,
                                       padding='longest',
                                       return_tensors='pt')

        return {
            'text': text,
            'input_ids': encodings['input_ids'],
            'attention_mask': encodings['attention_mask'],
            'labels': torch.FloatTensor(labels)
        }
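
For context, these two classes are wired to a PyTorch DataLoader roughly like this, so the collator runs inside the DataLoader worker processes (a sketch only; the DataFrame, column names, and batch size below are illustrative placeholders, not my actual code):

from torch.utils.data import DataLoader

# Hypothetical setup: df, 'text' and the label column names are placeholders.
dataset = DatasetTokenized(data=df,
                           text_column='text',
                           label_columns=['label_a', 'label_b'],
                           tokenizer_name='bert-base-uncased')
collator = BertCollatorTokenized(tokenizer_name='bert-base-uncased')
loader = DataLoader(dataset,
                    batch_size=32,
                    collate_fn=collator,
                    num_workers=4)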

Error:

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

Full error message:

  File "train_text_classificator.py", line 78, in main
    trainer.fit(lightning_system, data_module)
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 842, in run_train
    self.run_sanity_check(self.lightning_module)
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1107, in run_sanity_check
    self.run_evaluation()
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 949, in run_evaluation
    for batch_idx, batch in enumerate(dataloader):
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/transformers-4.2.2-py3.8.egg/transformers/tokenization_utils_base.py", line 771, in convert_to_tensors
    tensor = as_tensor(value)
ValueError: expected sequence of length 4 at dim 2 (got 13)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/jav/experimental-framework/data_utils/collators/transformers_collatоrs.py", line 97, in __call__
    return_tensors='pt')
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/transformers-4.2.2-py3.8.egg/transformers/tokenization_utils_base.py", line 2706, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/transformers-4.2.2-py3.8.egg/transformers/tokenization_utils_base.py", line 276, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/home/jav/anaconda3/envs/experimental_framework/lib/python3.7/site-packages/transformers-4.2.2-py3.8.egg/transformers/tokenization_utils_base.py", line 788, in convert_to_tensors
    "Unable to create tensor, you should probably activate truncation and/or padding "
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

Expected behavior

I would expect self.tokenizer.pad(encodings, ... ) in the collator to work without issues when given a list of BatchEncoding elements.

jandono commented 3 years ago

Some additional info that might help: encodings is a list like encodings = [batch_encoding_1, ..., batch_encoding_n], and each batch encoding looks like:

{'input_ids': tensor([[  101,  1006,  1039,  1007,  2065,  1996, 13666, 11896,  2000, 14037,
          2007,  2019, 14987,  2104,  2023, 11075,  3429,  1010,  1999,  2804,
          2000,  2151,  2060,  2128,  7583,  3111,  1997,  1996,  4054,  1010,
          1996,  4054,  2089,  4685,  2008, 14987,  2006,  1996, 13666,  1005,
          1055,  6852,  1998,  2151,  3465, 22667,  2011,  1996,  4054,  2097,
          2022,  1037,  7016,  2349,  2013,  1996, 13666,  2000,  1996,  4054,
          1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

And this is the line that eventually raises an exception:

https://github.com/huggingface/transformers/blob/1498eb9888d55d76385b45e074f26703cc5049f3/src/transformers/tokenization_utils_base.py#L699

jandono commented 3 years ago

I managed to make a small reproducible example:


from transformers import BertTokenizer
from torch import tensor

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

encodings = [{'input_ids': tensor([[  101,  1006,  1039,  1007,  2065,  1996, 13666, 11896,  2000, 14037,
          2007,  2019, 14987,  2104,  2023, 11075,  3429,  1010,  1999,  2804,
          2000,  2151,  2060,  2128,  7583,  3111,  1997,  1996,  4054,  1010,
          1996,  4054,  2089,  4685,  2008, 14987,  2006,  1996, 13666,  1005,
          1055,  6852,  1998,  2151,  3465, 22667,  2011,  1996,  4054,  2097,
          2022,  1037,  7016,  2349,  2013,  1996, 13666,  2000,  1996,  4054,
          1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}, {'input_ids': tensor([[ 101, 1006, 1037, 1007, 2202, 2046, 4070, 2035, 1997, 1996, 7882, 6214,
         1997, 1996, 3563, 3105, 1010, 2164, 1996, 3872, 2030, 3635, 1997, 1996,
         7170, 2000, 2022, 2333, 1010, 3292, 2000, 2022, 7837, 2005, 4651, 1010,
         2334, 4026, 3785, 1010, 2051, 1997, 2154, 1010, 3517, 3403, 2335, 1010,
         2569, 2609, 3785, 1998, 2060, 2569, 6214, 1025, 1998,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}]

batched_encodings = tokenizer.pad(encodings, padding='longest', return_tensors='pt')

jandono commented 3 years ago

@LysandreJik any update on this?

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

jandono commented 3 years ago

@LysandreJik @patrickvonplaten @sgugger

I apologize for tagging Patrick and Sylvain, but as Lysandre seems to be busy, do you perhaps know someone who can help with this?

sgugger commented 3 years ago

The tokenizer.pad method only applies padding to a list of examples, so each of the elements in your encodings should be one-dimensional. If you remove the extra pair of [] from all the tensors in your minimal example, it will work.
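
A minimal sketch of that fix against the reproducible example above (only the leading batch dimension added by return_tensors='pt' is removed; everything else stays the same):

# Squeeze away the leading batch dimension so each value is 1-D,
# which is what tokenizer.pad expects for a list of examples.
encodings = [{key: value.squeeze(0) for key, value in enc.items()}
             for enc in encodings]
batched_encodings = tokenizer.pad(encodings, padding='longest', return_tensors='pt')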

Also please use the forums for questions around the library as we keep the issues for bugs and feature requests only.

jandono commented 3 years ago

Thanks a lot @sgugger, I posted it here because it looked like a bug to me based on the documentation. Additionally, that extra pair of brackets comes from the tokenizer, not from me. When running:

encoding = self.tokenizer(text,
                          add_special_tokens=True,
                          max_length=512,
                          padding=False,
                          truncation=True,
                          return_attention_mask=True,
                          return_tensors='pt')

You get that extra pair of brackets. I am assuming they appear because, behind the scenes, the __call__ method calls batch_encode instead of encode. Am I doing something wrong in the way I am using the tokenizer? My main goal is simply to tokenize the entire dataset beforehand and only pad during training.

sgugger commented 3 years ago

You should not use return_tensors='pt' for just one text; that option is designed to create batches that you pass directly to your model. So if you use it with one text, you get a batch of one encoding. Either add [0] to select the only element of that batch in your dataset, or create the tensors in the collate function.
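
For illustration, a sketch of that suggestion applied to the code from this issue (adapted here, not necessarily the final version used): tokenize each text without return_tensors so the values stay one-dimensional lists, and let tokenizer.pad create the batch tensors in the collator.

# In DatasetTokenized.tokenize_data: tokenize a single text without
# return_tensors, so input_ids / attention_mask stay 1-D lists of ints.
encoding = self.tokenizer(text,
                          add_special_tokens=True,
                          max_length=512,
                          padding=False,
                          truncation=True,
                          return_attention_mask=True)

# In BertCollatorTokenized.__call__: pad the whole batch here and build the
# tensors at collate time instead.
encodings = self.tokenizer.pad(list(encodings),
                               padding='longest',
                               return_tensors='pt')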

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.