huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

LEDTokenizer doesn't pad `global_attention_mask` #14648

Closed parambharat closed 2 years ago

parambharat commented 2 years ago

Environment info

Who can help

Models: @patrickvonplaten

Information

Model I am using (LEDTokenizer, LEDSeq2SeqLM):

The problem arises when using:

[x] my own modified scripts:

model_name = "allenai/led-base-16384"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_examples(examples):
    inputs = ["\n".join(document) for document in examples["document"]]
    targets = ["\n".join(document) for document in examples["summary"]]

    model_inputs = tokenizer(inputs, max_length=tokenizer.model_max_length, padding=False, truncation=True)

    model_inputs["global_attention_mask"] = [np.zeros_like(input).tolist() for input in model_inputs["input_ids"]]
    # put global attention on <s> token
    for input in model_inputs["global_attention_mask"][:]:
        input[0] = 1

    model_inputs["global_attention_mask"] = model_inputs["global_attention_mask"]
    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=512, padding=False, truncation=True,)

    labels["input_ids"] = [
        [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
    ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

def preprocess_dataset(dataset):
    return (dataset
            .map(tokenize_examples, batched=True, batch_size=5, num_proc=2,)
            .shuffle())

batch_size=2
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    max_length=tokenizer.model_max_length,
    pad_to_multiple_of=8,
    label_pad_token_id = -100,)

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    dataloader_drop_last=True,
    group_by_length=True,
    # fp16=True,
    output_dir="./models/led-16k",
    logging_steps=5,
    eval_steps=10,
    save_steps=10,
    save_total_limit=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
)

# instantiate trainer
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=dataset_dict["train"],
    eval_dataset=dataset_dict["valid"],
)

To reproduce

Steps to reproduce the behavior:

  1. Try dynamic padding using the DataCollator as shown above.
  2. Get the following error log:
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    704                 if not is_tensor(value):
--> 705                     tensor = as_tensor(value)
    706 

ValueError: expected sequence of length 4096 at dim 1 (got 3157)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-26-c3d8f1eba49d> in <module>()
      1 # start training
      2 torch.cuda.empty_cache()
----> 3 trainer.train()

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1288             self.control = self.callback_handler.on_epoch_begin(args, self.state, self.control)
   1289 
-> 1290             for step, inputs in enumerate(epoch_iterator):
   1291 
   1292                 # Skip past any already trained steps if resuming training

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    519             if self._sampler_iter is None:
    520                 self._reset()
--> 521             data = self._next_data()
    522             self._num_yielded += 1
    523             if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
    559     def _next_data(self):
    560         index = self._next_index()  # may raise StopIteration
--> 561         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    562         if self._pin_memory:
    563             data = _utils.pin_memory.pin_memory(data)

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     50         else:
     51             data = self.dataset[possibly_batched_index]
---> 52         return self.collate_fn(data)

/usr/local/lib/python3.7/dist-packages/transformers/data/data_collator.py in __call__(self, features, return_tensors)
    564             max_length=self.max_length,
    565             pad_to_multiple_of=self.pad_to_multiple_of,
--> 566             return_tensors=return_tensors,
    567         )
    568 

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in pad(self, encoded_inputs, padding, max_length, pad_to_multiple_of, return_attention_mask, return_tensors, verbose)
   2794                 batch_outputs[key].append(value)
   2795 
-> 2796         return BatchEncoding(batch_outputs, tensor_type=return_tensors)
   2797 
   2798     def create_token_type_ids_from_sequences(

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __init__(self, data, encoding, tensor_type, prepend_batch_axis, n_sequences)
    208         self._n_sequences = n_sequences
    209 
--> 210         self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
    211 
    212     @property

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    720                     )
    721                 raise ValueError(
--> 722                     "Unable to create tensor, you should probably activate truncation and/or padding "
    723                     "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
    724                 )

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

The error was not very helpful, since the tokenizer is instantiated with padding, truncation, and max_length params. The actual problem was that global_attention_mask was not being padded when batches were collated in the DataLoader.

I was able to pinpoint the problem to the padding in the tokenizer, where the model inputs that get padded don't include global_attention_mask: https://github.com/huggingface/transformers/blob/75ae287aecf20a37c232a41e25443a3421a8b5e2/src/transformers/tokenization_utils_base.py#L3121-L3150
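Something along these lines reproduces the mismatch in isolation (a minimal sketch with placeholder documents; with this transformers version, tokenizer.pad only pads input_ids and attention_mask, so the batch cannot be converted to tensors):

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")

enc = tokenizer(["a short document", "another, slightly longer document"], padding=False, truncation=True)
enc["global_attention_mask"] = [np.zeros_like(ids).tolist() for ids in enc["input_ids"]]
for mask in enc["global_attention_mask"]:
    mask[0] = 1  # global attention on <s>

# input_ids and attention_mask are padded to 64, global_attention_mask is not,
# so the tensor conversion raises the ValueError shown above.
padded = tokenizer.pad(enc, padding="max_length", max_length=64, return_tensors="pt")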

I changed lines L3128-L3131 (https://github.com/huggingface/transformers/blob/75ae287aecf20a37c232a41e25443a3421a8b5e2/src/transformers/tokenization_utils_base.py#L3128-L3131) to the following and everything worked:


            if self.padding_side == "right":
                if return_attention_mask:
                    encoded_inputs["attention_mask"] = encoded_inputs["attention_mask"] + [0] * difference
                    encoded_inputs["global_attention_mask"] = (
                        encoded_inputs["global_attention_mask"] + [0] * difference
                    )
                if "token_type_ids" in encoded_inputs:

Expected behavior

My fix was a hack for sure. We could perhaps override the _pad method in LEDTokenizer so that it also pads LED's global_attention_mask.

Something like the following:

from typing import Dict, Optional, Union

from transformers import LEDTokenizer
from transformers.file_utils import PaddingStrategy
from transformers.tokenization_utils_base import BatchEncoding, EncodedInput

class LEDTokenizerFixed(LEDTokenizer):

    def _pad(
        self,
        encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
        max_length: Optional[int] = None,
        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
    ) -> dict:
        """
        Pad encoded inputs (on left/right and up to predefined length or max length in the batch)

        Args:
            encoded_inputs: Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
            max_length: maximum length of the returned list and optionally padding length (see below).
                Will truncate by taking into account the special tokens.
            padding_strategy: PaddingStrategy to use for padding.

                - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
                - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
                - PaddingStrategy.DO_NOT_PAD: Do not pad
                The tokenizer padding sides are defined in self.padding_side:

                    - 'left': pads on the left of the sequences
                    - 'right': pads on the right of the sequences
            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
                >= 7.5 (Volta).
            return_attention_mask: (optional) Set to False to avoid returning attention mask (default: set to model specifics)
        """
        # Load from model defaults
        if return_attention_mask is None:
            return_attention_mask = "attention_mask" in self.model_input_names

        required_input = encoded_inputs[self.model_input_names[0]]

        if padding_strategy == PaddingStrategy.LONGEST:
            max_length = len(required_input)

        if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
            max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of

        needs_to_be_padded = padding_strategy != PaddingStrategy.DO_NOT_PAD and len(required_input) != max_length

        # Initialize attention mask if not present.
        if return_attention_mask and "attention_mask" not in encoded_inputs:
            encoded_inputs["attention_mask"] = [1] * len(required_input)

        if needs_to_be_padded:
            difference = max_length - len(required_input)

            if self.padding_side == "right":
                if return_attention_mask:

                    encoded_inputs["attention_mask"] = encoded_inputs["attention_mask"] + [0] * difference
                    encoded_inputs["global_attention_mask"] = (
                        encoded_inputs["global_attention_mask"] + [0] * difference
                    )
                if "token_type_ids" in encoded_inputs:
                    encoded_inputs["token_type_ids"] = (
                        encoded_inputs["token_type_ids"] + [self.pad_token_type_id] * difference
                    )
                if "special_tokens_mask" in encoded_inputs:
                    encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"] + [1] * difference
                encoded_inputs[self.model_input_names[0]] = required_input + [self.pad_token_id] * difference
            elif self.padding_side == "left":
                if return_attention_mask:
                    encoded_inputs["attention_mask"] = [0] * difference + encoded_inputs["attention_mask"]
                    encoded_inputs["global_attention_mask"] = [0] * difference + encoded_inputs["global_attention_mask"]
                if "token_type_ids" in encoded_inputs:
                    encoded_inputs["token_type_ids"] = [self.pad_token_type_id] * difference + encoded_inputs[
                        "token_type_ids"
                    ]
                if "special_tokens_mask" in encoded_inputs:
                    encoded_inputs["special_tokens_mask"] = [1] * difference + encoded_inputs["special_tokens_mask"]
                encoded_inputs[self.model_input_names[0]] = [self.pad_token_id] * difference + required_input
            else:
                raise ValueError("Invalid padding strategy:" + str(self.padding_side))

        return encoded_inputs
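A hypothetical usage sketch (assuming model and the rest of the training setup stay as defined above; the data collator would then pad global_attention_mask together with input_ids and attention_mask):

tokenizer = LEDTokenizerFixed.from_pretrained("allenai/led-base-16384")

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    max_length=tokenizer.model_max_length,
    pad_to_multiple_of=8,
    label_pad_token_id=-100,
)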
github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

parambharat commented 2 years ago

@patrickvonplaten: is this something I can contribute towards?

patrickvonplaten commented 2 years ago

Hey @parambharat,

It would be great if you could try to fix the problem by opening a PR! More than happy to take a look :-)

patrickvonplaten commented 2 years ago

@ydshieh I think you've worked with LED quite a bit recently. Could you take a look here? :-)

ydshieh commented 2 years ago

Sure, I will try it 😎

ydshieh commented 2 years ago

Hi @parambharat, @patrickvonplaten

I looked into this issue and think @parambharat's suggestion makes sense, but it needs to be refined:

In the _pad method, just like for attention_mask, we need to deal with the case where "global_attention_mask" is not in encoded_inputs:

  1. either provide a default value
  2. or just do nothing, and not include global_attention_mask in the outputs

If we go for option 1, I think it is more logical to also include global_attention_mask in the output of encode and the other tokenizer methods. But I prefer not to override many methods, so I will go for option 2 (i.e. not return global_attention_mask if it is not provided by the user) - roughly as in the sketch below.
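For reference, a minimal sketch of option 2 (the class name LEDTokenizerOption2 is just a placeholder; it defers the standard padding to the base class and pads global_attention_mask with 0, as in the hack above, only when the user provided it):

from typing import Dict, Optional, Union

from transformers import LEDTokenizer
from transformers.file_utils import PaddingStrategy
from transformers.tokenization_utils_base import BatchEncoding, EncodedInput

class LEDTokenizerOption2(LEDTokenizer):
    def _pad(
        self,
        encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
        max_length: Optional[int] = None,
        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
    ) -> dict:
        # Let the base class pad input_ids, attention_mask, token_type_ids, special_tokens_mask.
        encoded_inputs = super()._pad(
            encoded_inputs,
            max_length=max_length,
            padding_strategy=padding_strategy,
            pad_to_multiple_of=pad_to_multiple_of,
            return_attention_mask=return_attention_mask,
        )

        # Option 2: only touch global_attention_mask if the user provided it.
        if "global_attention_mask" in encoded_inputs:
            required_input = encoded_inputs[self.model_input_names[0]]
            difference = len(required_input) - len(encoded_inputs["global_attention_mask"])
            if difference > 0:
                if self.padding_side == "right":
                    encoded_inputs["global_attention_mask"] = (
                        encoded_inputs["global_attention_mask"] + [0] * difference
                    )
                else:
                    encoded_inputs["global_attention_mask"] = (
                        [0] * difference + encoded_inputs["global_attention_mask"]
                    )

        return encoded_inputs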

patrickvonplaten commented 2 years ago

I understand the issue now much better - thanks for clarifying! IMO it's a good idea to overwrite the _pad method in the tokenizer, and I agree with @ydshieh that option 2 is simpler and makes more sense here! @parambharat, would you be interested in opening a PR here, or @ydshieh maybe? :-)

ydshieh commented 2 years ago

Let's see if @parambharat would like (or has time) to contribute first. Otherwise, I can work on it.

ydshieh commented 2 years ago

@parambharat

This issue is finally fixed in #15940.