huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[Efficiency] Decoding can be made faster by not converting special tokens to ids for each token. #27289

Open ganeshpatelQB opened 1 year ago

ganeshpatelQB commented 1 year ago

System Info

Who can help?

@ArthurZucker

Information

Tasks

Reproduction

The all_special_ids property shown below is called for every single token when using the decoding function:


from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained(TOKENIZER_PATH)
# skip_special_tokens=True makes the decode loop consult all_special_ids for each token.
beams = tokenizer.batch_decode(
    outputs, skip_special_tokens=True
)
  @property
  def all_special_ids(self) -> List[int]:
      """
      `List[int]`: List the ids of the special tokens (`'<unk>'`, `'<cls>'`, etc.) mapped to class attributes.
      """
      all_toks = self.all_special_tokens
      # Re-runs a vocabulary lookup for every special token on each access.
      all_ids = self.convert_tokens_to_ids(all_toks)
      return all_ids
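
Each access to the property re-runs convert_tokens_to_ids over all special tokens, so decoding N tokens costs N full lookups. Until this is fixed in the library, a rough caller-side workaround is to build the set once and filter the ids before decoding. A minimal sketch, assuming outputs holds integer ids and that pre-filtering specials is acceptable for this tokenizer (it skips the per-token lookup but is otherwise roughly equivalent to skip_special_tokens=True):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained(TOKENIZER_PATH)  # same placeholder path as above
special_ids = set(tokenizer.all_special_ids)  # computed a single time, not once per token

# Drop special tokens up front, then decode without the per-token property access.
filtered = [[int(i) for i in seq if int(i) not in special_ids] for seq in outputs]
beams = tokenizer.batch_decode(filtered, skip_special_tokens=False)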

Expected behavior

all_special_ids should not be recomputed for every token while decoding at inference time.
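
For reference, here is a sketch of what hoisting the lookup out of the per-token loop could look like, simplified from PreTrainedTokenizer.convert_ids_to_tokens (the real method also handles single ids and tokens added to the vocabulary; this is not the actual upstream patch):

def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
    # Build the set once per call instead of re-deriving it for every token.
    special_ids = set(self.all_special_ids) if skip_special_tokens else set()
    tokens = []
    for index in ids:
        index = int(index)
        if index in special_ids:  # O(1) membership test against the cached set
            continue
        tokens.append(self._convert_id_to_token(index))
    return tokens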

ArthurZucker commented 1 year ago

Very good catch! I'll open a PR for this. It affects both convert_ids_to_tokens and decode. 🤗 I need to do some benchmarking, as I suspect this won't have a huge impact, but I'll give it a shot. I plan to benchmark our full calls to make sure we don't have similar issues elsewhere.
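
A rough micro-benchmark of the decode path could look like the sketch below; "t5-small" is just an example checkpoint, and the fake ids are wrapped in pad/eos so that skip_special_tokens has work to do:

import time
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
# 64 fake sequences of ~500 ids, wrapped in special tokens (0 = pad, 1 = eos for T5).
batch = [[0] + list(range(3, 503)) + [1]] * 64

start = time.perf_counter()
tokenizer.batch_decode(batch, skip_special_tokens=True)
print(f"batch_decode: {time.perf_counter() - start:.3f}s")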

ArthurZucker commented 11 months ago

My initial tests did not show any impact with NLLB and Whisper, which have the largest number of added tokens, but I'll try to optimize and benchmark in the near future!
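
For context, the size of the special-token set these slow tokenizers carry can be checked directly; a sketch, with the checkpoint names being common examples rather than the exact ones used above:

from transformers import AutoTokenizer

# Every entry in all_special_ids adds to the cost of each property access.
for name in ("facebook/nllb-200-distilled-600M", "openai/whisper-tiny"):
    tok = AutoTokenizer.from_pretrained(name, use_fast=False)
    print(name, len(tok.all_special_ids))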