huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Different result when using model.generate or translation pipeline (and TypeError when argument 'forced_bos_token_id' is set in pipeline) #33172

Closed cgr71ii closed 1 week ago

cgr71ii commented 2 weeks ago

System Info

Who can help?

@gante @Narsil

Hi!

I'm generating some translations from English to German with NLLB, and I noticed different results between model.generate and the translation pipeline, but only for a single sentence: "Soon there will be new containers in Wasenstraße". The translations are:

  * pipeline: "Bald werden neue Container in Wasenstraße entstehen ."
  * model.generate + tokenizer.batch_decode: "Bald werden neue Container in Wasenstraße entstehen."

The only difference is the extra space before the final period. I think the problem is related to the pipeline, for two reasons:

  1. I can't set forced_bos_token_id in the pipeline, but I can do it in model.generate. I think this may be causing the difference between pipeline and model.generate.
  2. I've implemented my own generate, and it generates the same result as model.generate. Of course, my code may be wrong.

The error raised when I try to set forced_bos_token_id in the pipeline is:

TypeError: transformers.generation.utils.GenerationMixin.generate() got multiple values for keyword argument 'forced_bos_token_id'

This is very similar to https://github.com/huggingface/transformers/issues/24104

Thank you!

Information

Tasks

Reproduction

Code to reproduce the TypeError (set force_type_error=True) and the different translations (set force_type_error=False):

import torch
import transformers

force_type_error = True # TODO change to False to see the other issue

# Variables
source_text = "Soon there will be new containers in Wasenstraße"
source_lang = "eng_Latn"
target_lang = "deu_Latn"
batch_size = 1
beam_size = 1

# Model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = transformers.AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M").to(device)
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang=source_lang, tgt_lang=target_lang)
forced_bos_token_id = tokenizer.convert_tokens_to_ids(target_lang) # 256042

assert forced_bos_token_id == 256042, forced_bos_token_id

kwargs = {}

if force_type_error:
  # pipeline -> TypeError: transformers.generation.utils.GenerationMixin.generate() got multiple values for keyword argument 'forced_bos_token_id'
  kwargs["forced_bos_token_id"] = forced_bos_token_id

# Initialize
translator_pipeline = transformers.pipeline("translation", model=model, tokenizer=tokenizer, batch_size=batch_size, src_lang=source_lang,
                                            tgt_lang=target_lang, truncation=True, device=device)
inputs = tokenizer(source_text, return_tensors="pt", add_special_tokens=True, truncation=True).to(device)

# Translate and decode
output1 = translator_pipeline(source_text, num_beams=beam_size, **kwargs)
tokens1 = output1[0]["translation_text"]
output2 = model.generate(**inputs, num_beams=beam_size, forced_bos_token_id=forced_bos_token_id)
tokens2 = tokenizer.batch_decode(output2, skip_special_tokens=True)[0]

assert tokens1 == tokens2, f"{tokens1} VS {tokens2}" # I think they should be equal, but they are not

Maybe I'm missing some arguments to get the same result?

Expected behavior

I don't know whether the TypeError is expected, but for the translations I would expect to obtain the same result in both cases, either "Bald werden neue Container in Wasenstraße entstehen ." or "Bald werden neue Container in Wasenstraße entstehen."

cgr71ii commented 2 weeks ago

I've tried removing forced_bos_token_id from the model.generate call, but the outputs still differ; in fact, the target language is now wrong:

AssertionError: Bald werden neue Container in Wasenstraße entstehen . VS Σύντομα θα υπάρχουν νέα κοντέινερ στην Wasenstraße

It seems that the pipeline sets forced_bos_token_id somehow, because it still translates to German, whereas model.generate does not translate to German when forced_bos_token_id is not passed.
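
For reference, a minimal sketch of the comparison, reusing model, inputs, tokenizer, beam_size and forced_bos_token_id from the script above:

# Without forced_bos_token_id the decoder is free to start in any language
# (Greek in my run); with it, the output starts with the deu_Latn token.
unconstrained = model.generate(**inputs, num_beams=beam_size)
constrained = model.generate(**inputs, num_beams=beam_size, forced_bos_token_id=forced_bos_token_id)
print(tokenizer.batch_decode(unconstrained, skip_special_tokens=True)[0])  # Greek output
print(tokenizer.batch_decode(constrained, skip_special_tokens=True)[0])    # German output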

gante commented 1 week ago

Hi @cgr71ii 👋

The pipeline does indeed use tgt_lang, which is set at initialization time. More precisely, it is used through a special tokenizer function that converts tgt_lang into a forced_bos_token_id and makes it part of the tokenizer's outputs. Those outputs are fed to generate, and that is why we get an exception when forced_bos_token_id is also passed to the pipeline call.
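
Roughly, this is what the translation pipeline's preprocessing does for tokenizers that define _build_translation_inputs, as the NLLB tokenizer does (a simplified sketch, not the exact pipeline code):

# The tokenizer turns tgt_lang into forced_bos_token_id and returns it together with
# the usual input_ids/attention_mask, and the whole dict is forwarded to generate().
model_inputs = tokenizer._build_translation_inputs(
    source_text, return_tensors="pt", truncation=True,
    src_lang=source_lang, tgt_lang=target_lang,
)
print(model_inputs["forced_bos_token_id"])  # 256042 (deu_Latn)
# Calling model.generate(**model_inputs, forced_bos_token_id=...) would then receive
# the argument twice, which is the "got multiple values" TypeError above.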

I've double-checked: the inputs to generate are the same in both cases in your example (when force_type_error=False). The model config and generation config are also the same. As expected, when the model inputs and the configurations are the same, the token-level outputs in the two cases are also the same. At token-decoding time, however, there is one different flag: the pipeline sets clean_up_tokenization_spaces=False. If you set this flag in your batch_decode call, you'll get the same results 🤗
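
Concretely, assuming the variables from the reproduction script above, only the batch_decode call needs to change:

# Match the pipeline's decoding flag so the space before the final period is kept.
tokens2 = tokenizer.batch_decode(
    output2, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
assert tokens1 == tokens2  # both: "Bald werden neue Container in Wasenstraße entstehen ."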

TL;DR: we make opinionated choices in our pipeline API, to make it beginner friendly. When comparing a manual workflow to a pipeline, double-check all flags inside the pipeline code :)

cgr71ii commented 1 week ago

Ohh, I see! Now I understand the problem. My fault; I should have checked the generated tokens :/ Thank you so much for the help and explanation :)