huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Different result when using model.generate or translation pipeline (and TypeError when argument 'forced_bos_token_id' is set in pipeline) #33172

Closed cgr71ii closed 1 week ago

cgr71ii commented 2 weeks ago

System Info

Who can help?

@gante @Narsil

Hi!

I'm generating some translations from English to German with NLLB, and I noticed different results between model.generate and the translation pipeline, but only for a single sentence: "Soon there will be new containers in Wasenstraße". The translations are:

  * pipeline: "Bald werden neue Container in Wasenstraße entstehen ."
  * model.generate + tokenizer.batch_decode: "Bald werden neue Container in Wasenstraße entstehen."

The only difference is the extra space before the final period. I think the problem is related to the pipeline, for two reasons:

  1. I can't set forced_bos_token_id in the pipeline, but I can do it in model.generate. I think this may be causing the difference between pipeline and model.generate.
  2. I've implemented my own generate, and it generates the same result as model.generate. Of course, my code may be wrong.

The error raised when I try to set forced_bos_token_id in the pipeline is:

TypeError: transformers.generation.utils.GenerationMixin.generate() got multiple values for keyword argument 'forced_bos_token_id'

This is very similar to https://github.com/huggingface/transformers/issues/24104

Thank you!

Information

Tasks

Reproduction

Code to reproduce the TypeError (set force_type_error=True) and the different translations (set force_type_error=False):

import torch
import transformers

force_type_error = True # TODO change to False to see the other issue

# Variables
source_text = "Soon there will be new containers in Wasenstraße"
source_lang = "eng_Latn"
target_lang = "deu_Latn"
batch_size = 1
beam_size = 1

# Model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = transformers.AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M").to(device)
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang=source_lang, tgt_lang=target_lang)
forced_bos_token_id = tokenizer.convert_tokens_to_ids(target_lang) # 256042

assert forced_bos_token_id == 256042, forced_bos_token_id

kwargs = {}

if force_type_error:
  # pipeline -> TypeError: transformers.generation.utils.GenerationMixin.generate() got multiple values for keyword argument 'forced_bos_token_id'
  kwargs["forced_bos_token_id"] = forced_bos_token_id

# Initialize
translator_pipeline = transformers.pipeline("translation", model=model, tokenizer=tokenizer, batch_size=batch_size, src_lang=source_lang,
                                            tgt_lang=target_lang, truncation=True, device=device)
inputs = tokenizer(source_text, return_tensors="pt", add_special_tokens=True, truncation=True).to(device)

# Translate and decode
output1 = translator_pipeline(source_text, num_beams=beam_size, **kwargs)
tokens1 = output1[0]["translation_text"]
output2 = model.generate(**inputs, num_beams=beam_size, forced_bos_token_id=forced_bos_token_id)
tokens2 = tokenizer.batch_decode(output2, skip_special_tokens=True)[0]

assert tokens1 == tokens2, f"{tokens1} VS {tokens2}" # I think they should be equal, but they are not

Maybe I'm missing some arguments to get the same result?

Expected behavior

I don't know whether the TypeError is expected, but for the translations I would expect to obtain the same result in both cases, either "Bald werden neue Container in Wasenstraße entstehen ." or "Bald werden neue Container in Wasenstraße entstehen."

cgr71ii commented 2 weeks ago

I've tried removing forced_bos_token_id from the model.generate call, but the outputs still differ; in fact, the target language is now wrong:

AssertionError: Bald werden neue Container in Wasenstraße entstehen . VS Σύντομα θα υπάρχουν νέα κοντέινερ στην Wasenstraße

It seems that the pipeline sets forced_bos_token_id somehow, because it still translates to German, whereas model.generate does not translate to German when forced_bos_token_id is not passed.
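
For reference, a minimal sketch of the comparison, reusing model, inputs, tokenizer, beam_size and forced_bos_token_id from the script above:

# Without forced_bos_token_id the decoder is free to start in any language
# (Greek in my run); with it, the output starts with the deu_Latn token.
unconstrained = model.generate(**inputs, num_beams=beam_size)
constrained = model.generate(**inputs, num_beams=beam_size, forced_bos_token_id=forced_bos_token_id)
print(tokenizer.batch_decode(unconstrained, skip_special_tokens=True)[0])  # Greek output
print(tokenizer.batch_decode(constrained, skip_special_tokens=True)[0])    # German output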

gante commented 1 week ago

Hi @cgr71ii 👋

The pipeline does indeed use tgt_lang, which is set at initialization time. More precisely, it is used through a special tokenizer function that converts tgt_lang into a forced_bos_token_id and makes it part of the tokenizer's outputs. Those outputs are fed to generate, and that is why we get an exception when forced_bos_token_id is also passed to the pipeline call.
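
Roughly, this is what the translation pipeline's preprocessing does for tokenizers that define _build_translation_inputs, as the NLLB tokenizer does (a simplified sketch, not the exact pipeline code):

# The tokenizer turns tgt_lang into forced_bos_token_id and returns it together with
# the usual input_ids/attention_mask, and the whole dict is forwarded to generate().
model_inputs = tokenizer._build_translation_inputs(
    source_text, return_tensors="pt", truncation=True,
    src_lang=source_lang, tgt_lang=target_lang,
)
print(model_inputs["forced_bos_token_id"])  # 256042 (deu_Latn)
# Calling model.generate(**model_inputs, forced_bos_token_id=...) would then receive
# the argument twice, which is the "got multiple values" TypeError above.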

I've double-checked: the inputs to generate are the same in both cases in your example (when force_type_error=False). The model config and generation config are also the same. As expected, when the model inputs and the configurations are the same, the token-level outputs in the two cases are also the same. At token-decoding time, however, there is one different flag: the pipeline sets clean_up_tokenization_spaces=False. If you set this flag in your batch_decode call, you'll get the same results 🤗
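
Concretely, assuming the variables from the reproduction script above, only the batch_decode call needs to change:

# Match the pipeline's decoding flag so the space before the final period is kept.
tokens2 = tokenizer.batch_decode(
    output2, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
assert tokens1 == tokens2  # both: "Bald werden neue Container in Wasenstraße entstehen ."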

TL;DR: we make opinionated choices in our pipeline API, to make it beginner friendly. When comparing a manual workflow to a pipeline, double-check all flags inside the pipeline code :)

cgr71ii commented 1 week ago

Ohh, I see! Now I understand the problem. My fault; I should have checked the generated tokens :/ Thank you so much for the help and explanation :)