cgr71ii closed this issue 1 week ago
I've tried removing forced_bos_token_id from model.generate, but the problem is still there:
AssertionError: Bald werden neue Container in Wasenstraße entstehen . VS Σύντομα θα υπάρχουν νέα κοντέινερ στην Wasenstraße
It seems that the pipeline uses forced_bos_token_id somehow, because it translates correctly to German, unlike model.generate when forced_bos_token_id is not set.
Hi @cgr71ii 👋
The pipeline does indeed use tgt_lang, after it is set at initialization time. More precisely, it is used through a special tokenizer function which makes tgt_lang part of the outputs of the tokenizer. The outputs of the tokenizer are fed to generate, and thus we get an exception if we pass tgt_lang to the pipeline call.
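The clash described above can be illustrated with a toy sketch in plain Python. Nothing here is the actual transformers code: toy_tokenize, toy_generate, and toy_pipeline_call are hypothetical stand-ins, the token id is a made-up value, and the keyword name forced_bos_token_id is borrowed from the report. The sketch only assumes that the tokenizer output already carries the target-language setting and is unpacked into the generation call alongside the user's own keywords.

```python
# Toy illustration only: these functions are hypothetical stand-ins,
# not the real transformers pipeline, tokenizer, or generate().

FAKE_GERMAN_TOKEN_ID = 1234  # made-up id, not a real NLLB vocabulary entry


def toy_tokenize(text, tgt_lang_id):
    """Return model inputs that already carry the target-language token."""
    return {"input_ids": [ord(c) for c in text], "forced_bos_token_id": tgt_lang_id}


def toy_generate(input_ids, forced_bos_token_id=None, **kwargs):
    """Pretend generation: just report which language token was forced."""
    return forced_bos_token_id


def toy_pipeline_call(text, **generate_kwargs):
    """Unpack tokenizer outputs and user kwargs into the same call."""
    model_inputs = toy_tokenize(text, FAKE_GERMAN_TOKEN_ID)
    return toy_generate(**model_inputs, **generate_kwargs)


# Works: the language token comes only from the tokenizer outputs.
forced_id = toy_pipeline_call("Soon there will be new containers")

# Fails: the same keyword now arrives twice in one call.
try:
    toy_pipeline_call("Soon there will be new containers", forced_bos_token_id=42)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(forced_id)
print(error_message)
```

In this sketch the second call fails before any generation happens, because Python refuses to receive the same keyword from two unpacked dictionaries. That is one plausible reading of why setting the language once at initialization works while passing it again at call time raises a TypeError.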
I've double-checked: the inputs to generate are the same in both cases in your example (when force_type_error=False). The model config and generation configs are also the same. As expected when the model inputs and the configurations are the same, the token-level outputs in the two cases are also the same. At token decoding time, however, we have a different flag: pipeline sets clean_up_tokenization_spaces=False. If you set this flag in your batch_decode call, you'll get the same results 🤗
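To make the effect of that flag concrete, here is a minimal sketch in plain Python. It does not call the library: approximate_cleanup and toy_decode are hypothetical helpers, the replacement list is an assumption that only approximates the real cleanup rules, and the sketch assumes the manual decode path applies the cleanup unless told otherwise.

```python
# Rough approximation of the whitespace cleanup applied at decode time.
# The replacement list is an assumption for illustration, not the
# library's exact rule set.

def approximate_cleanup(text):
    """Remove the space that tokenization leaves before punctuation."""
    for spaced, tight in ((" .", "."), (" ?", "?"), (" !", "!"), (" ,", ",")):
        text = text.replace(spaced, tight)
    return text


def toy_decode(raw_text, clean_up_tokenization_spaces=True):
    """Mimic a decode call that optionally cleans up spaces."""
    if clean_up_tokenization_spaces:
        return approximate_cleanup(raw_text)
    return raw_text


# The same decoded token sequence, before any cleanup is applied.
raw = "Bald werden neue Container in Wasenstraße entstehen ."

with_cleanup = toy_decode(raw)
without_cleanup = toy_decode(raw, clean_up_tokenization_spaces=False)

print(repr(with_cleanup))
print(repr(without_cleanup))
```

The two strings differ only in the space before the final period, which matches the discrepancy reported in this issue even though the generated tokens are identical. With the real tokenizer, the equivalent step is passing clean_up_tokenization_spaces=False to batch_decode.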
TL;DR: we make opinionated choices in our pipeline API, to make it beginner friendly. When comparing a manual workflow to a pipeline, double-check all flags inside the pipeline code :)
Ohh, I see! Now I understand the problem. My fault; I should have checked the generated tokens :/ Thank you so much for the help and explanation :)
System Info
transformers version: 4.43.3
Who can help?
@gante @Narsil
Hi!
I'm generating some translations from English to German using NLLB, but I noticed different results using model.generate or pipeline, but only for a single instance: "Soon there will be new containers in Wasenstraße". The translations are:
The only difference is the final space before the period. I think the problem is related to the pipeline for two reasons:
I can't set forced_bos_token_id in the pipeline, but I can do it in model.generate. I think this may be causing the difference between pipeline and model.generate.
The error raised when I try to set forced_bos_token_id in the pipeline is:
This is very similar to https://github.com/huggingface/transformers/issues/24104
Thank you!
Reproduction
Code to reproduce the TypeError (set force_type_error=True) and the different translations (set force_type_error=False):
Maybe I'm missing some arguments to get the same result?
Expected behavior
I don't know whether the TypeError is expected, but for the translations I would expect pipeline and model.generate to return the same result, either "Bald werden neue Container in Wasenstraße entstehen ." or "Bald werden neue Container in Wasenstraße entstehen."