Closed sunildkumar closed 2 weeks ago
It might be related to https://github.com/huggingface/transformers/issues/31890 . Let me know if suggestions there work for you :)
Yep, this seems to be related to additional spaces being added after special tokens. cc @itazap
Seems like this was already present in v4.41, so we did not break it!
Passing either from_slow=True OR add_prefix_space=False should fix it. It looks like, unlike with from_slow=True, the normalizer adds a prepend_scheme @ArthurZucker
yep!
Thank you for your quick response and the suggestions.
I'm finding that add_prefix_space=False doesn't work:
from transformers import LlavaNextProcessor

processor = LlavaNextProcessor.from_pretrained(
    pretrained_model_name_or_path="llava-hf/llava-v1.6-mistral-7b-hf",
    add_prefix_space=False,
)
text = "[INST] <image>\nWhat is shown in this image? [/INST]"
# return_tensors="pt" is assumed here so that .squeeze(0) has a tensor to act on
tokens = processor(text=text, return_tensors="pt")["input_ids"].squeeze(0).tolist()
decoded_tokens = processor.decode(tokens)
# decoded_tokens as reported: "<s> [INST] <image> \nWhat is shown in this image? [/INST]"
But from_slow=True seems to work:
from transformers import LlavaNextProcessor

processor = LlavaNextProcessor.from_pretrained(
    pretrained_model_name_or_path="llava-hf/llava-v1.6-mistral-7b-hf",
    from_slow=True,
)
text = "[INST] <image>\nWhat is shown in this image? [/INST]"
# return_tensors="pt" is assumed here so that .squeeze(0) has a tensor to act on
tokens = processor(text=text, return_tensors="pt")["input_ids"].squeeze(0).tolist()
decoded_tokens = processor.decode(tokens)
# decoded_tokens as reported: "[INST] <image>\nWhat is shown in this image? [/INST]"
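For anyone comparing the two results, here is a small sketch that pinpoints where the extra text lands, using only the input and the decoded string reported for add_prefix_space=False in this comment. It relies on the standard-library difflib; the helper name inserted_spans is made up for illustration and is not part of transformers.

```python
import difflib


def inserted_spans(original: str, decoded: str) -> list[tuple[int, str]]:
    """Return (index_in_original, inserted_text) for pure insertions in `decoded`.

    Only "insert" opcodes are reported; replacements and deletions are ignored.
    """
    matcher = difflib.SequenceMatcher(a=original, b=decoded, autojunk=False)
    spans = []
    for tag, i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            spans.append((i1, decoded[j1:j2]))
    return spans


# Strings copied from the add_prefix_space=False run reported in this thread.
original = "[INST] <image>\nWhat is shown in this image? [/INST]"
decoded = "<s> [INST] <image> \nWhat is shown in this image? [/INST]"

for position, text in inserted_spans(original, decoded):
    print(position, repr(text))
```

On these two strings the insertions should be the BOS token plus a space at the start, and a single space right after <image>, which matches the "spaces after special tokens" explanation given earlier in the thread.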
It was a recent fix! Perhaps try pulling latest on main? :hugs:
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.42.4
Who can help?
@zucchini-nlp @ArthurZucker
Reproduction
I'm finding that the LLaVA-NeXT processors/tokenizers are adding spaces unexpectedly. Setting clean_up_tokenization_spaces doesn't seem to fix this. Please see this Google Colab with a repro and more information.
Example:
Expected behavior
Decoding should yield the same text as I input.
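The expectation can be written down as a round-trip check. The sketch below is a minimal illustration, not the library's behavior: roundtrip_ok, toy_encode, lossless_decode and spacing_decode are all hypothetical names, and the toy tokenizer only mimics the reported symptom (an extra space after a special token) so the check can run without downloading a model.

```python
import re
from typing import Callable, Sequence

SPECIAL_TOKEN = "<image>"


def roundtrip_ok(
    text: str,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
) -> bool:
    """True when decoding the encoded text gives back exactly the input."""
    return decode(encode(text)) == text


def toy_encode(text: str) -> list[str]:
    # Hypothetical stand-in: split so the special token is its own piece.
    pattern = f"({re.escape(SPECIAL_TOKEN)})"
    return [piece for piece in re.split(pattern, text) if piece]


def lossless_decode(pieces: Sequence[str]) -> str:
    # Joins the pieces verbatim, so the round trip is exact.
    return "".join(pieces)


def spacing_decode(pieces: Sequence[str]) -> str:
    # Mimics the reported symptom: a space appended after the special token.
    return "".join(p + " " if p == SPECIAL_TOKEN else p for p in pieces)


text = "[INST] <image>\nWhat is shown in this image? [/INST]"
print(roundtrip_ok(text, toy_encode, lossless_decode))
print(roundtrip_ok(text, toy_encode, spacing_decode))
```

With a real processor, the same helper could be called with the tokenizer's encode and decode functions in place of the toy ones. Note that a BOS token added during encoding would also make the comparison fail unless special tokens are skipped on one side.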