huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Llava-NeXT processor inconsistencies - unexpected spaces #31921

Closed. sunildkumar closed this issue 2 weeks ago

sunildkumar commented 1 month ago

Who can help?

@zucchini-nlp @ArthurZucker

Reproduction

I'm finding that the LLaVA-NeXT processor/tokenizer adds unexpected spaces when text is encoded and then decoded. Setting clean_up_tokenization_spaces doesn't seem to fix this. Please see this Google Colab with a reproduction and more information.

Example:

from transformers import LlavaNextProcessor

processor = LlavaNextProcessor.from_pretrained(
        pretrained_model_name_or_path="llava-hf/llava-v1.6-mistral-7b-hf"
        )

text = "[INST] <image>\nWhat is shown in this image? [/INST]"

tokens = processor(text=text)['input_ids'].squeeze(0).tolist()

decoded_tokens = processor.decode(tokens)

print(decoded_tokens)
# >>> "<s> [INST] <image> \nWhat is shown in this image? [/INST]"
#                        ^  notice the extra space between <image> and \n that isn't in the original encoded text

Expected behavior

Decoding should yield the same text as I input.
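
A minimal sketch of how the round-trip mismatch can be located programmatically, using only the strings quoted in the example (no model download). `first_mismatch` and `strip_bos` are hypothetical helpers written for illustration, and the `"<s> "` prefix is an assumption taken from the decoded output shown in this report.

```python
from typing import Optional


def first_mismatch(expected: str, actual: str) -> Optional[int]:
    """Return the index of the first differing character, or None if equal."""
    for i, (a, b) in enumerate(zip(expected, actual)):
        if a != b:
            return i
    # One string is a strict prefix of the other.
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


def strip_bos(decoded: str, bos: str = "<s> ") -> str:
    """Drop a leading BOS marker (assumed to be "<s> ", as in the report)."""
    return decoded[len(bos):] if decoded.startswith(bos) else decoded


text = "[INST] <image>\nWhat is shown in this image? [/INST]"
# Decoded string copied verbatim from the report rather than recomputed here.
decoded = "<s> [INST] <image> \nWhat is shown in this image? [/INST]"

idx = first_mismatch(text, strip_bos(decoded))
print(idx)  # 14, the position right after "<image>"
```

With a real processor, `decoded` would instead come from `processor.decode(tokens)`; passing `skip_special_tokens=True` to `decode` is another way to drop the BOS marker before comparing.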

zucchini-nlp commented 1 month ago

It might be related to https://github.com/huggingface/transformers/issues/31890 . Let me know if suggestions there work for you :)

ArthurZucker commented 1 month ago

Yep this seems to be related to additional spaces being added after special tokens cc @itazap

ArthurZucker commented 1 month ago

[image]

It seems this was already present in v4.41, so we did not break anything recently!

itazap commented 1 month ago

Passing either from_slow=True or add_prefix_space=False should fix it. It looks like, without from_slow=True, the normalizer adds a prepend_scheme. @ArthurZucker
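
For anyone wanting to check this on their own install, here is a rough sketch that looks for a prepend_scheme setting in a loaded fast tokenizer. It assumes the tokenizer exposes `backend_tokenizer` (fast tokenizers do) and that the backend's `to_str()` returns the tokenizer.json contents; `find_prepend_schemes` is a hypothetical helper name, not part of transformers.

```python
import json
from typing import Any, List, Tuple


def _walk(node: Any, path: str, hits: List[Tuple[str, Any]]) -> None:
    """Recursively collect (path, value) for every "prepend_scheme" key."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "prepend_scheme":
                hits.append((path, value))
            else:
                _walk(value, f"{path}.{key}", hits)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            _walk(item, f"{path}[{i}]", hits)


def find_prepend_schemes(tokenizer: Any) -> List[Tuple[str, Any]]:
    """List where a prepend_scheme is configured in a fast tokenizer."""
    backend = getattr(tokenizer, "backend_tokenizer", None)
    if backend is None:
        # Slow tokenizers have no Rust backend to inspect.
        return []
    config = json.loads(backend.to_str())
    hits: List[Tuple[str, Any]] = []
    for section in ("normalizer", "pre_tokenizer", "decoder"):
        _walk(config.get(section), section, hits)
    return hits


# Intended usage (requires downloading the checkpoint, so left commented out):
# processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
# print(find_prepend_schemes(processor.tokenizer))
```

Comparing the result for the default load against a load with from_slow=True should show whether the two configurations differ in this setting.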

ArthurZucker commented 1 month ago

yep!

sunildkumar commented 1 month ago

> It might be related to https://github.com/huggingface/transformers/issues/31890 . Let me know if suggestions there work for you :)

> passing either from_slow=True OR add_prefix_space=False should fix it. Looks like unlike with from_slow=True, the normalizer adds a prepend_scheme

Thank you for your quick response and the suggestions.

I'm finding that add_prefix_space=False doesn't work:

processor = LlavaNextProcessor.from_pretrained(
        pretrained_model_name_or_path="llava-hf/llava-v1.6-mistral-7b-hf",
        add_prefix_space=False,
        )

text = "[INST] <image>\nWhat is shown in this image? [/INST]"

tokens = processor(text=text)['input_ids'].squeeze(0).tolist()

decoded_tokens = processor.decode(tokens)
# >>> "<s> [INST] <image> \nWhat is shown in this image? [/INST]"

But from_slow=True seems to work:

processor = LlavaNextProcessor.from_pretrained(
        pretrained_model_name_or_path="llava-hf/llava-v1.6-mistral-7b-hf",
        from_slow=True,
        )

text = "[INST] <image>\nWhat is shown in this image? [/INST]"

tokens = processor(text=text)['input_ids'].squeeze(0).tolist()

decoded_tokens = processor.decode(tokens)
# >>> "[INST] <image>\nWhat is shown in this image? [/INST]"

itazap commented 1 month ago

It was a recent fix! Perhaps try pulling the latest main? 🤗

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.