Closed: YuboFeng2023 closed this issue 1 year ago
This is not a bug. The OPT tokenizer uses the same token id for bos_token and eos_token (tokenizer.bos_token_id == tokenizer.eos_token_id == 2).
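To make the reply above concrete without downloading the real tokenizer, here is a minimal toy sketch (hypothetical class and vocabulary; the actual OPT tokenizer in `transformers` assigns id 2 to both bos_token and eos_token, and both render as `</s>`):

```python
class ToyOPTTokenizer:
    """Toy model of the OPT tokenizer's special-token behavior (not the real API)."""
    bos_token = "</s>"
    eos_token = "</s>"
    bos_token_id = 2
    eos_token_id = 2

    def __init__(self):
        # Hypothetical two-word vocabulary for illustration only.
        self.vocab = {"hello": 10, "world": 11}
        self.id_to_token = {v: k for k, v in self.vocab.items()}
        self.id_to_token[self.bos_token_id] = self.bos_token

    def encode(self, text):
        # Like the real OPT tokenizer, a bos token (id 2) is prepended.
        return [self.bos_token_id] + [self.vocab[w] for w in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)

tok = ToyOPTTokenizer()
ids = tok.encode("hello world")
print(ids)              # [2, 10, 11]
print(tok.decode(ids))  # </s> hello world
```

Because bos_token and eos_token share id 2 and the string `</s>`, decoding the bos token naturally prints what looks like an eos token at the start of the sentence.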
Thank you for your reply, but this really differs from the conventional approach. Could you please provide some evidence?
You can find it at https://huggingface.co/docs/transformers/model_doc/opt.
Thank you very much! This answer is very solid!
Hi!
What a fabulous sentence embedding model you have created! But there may be a bug in your code:
The sentences:
After encoding and decoding with your OPT tokenizer, the sentences become:
Look! There is an eos_token, </s>, at the beginning of the sentence. This is inconsistent with the usual understanding of tokenization. I think the correct tokenization should be:
Can you provide some guidance on this setting?
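For anyone checking this symptom themselves, the round trip can be sketched without the library. This toy decoder (hypothetical id map; the real OPT tokenizer uses id 2 for both bos and eos) shows that the leading `</s>` is just the bos special token, and that skipping special tokens during decoding recovers the original text, mirroring the `skip_special_tokens=True` option in `transformers`:

```python
# Hypothetical toy id map for illustration; id 2 models OPT's shared bos/eos token.
SPECIAL_IDS = {2}
ID_TO_TOKEN = {2: "</s>", 10: "hello", 11: "world"}

def toy_decode(ids, skip_special_tokens=False):
    """Decode token ids to text, optionally dropping special tokens."""
    if skip_special_tokens:
        ids = [i for i in ids if i not in SPECIAL_IDS]
    return " ".join(ID_TO_TOKEN[i] for i in ids)

print(toy_decode([2, 10, 11]))                            # </s> hello world
print(toy_decode([2, 10, 11], skip_special_tokens=True))  # hello world
```

So the `</s>` at the beginning is expected behavior for OPT, not a corrupted encoding.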