kongds / scaling_sentemb

Scaling Sentence Embeddings with Large Language Models

Maybe there are errors in your tokenization #4

Closed YuboFeng2023 closed 10 months ago

YuboFeng2023 commented 10 months ago

Hi!

What a fabulous sentence embedding model you have created! But there may be a bug in your code.

The sentences:

['hello', 'hello hello hello']

After encoding and decoding with your OPT tokenizer, the sentences become:

<s><s><s></s>hello
</s>hello hello hello

Look, there is an eos_token, </s>, at the beginning of each sentence. This is inconsistent with the usual understanding of tokenization. I think the correct tokenization should be:

<s><s><s><s>hello
<s>hello hello hello

Can you provide some guidance on this setting?
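The observation above can be reproduced with a short sketch. This assumes the `facebook/opt-125m` checkpoint and the Hugging Face `AutoTokenizer`; the repo's own scripts may configure padding differently, so the exact padding layout can vary:

```python
from transformers import AutoTokenizer

# Assumption: facebook/opt-125m stands in for whichever OPT checkpoint is used here.
tok = AutoTokenizer.from_pretrained("facebook/opt-125m")

# Tokenize a small batch with padding, then decode each row back to text
# to inspect the special tokens the tokenizer inserts.
batch = tok(["hello", "hello hello hello"], padding=True)
for ids in batch["input_ids"]:
    print(tok.decode(ids))
```

Printing the decoded rows makes the leading special token visible directly, rather than inspecting raw token IDs.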

kongds commented 10 months ago

This is not a bug. The OPT tokenizer uses the same token ID for bos_token and eos_token (tokenizer.bos_token_id == tokenizer.eos_token_id == 2).
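This can be checked in a couple of lines. The sketch below assumes the `facebook/opt-125m` checkpoint, but any OPT tokenizer shares this special-token setup:

```python
from transformers import AutoTokenizer

# Assumption: facebook/opt-125m as a representative OPT checkpoint.
tok = AutoTokenizer.from_pretrained("facebook/opt-125m")

# For OPT, the bos_token and eos_token resolve to the same string and ID,
# which is why a decoded sequence appears to start with "</s>".
print(tok.bos_token, tok.bos_token_id)
print(tok.eos_token, tok.eos_token_id)
assert tok.bos_token_id == tok.eos_token_id == 2
```

So the leading `</s>` in the decoded output is the BOS token; its string representation just happens to coincide with the EOS token's.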

YuboFeng2023 commented 10 months ago

> This is not a bug. The OPT tokenizer uses the same token ID for bos_token and eos_token (tokenizer.bos_token_id == tokenizer.eos_token_id == 2).

Thank you for your reply, but this is really different from the conventional approach. Could you please provide some evidence?

kongds commented 10 months ago

You can find it at https://huggingface.co/docs/transformers/model_doc/opt.

YuboFeng2023 commented 10 months ago

> You can find it at https://huggingface.co/docs/transformers/model_doc/opt.

Thank you very much! This answer is very solid!