kongds / scaling_sentemb

Scaling Sentence Embeddings with Large Language Models

Maybe there are errors in your tokenization #4

Closed YuboFeng2023 closed 10 months ago

YuboFeng2023 commented 10 months ago

Hi!

What a fabulous sentence embedding model you have created! But there may be a bug in your code.

The sentences:

['hello', 'hello hello hello']

After encoding and decoding with your OPT tokenizer, the sentences become:

<s><s><s></s>hello
</s>hello hello hello

Look, there is an eos_token, </s>, at the beginning of each sentence. This is inconsistent with the usual understanding of tokenization. I think the correct tokenization should be:

<s><s><s><s>hello
<s>hello hello hello

Can you provide some guidance on this setting?
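The observation above can be reproduced with a short sketch. This assumes the `facebook/opt-125m` checkpoint and the Hugging Face `AutoTokenizer`; the repo's own scripts may configure padding differently, so the exact padding layout can vary:

```python
from transformers import AutoTokenizer

# Assumption: facebook/opt-125m stands in for whichever OPT checkpoint is used here.
tok = AutoTokenizer.from_pretrained("facebook/opt-125m")

# Tokenize a small batch with padding, then decode each row back to text
# to inspect the special tokens the tokenizer inserts.
batch = tok(["hello", "hello hello hello"], padding=True)
for ids in batch["input_ids"]:
    print(tok.decode(ids))
```

Printing the decoded rows makes the leading special token visible directly, rather than inspecting raw token IDs.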

kongds commented 10 months ago

This is not a bug. The OPT tokenizer uses the same token ID for bos_token and eos_token (tokenizer.bos_token_id == tokenizer.eos_token_id == 2).
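This can be checked in a couple of lines. The sketch below assumes the `facebook/opt-125m` checkpoint, but any OPT tokenizer shares this special-token setup:

```python
from transformers import AutoTokenizer

# Assumption: facebook/opt-125m as a representative OPT checkpoint.
tok = AutoTokenizer.from_pretrained("facebook/opt-125m")

# For OPT, the bos_token and eos_token resolve to the same string and ID,
# which is why a decoded sequence appears to start with "</s>".
print(tok.bos_token, tok.bos_token_id)
print(tok.eos_token, tok.eos_token_id)
assert tok.bos_token_id == tok.eos_token_id == 2
```

So the leading `</s>` in the decoded output is the BOS token; its string representation just happens to coincide with the EOS token's.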

YuboFeng2023 commented 10 months ago

> This is not a bug. The OPT tokenizer uses the same token ID for bos_token and eos_token (tokenizer.bos_token_id == tokenizer.eos_token_id == 2).

Thank you for your reply, but this is really different from the conventional approach. Could you please provide some evidence?

kongds commented 10 months ago

You can find it at https://huggingface.co/docs/transformers/model_doc/opt.

YuboFeng2023 commented 10 months ago

> You can find it at https://huggingface.co/docs/transformers/model_doc/opt.

Thank you very much! This answer is very solid!