graykode / gpt-2-Pytorch

Simple Text-Generator with OpenAI gpt-2 Pytorch Implementation

Cannot recognize <|endoftext|> #24

Open wangkuiyi opened 1 year ago

wangkuiyi commented 1 year ago

Thank you for this project! It has been very helpful for understanding how GPT-2 synthesizes text.

I also noticed that GPT2/encoder.py does not recognize special tokens the way the HuggingFace tokenizer does.

The relevant source code in HuggingFace's repo is at https://github.com/huggingface/transformers/blob/c836f77266be9ace47bff472f63caf71c0d11333/src/transformers/tokenization_utils.py#L516-L520

I understand this is not critical, because only one special token, <|endoftext|>, is in use (https://github.com/wangkuiyi/huggingface-tokenizer-in-cxx/issues/11), so a small pre-splitting wrapper, sketched below, would be enough.
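
For illustration, here is a minimal sketch of such a wrapper, assuming the `Encoder` class from GPT2/encoder.py with its `encoder` vocab dict and `encode()` method; the function name `encode_with_special_tokens` is hypothetical, not part of the repo:

```python
import re

ENDOFTEXT = "<|endoftext|>"

def encode_with_special_tokens(enc, text):
    """Hypothetical wrapper: split the input on <|endoftext|> first, map each
    occurrence directly to its single token id, and run ordinary BPE only on
    the plain-text pieces in between."""
    eot_id = enc.encoder[ENDOFTEXT]  # 50256 in the standard GPT-2 vocab
    ids = []
    # re.split with a capturing group keeps the delimiter in the result list
    for piece in re.split(f"({re.escape(ENDOFTEXT)})", text):
        if piece == ENDOFTEXT:
            ids.append(eot_id)
        elif piece:
            ids.extend(enc.encode(piece))
    return ids

# Usage (assuming the repo's get_encoder() helper):
#   enc = get_encoder()
#   encode_with_special_tokens(enc, "first doc<|endoftext|>second doc")
```

Without this kind of pre-split, the BPE step in encode() breaks the literal string "<|endoftext|>" into several ordinary sub-tokens instead of the single special token id.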

So, just saying.