openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License
11.98k stars, 816 forks

Add handling for empty input text in encode method #277

Closed pratyakshagarwal closed 4 months ago

pratyakshagarwal commented 6 months ago

Description

This pull request addresses issue #276, where the encode method in the Encoding class of tiktoken did not handle empty input text correctly. Previously, when the input text was empty, the method returned no tokens (an empty list), which was inconsistent with the behavior of other tokenizers. To resolve this, the encode method has been modified to return the token value of the special token 'ENDOFTEXT' when the input text is empty.

Changes Made

Modified the encode method in the Encoding class to handle empty input text. When the input text is empty, the method now returns the token value of the special token 'ENDOFTEXT'.
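The change described above can be sketched roughly as follows. This is an illustration of the proposed behavior, not the actual diff: `encode_with_empty_fallback` and `toy_encode` are hypothetical names, and the toy byte-level encoder stands in for a real `Encoding.encode` so the snippet runs without downloading a vocabulary. The end-of-text ID 50256 is the value used by the GPT-2 vocabulary and is assumed here purely for illustration.

```python
from typing import Callable, List

# Assumed for illustration: 50256 is the <|endoftext|> ID in the GPT-2 vocabulary.
EOT_TOKEN = 50256


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in for Encoding.encode: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def encode_with_empty_fallback(
    encode: Callable[[str], List[int]],
    text: str,
    eot_token: int = EOT_TOKEN,
) -> List[int]:
    """Return [eot_token] for empty input, otherwise defer to `encode`."""
    if text == "":
        return [eot_token]
    return encode(text)


# Behavior before the proposed change: empty input yields no tokens.
empty_before = toy_encode("")
# Behavior proposed in this PR: empty input yields the end-of-text token.
empty_after = encode_with_empty_fallback(toy_encode, "")
# Non-empty input is passed through unchanged.
non_empty = encode_with_empty_fallback(toy_encode, "hi")
```

With a real encoding object, `enc.encode` and `enc.eot_token` would take the place of `toy_encode` and `EOT_TOKEN`.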

Testing

Added tests that verify the behavior of the encode method for empty input text. All tests pass.

Screenshots

[Screenshot 2024-04-06 172404] [Screenshot 2024-04-06 171620]

Related Issues

Closes: #

hauntsaninja commented 4 months ago

Thanks! This is fine as is; nothing in tiktoken automatically concatenates endoftext tokens.
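One way to read this comment: because the library never adds end-of-text tokens itself, callers who want them as document separators append them explicitly, and an empty list for empty input composes cleanly with that pattern. The sketch below illustrates the idea under stated assumptions; `join_documents` and `toy_encode` are hypothetical helpers, and 50256 is the GPT-2 end-of-text ID used only as an example value.

```python
from typing import Callable, Iterable, List

# Assumed for illustration: 50256 is the <|endoftext|> ID in the GPT-2 vocabulary.
EOT_TOKEN = 50256


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in for Encoding.encode: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def join_documents(
    encode: Callable[[str], List[int]],
    docs: Iterable[str],
    eot_token: int = EOT_TOKEN,
) -> List[int]:
    """Encode each document and append the separator explicitly."""
    tokens: List[int] = []
    for doc in docs:
        tokens.extend(encode(doc))  # an empty document contributes no tokens
        tokens.append(eot_token)    # the caller, not the encoder, adds the separator
    return tokens


# Three documents, the middle one empty: exactly one separator per document.
stream = join_documents(toy_encode, ["a", "", "b"])
```

If the encoder itself returned the end-of-text token for empty input, the empty document in this example would contribute two separators instead of one.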