openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Enhanced Stability in Token Length Calculation for Whitespace Handling #267

Closed: hvaria closed this 3 months ago

hvaria commented 4 months ago

The enhancement of the `_increase_last_piece_token_len` function introduces several key improvements to the tokenization process:

By extending trailing-whitespace detection to cover all preceding whitespace characters, this update determines token boundaries more precisely. That matters for applications where whitespace carries semantic weight, and it improves the overall accuracy of tokenization.
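As a minimal sketch, the check this relies on boils down to a byte-level predicate (the exact whitespace set here, space/tab/newline, is an assumption for illustration):

```rust
/// A token piece counts as "all whitespace" if every byte is a space,
/// tab, or newline (this whitespace set is an assumption).
fn is_all_space(piece: &[u8]) -> bool {
    !piece.is_empty() && piece.iter().all(|&b| matches!(b, b' ' | b'\t' | b'\n'))
}

fn main() {
    assert!(is_all_space(b" \t\n")); // whitespace-only: part of the unstable tail
    assert!(!is_all_space(b" a"));   // mixed content: a stable boundary
}
```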

Iterating backward from the last identified whitespace token cuts out unnecessary iterations. This method is particularly beneficial for texts with long stretches of consecutive whitespace, since it minimizes the computational overhead of the scan in those scenarios.
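A toy illustration of the backward scan with early termination (the pieces here are made-up decoded strings; the real code operates on token ids):

```rust
fn main() {
    // Hypothetical decoded pieces; only the trailing whitespace run matters.
    let pieces = ["Hello", ",", " world", "\n", " ", "\t"];
    let is_space = |p: &str| !p.is_empty() && p.chars().all(char::is_whitespace);

    // Walk backward and stop at the first non-whitespace piece, so only
    // the tail of the token list is ever touched.
    let tail_len = pieces.iter().rev().take_while(|p| is_space(p)).count();
    assert_eq!(tail_len, 3);
}
```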

The function uses a closure, `token_is_all_space`, to check for tokens composed entirely of whitespace. It iterates backward to find the full extent of the trailing whitespace, adjusting `last_piece_token_len` as it goes, and a `debug_assert!` statement guards the logic against out-of-range errors.
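Putting these pieces together, here is a simplified sketch of the logic as described, assuming a `decoder` map from token ids to their bytes (the real function's signature and return type are trimmed for brevity):

```rust
use std::collections::HashMap;

fn increase_last_piece_token_len(
    decoder: &HashMap<u32, Vec<u8>>,
    tokens: &[u32],
    mut last_piece_token_len: usize,
) -> usize {
    // True iff the token decodes to bytes that are all whitespace.
    let token_is_all_space = |token: &u32| {
        decoder
            .get(token)
            .map(|bytes| bytes.iter().all(|&b| matches!(b, b' ' | b'\t' | b'\n')))
            .unwrap_or(false)
    };

    if last_piece_token_len > 0
        && token_is_all_space(&tokens[tokens.len() - last_piece_token_len])
    {
        // Extend the unstable piece backward over any preceding
        // all-whitespace tokens, exiting at the first non-whitespace one.
        while last_piece_token_len < tokens.len()
            && token_is_all_space(&tokens[tokens.len() - last_piece_token_len - 1])
        {
            last_piece_token_len += 1;
        }
    }

    // The extended length can never exceed the token count.
    debug_assert!(last_piece_token_len <= tokens.len());
    last_piece_token_len
}
```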

Direct index access in a loop could be slightly more efficient, since it avoids creating an iterator; in practice, though, the difference is minimal.
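The two styles side by side on the byte check above; in release builds the optimizer will typically compile them to the same machine code:

```rust
fn main() {
    let bytes = b"  \t\n";

    // Iterator-based check, as in the closure above.
    let all_space_iter = bytes.iter().all(|&b| matches!(b, b' ' | b'\t' | b'\n'));

    // Index-based equivalent; no iterator adapters are constructed.
    let mut all_space_idx = true;
    let mut i = 0;
    while i < bytes.len() {
        if !matches!(bytes[i], b' ' | b'\t' | b'\n') {
            all_space_idx = false;
            break;
        }
        i += 1;
    }
    assert_eq!(all_space_iter, all_space_idx);
}
```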

Space complexity is unchanged at O(1) for both the original and updated logic; no additional memory is used. The worst-case time complexity is also unchanged at O(n), but the updated logic should perform better on average: thanks to early termination, the backward scan touches only the trailing whitespace run rather than the entire token sequence.