guidance-ai / guidance

A guidance language for controlling large language models.
MIT License

[Bug] Fix for LlamaCpp tokeniser prepending spaces #903

Closed by riedgar-ms 3 months ago

riedgar-ms commented 3 months ago

There seems to be a bug in the LlamaCpp tokenisers, where they prepend spaces. Fix this, following @mmoskal's suggestion, by prepending a byte which is extremely unlikely to occur in a real string, and using it to locate the offending prefix.
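The idea can be sketched as follows. This is a hypothetical illustration, not the actual patch: the `SpacePrependingTokenizer` class is a toy stand-in for a LlamaCpp tokenizer, and the sentinel byte chosen here (`\x07`) is an assumption; the real fix may use a different byte and operate on the library's own tokenizer interface.

```python
# Sketch of the workaround: some tokenizers (SentencePiece-style, as used
# by many LlamaCpp models) silently prepend a space when encoding. To
# detect and strip whatever prefix the tokenizer adds, we encode a
# sentinel byte that is extremely unlikely to occur in real text, find
# the token boundary where the decoded bytes end with that sentinel, and
# drop everything up to and including it.

SENTINEL = b"\x07"  # assumption: BEL control byte as the unlikely sentinel


class SpacePrependingTokenizer:
    """Toy stand-in for a tokenizer that prepends a space (hypothetical)."""

    def encode(self, data: bytes) -> list[int]:
        # One token per byte, with a spurious leading space.
        return list(b" " + data)

    def decode(self, tokens: list[int]) -> bytes:
        return bytes(tokens)


def encode_without_prefix(tok, data: bytes) -> list[int]:
    tokens = tok.encode(SENTINEL + data)
    # Walk forward until the decoded prefix ends with the sentinel;
    # the tokens consumed so far are tokenizer-added prefix plus sentinel.
    for i in range(1, len(tokens) + 1):
        if tok.decode(tokens[:i]).endswith(SENTINEL):
            return tokens[i:]
    raise ValueError("sentinel byte not found in tokenization")


tok = SpacePrependingTokenizer()
# The spurious leading space is stripped along with the sentinel.
assert tok.decode(encode_without_prefix(tok, b"hello")) == b"hello"
```

Scanning for the boundary token by token (rather than assuming the prefix is exactly one space) keeps the approach robust if the tokenizer's added prefix spans multiple tokens.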

codecov-commenter commented 3 months ago

Codecov Report

Attention: Patch coverage is 83.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 60.29%. Comparing base (870a4f9) to head (7bd6620).

| Files | Patch % | Lines |
|---|---|---|
| guidance/models/llama_cpp/_llama_cpp.py | 83.33% | 2 Missing |


Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #903      +/-   ##
==========================================
+ Coverage   54.50%   60.29%   +5.78%
==========================================
  Files          64       64
  Lines        4680     4684       +4
==========================================
+ Hits         2551     2824     +273
+ Misses       2129     1860     -269
```


riedgar-ms commented 3 months ago

@mmoskal I do have the same tests implemented for some transformers tokenisers in #899; those are working.

riedgar-ms commented 3 months ago

I have also run a couple of the notebooks which use LlamaCpp models, and those have been fine as well.