microsoft / semantic-kernel

Integrate cutting-edge LLM technology quickly and easily into your apps
https://aka.ms/semantic-kernel
MIT License

encoder.json/vocab.bpe show up in every project that uses SK #2679

Closed by stephentoub 1 year ago

stephentoub commented 1 year ago

Bringing in the Microsoft.SemanticKernel NuGet package causes these files (encoder.json and vocab.bpe) to show up in the consuming application's project, and they end up in the application's output directory regardless of whether the app uses the tokenizer.

I'd opened https://github.com/microsoft/semantic-kernel/pull/1800 to turn them into assembly resources instead, so that they'd simply be part of the assembly and not separate files, but it was closed due to a lack of a decision about what to do with it.
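For illustration, turning the two files into assembly resources could look roughly like the MSBuild fragment below. This is a minimal sketch, not the contents of #1800: the item paths and `LogicalName` values are assumptions, since the issue does not show the project layout.

```xml
<!-- Hypothetical fragment for the library's .csproj; the file paths are assumed. -->
<ItemGroup>
  <!-- Compile each data file into the assembly instead of shipping it as loose content. -->
  <EmbeddedResource Include="encoder.json" LogicalName="encoder.json" />
  <EmbeddedResource Include="vocab.bpe" LogicalName="vocab.bpe" />
</ItemGroup>
```

With the files embedded this way, the tokenizer would read them through `Assembly.GetManifestResourceStream` using the logical name, and nothing would be copied into the consuming application's project or output directory.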

anthonypuppo commented 1 year ago

Echoing my comment on #1800

Bringing up SharpToken, as adopting it would address three things: 1) embedding the tokenization resource files, 2) extracting the tokenization logic into a separate package (maybe wrapped as an official SK package in the future), and 3) bug https://github.com/microsoft/semantic-kernel/issues/2334.

lemillermicrosoft commented 1 year ago

Another option would be to move the tokenizer into an extension package for the OpenAI connector.