microsoft / semantic-kernel

Integrate cutting-edge LLM technology quickly and easily into your apps
https://aka.ms/semantic-kernel
MIT License

Update Tokenizer to use Microsoft.ML.Tokenizers library #478

Closed: luisquintanilla closed this issue 9 months ago

luisquintanilla commented 1 year ago

The existing tokenizer implementation supports only GPT models. The Microsoft.ML.Tokenizers package provides a BPE tokenizer implementation that can be used with GPT models. In addition, you can load your own vocabulary files for use with other models that support BPE tokenization.

Here is a related sample that loads custom vocab files for GPT-2 from HuggingFace.

https://gist.github.com/luisquintanilla/bc91de8668cfa7c3755b20329fadd027

API Documentation
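
For illustration, loading custom GPT-2 vocab files with the BPE tokenizer looks roughly like this. A minimal sketch: the `Tokenizer`/`Bpe` names follow the Microsoft.ML.Tokenizers preview releases of the time, and the file names are placeholders.

```csharp
using System;
using Microsoft.ML.Tokenizers;

// vocab.json / merges.txt: placeholder names for the GPT-2 files
// downloaded from the Hugging Face model repo.
var tokenizer = new Tokenizer(new Bpe("vocab.json", "merges.txt"));

var encoding = tokenizer.Encode("Hello, Semantic Kernel!");
Console.WriteLine(string.Join(", ", encoding.Ids));    // token ids
Console.WriteLine(string.Join(", ", encoding.Tokens)); // token strings
```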

evchaki commented 1 year ago

@luisquintanilla , thanks for the suggestion, we will take a look.

dluc commented 1 year ago

@luisquintanilla we're doing some work to integrate tiktoken. Does ML.Tokenizers include additional tokenizers that are not part of tiktoken?

luisquintanilla commented 1 year ago

> @luisquintanilla we're doing some work to integrate tiktoken. Does ML.Tokenizers include additional tokenizers that are not part of tiktoken?

@dluc Currently ML.Tokenizers supports only BPE, which I think is also the only one tiktoken supports.

@tarekgh can confirm which tokenizers are supported.

tarekgh commented 1 year ago

Right, ML.Tokenizers supports BPE, which tiktoken supports according to https://github.com/openai/tiktoken. ML.Tokenizers supports EnglishRoberta too, but that one is not supported by tiktoken.

dluc commented 1 year ago

Thanks for the info; work in progress.

JadynWong commented 1 year ago

https://github.com/microsoft/Tokenizer

> This repo contains C# and TypeScript implementations of a byte pair encoding (BPE) tokenizer for OpenAI LLMs. It's based on the open-sourced Rust implementation in OpenAI tiktoken. Both implementations are valuable for running prompt tokenization in .NET and Node.js environments before feeding a prompt into an LLM.
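
For reference, that library's usage looks roughly like this (a sketch following the microsoft/Tokenizer README of the time; treat the exact names as assumptions):

```csharp
using System;
using System.Collections.Generic;
using Microsoft.DeepDev;

// Build a tiktoken-compatible tokenizer for an OpenAI model by name.
ITokenizer tokenizer = await TokenizerBuilder.CreateByModelNameAsync("gpt-3.5-turbo");

// Count tokens before sending the prompt; no special tokens allowed here.
var ids = tokenizer.Encode("How many tokens is this prompt?", new HashSet<string>());
Console.WriteLine(ids.Count);
```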

lemillermicrosoft commented 9 months ago

#2809 and #2147 made it easier to bring any tokenizer for use. #2840 demonstrates these capabilities specifically with MicrosoftML and DeepDev token counters alongside the existing demonstration with SharpToken.
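
For context, the seam those PRs expose is a plain string-to-count delegate on TextChunker, so any tokenizer library can plug in. A minimal sketch with SharpToken (the `GptEncoding` API is assumed from that library, not taken from the PRs):

```csharp
using Microsoft.SemanticKernel.Text;
using SharpToken;

// cl100k_base is the encoding used by the GPT-3.5/GPT-4 family.
var encoding = GptEncoding.GetEncoding("cl100k_base");

// Any string -> int counter can drive the chunker.
var chunks = TextChunker.SplitPlainTextLines(
    "Some long document text to split into token-budgeted lines.",
    maxTokensPerLine: 40,
    tokenCounter: input => encoding.Encode(input).Count);
```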

luisquintanilla commented 9 months ago

Thanks for these changes @lemillermicrosoft. They look great. Feel free to close this when the PR merges.

KokinSok commented 4 months ago

Fantastic utility, thank you!

Which class supports the BERT-type tokenizers? The same sort of thing as FastBertTokenizer and BertTokenizer?

```csharp
public sealed class Bpe : Model
public sealed class EnglishRoberta : Model
public sealed class Tiktoken : Model
```

These three classes seem to be the only classes that implement the Model class.

The class EnglishRoberta does not give the same result as the previously mentioned classes for the given vocab files.

It would be fantastic to be able to use one class for all the main models atm.

Thank You!
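
For what it's worth, pending a BERT tokenizer in Microsoft.ML.Tokenizers (see the reply below), a BERT-style flow typically goes through a dedicated package. A rough sketch with FastBertTokenizer (API names per its README, so treat them as assumptions):

```csharp
using System;
using FastBertTokenizer;

// Load a WordPiece vocabulary straight from the Hugging Face hub.
var tokenizer = new BertTokenizer();
await tokenizer.LoadFromHuggingFaceAsync("bert-base-uncased");

// BERT-style encode: input ids, attention mask, and token type ids.
var (inputIds, attentionMask, tokenTypeIds) =
    tokenizer.Encode("Lorem ipsum dolor sit amet.");
Console.WriteLine(inputIds.Length);
```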

KashMoneyMillionaire commented 3 months ago

Is there any sort of cross-team effort (between https://github.com/dotnet/machinelearning and https://github.com/microsoft/semantic-kernel) to get the dotnet version of Semantic Kernel using only ML.NET tokenizers? This seems like it was a first step, so I'm trying to track the progress of further work. I see that dotnet Semantic Kernel uses 4 different tokenizer libraries (from microsoft/semantic-kernel/dotnet/Directory.Packages.props):

[screenshot: tokenizer package references in Directory.Packages.props]

stephentoub commented 3 months ago

> I see that dotnet Semantic Kernel uses 4 different tokenizer libraries (from microsoft/semantic-kernel/dotnet/Directory.Packages.props)

That's just in the samples. The actual SK libraries don't use most of them (the ONNX library does use the FastBertTokenizer, but there's a separate issue tracking adding a BERT tokenizer to the ML.NET tokenizer lib).

KashMoneyMillionaire commented 3 months ago

Ahh, that makes sense. I saw SharpToken was only used in samples but FastBert was used in code, so I stopped there and didn't look at DeepDev. That was my second question though: do the samples need different tokenizers, or should we be pushing them to use the ML.NET tokenizer libraries? Or should we be removing those libraries from dotnet/Directory.Packages.props and putting them straight in dotnet/samples/KernelSyntaxExamples/KernelSyntaxExamples.csproj?

stephentoub commented 3 months ago

Our collective goal is that all relevant tokenizers end up in Microsoft.ML.Tokenizers so that it's a one-stop shop for tokenization needs in .NET. Microsoft.ML.Tokenizers now provides a Tiktoken implementation, so SK wouldn't need DeepDev or SharpToken, but it doesn't yet have a BERT tokenizer. Different models need different tokenizers; tiktoken is what's used by OpenAI's models. We could probably at this point refine the sample that's using DeepDev and SharpToken to not use them, but really all that sample is showing is that you can use different libraries within the delegate passed to TextChunker.
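
As a sketch of what that consolidation looks like, the same TextChunker delegate can be backed by Microsoft.ML.Tokenizers alone (assuming the `CreateTiktokenForModel` factory and `CountTokens` helper from the previews of this period):

```csharp
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel.Text;

// One library for the whole pipeline: a tiktoken tokenizer from Microsoft.ML.Tokenizers.
var tokenizer = Tokenizer.CreateTiktokenForModel("gpt-4");

var lines = new List<string> { "First paragraph of a long document.", "Second paragraph." };
var paragraphs = TextChunker.SplitPlainTextParagraphs(
    lines,
    maxTokensPerParagraph: 128,
    tokenCounter: input => tokenizer.CountTokens(input));
```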

KashMoneyMillionaire commented 3 months ago

Love it, that makes a lot of sense. I've been diving into these two repos and their roadmaps, plans, and docs over the last few days, and I got the sense that was the goal, but it was difficult to find a definitive place essentially saying so. I'm assuming the ML.NET Roadmap is out of date, but that the ML.NET Milestones are up to date. Any pointers on the best place to find similar info for SK?

Thanks for the time clarifying!