huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

How to use `TokenizerBuilder`? #1549

Closed polarathene closed 3 months ago

polarathene commented 4 months ago

I expected TokenizerBuilder to produce a Tokenizer from the build() result, but build() instead returns a TokenizerImpl, which Tokenizer merely wraps.

No problem, I see there's an impl From<TokenizerImpl> for Tokenizer, but that impl seems to be doing quite a bit more than just wrapping for some reason? Meanwhile I can't construct Tokenizer(unwrapped_build_result_here) myself since the struct's field is private 🤔 (and the Tokenizer::new() method won't accept it either)


let mut tokenizer = Tokenizer::from(TokenizerBuilder::new()
    .with_model(unigram)
    .with_decoder(Some(decoder))
    .with_normalizer(Some(normalizer))
    .build()
    .map_err(anyhow::Error::msg)?
);
error[E0283]: type annotations needed
   --> mistralrs-core/src/pipeline/gguf_tokenizer.rs:139:41
    |
139 |     let mut tokenizer = Tokenizer::from(TokenizerBuilder::new()
    |                                         ^^^^^^^^^^^^^^^^^^^^^ cannot infer type of the type parameter `PT` declared on the struct `TokenizerBuilder`
    |
    = note: cannot satisfy `_: tokenizers::PreTokenizer`
    = help: the following types implement trait `tokenizers::PreTokenizer`:
              tokenizers::pre_tokenizers::bert::BertPreTokenizer
              tokenizers::decoders::byte_level::ByteLevel
              tokenizers::pre_tokenizers::delimiter::CharDelimiterSplit
              tokenizers::pre_tokenizers::digits::Digits
              tokenizers::decoders::metaspace::Metaspace
              tokenizers::pre_tokenizers::punctuation::Punctuation
              tokenizers::pre_tokenizers::sequence::Sequence
              tokenizers::pre_tokenizers::split::Split
            and 4 others
note: required by a bound in `tokenizers::TokenizerBuilder::<M, N, PT, PP, D>::new`
   --> /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/mod.rs:314:9
    |
314 |     PT: PreTokenizer,
    |         ^^^^^^^^^^^^ required by this bound in `TokenizerBuilder::<M, N, PT, PP, D>::new`
...
319 |     pub fn new() -> Self {
    |            --- required by a bound in this associated function
help: consider specifying the generic arguments
    |
139 |     let mut tokenizer = Tokenizer::from(TokenizerBuilder::<tokenizers::models::unigram::Unigram, tokenizers::NormalizerWrapper, PT, PP, tokenizers::DecoderWrapper>::new()
    |                                                         +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Why is this an issue? Isn't the point of a builder that you don't have to specify types for the optional components you never set?

cannot infer type of the type parameter `PT` declared on the struct `TokenizerBuilder`

I had a glance over the source on GitHub but didn't see an example or test exercising this API, and the docs don't really cover it either.
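
For reference, the compiler's hint does appear to work if you pin all five generics to the crate's wrapper enums, though having to spell that out rather defeats the point of a builder. A rough, untested sketch (assuming decoder and normalizer here are already DecoderWrapper / NormalizerWrapper values):

use tokenizers::{
    DecoderWrapper, ModelWrapper, NormalizerWrapper, PostProcessorWrapper,
    PreTokenizerWrapper, Tokenizer, TokenizerBuilder,
};

// Pinning every generic slot to the corresponding *Wrapper enum means the
// slots left unset (pre-tokenizer, post-processor) no longer need inference:
let built = TokenizerBuilder::<
    ModelWrapper,
    NormalizerWrapper,
    PreTokenizerWrapper,
    PostProcessorWrapper,
    DecoderWrapper,
>::new()
.with_model(ModelWrapper::Unigram(unigram))
.with_decoder(Some(decoder))
.with_normalizer(Some(normalizer))
.build()
.map_err(anyhow::Error::msg)?;

let mut tokenizer = Tokenizer::from(built);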


Meanwhile with Tokenizer instead of TokenizerBuilder this works:

let mut tokenizer = Tokenizer::new(tokenizers::ModelWrapper::Unigram(unigram));
tokenizer.with_decoder(decoder);
tokenizer.with_normalizer(normalizer);
polarathene commented 4 months ago

In the meantime, I've implemented this alternative workaround (using buildstructor):

use anyhow::Result;
use tokenizers::{DecoderWrapper, ModelWrapper, NormalizerWrapper, Tokenizer};

// `Decoder` / `Normalizer` are local ergonomic enums that convert into the
// corresponding tokenizers wrapper enums (see further below).
struct TokenizerX;
#[buildstructor::buildstructor]
impl TokenizerX {
    #[builder]
    fn try_new<'a>(
        with_model: ModelWrapper,
        with_decoder: Option<Decoder<'a>>,
        with_normalizer: Option<Normalizer<'a>>,
    ) -> Result<Tokenizer> {
        let mut tokenizer = Tokenizer::new(with_model);

        // Handle local enum to remote enum type:
        if let Some(decoder) = with_decoder {
            let d = DecoderWrapper::try_from(decoder)?;
            tokenizer.with_decoder(d);
        }
        if let Some(normalizer) = with_normalizer {
            let n = NormalizerWrapper::try_from(normalizer)?;
            tokenizer.with_normalizer(n);
        }

        Ok(tokenizer)
    }
}

Usage:

let mut tokenizer: Tokenizer = TokenizerX::try_builder()
    .with_model(model)
    .with_decoder(decoder)
    .with_normalizer(normalizer)
    .build()?;

The local-to-remote enum logic above is for the related DecoderWrapper + NormalizerWrapper enums, which were also a bit noisy to use / grok, so I have a similar workaround for those:

let decoder = Decoder::Sequence(vec![
    Decoder::Replace("_", " "),
    Decoder::ByteFallback,
    Decoder::Fuse,
    Decoder::Strip(' ', 1, 0),
]);

let normalizer = Normalizer::Sequence(vec![
    Normalizer::Prepend("▁"),
    Normalizer::Replace(" ", "▁"),
]);
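
The conversion itself is just a match from each local variant onto the corresponding tokenizers type. Roughly, for the decoder side (an illustrative sketch; the real definitions live in mistral.rs):

use tokenizers::decoders::byte_fallback::ByteFallback;
use tokenizers::decoders::fuse::Fuse;
use tokenizers::decoders::sequence::Sequence;
use tokenizers::decoders::strip::Strip;
use tokenizers::normalizers::Replace;
use tokenizers::DecoderWrapper;

// Local ergonomic enum (shape inferred from the usage above):
enum Decoder<'a> {
    ByteFallback,
    Fuse,
    Replace(&'a str, &'a str),
    Strip(char, usize, usize),
    Sequence(Vec<Decoder<'a>>),
}

impl<'a> TryFrom<Decoder<'a>> for DecoderWrapper {
    type Error = anyhow::Error;

    fn try_from(decoder: Decoder<'a>) -> Result<Self, Self::Error> {
        Ok(match decoder {
            Decoder::ByteFallback => ByteFallback::new().into(),
            Decoder::Fuse => Fuse::new().into(),
            Decoder::Strip(content, start, stop) => Strip::new(content, start, stop).into(),
            Decoder::Replace(pattern, content) => {
                Replace::new(pattern, content).map_err(anyhow::Error::msg)?.into()
            }
            Decoder::Sequence(decoders) => {
                // Recursively convert nested decoders, then wrap in a Sequence:
                let wrapped = decoders
                    .into_iter()
                    .map(DecoderWrapper::try_from)
                    .collect::<Result<Vec<_>, _>>()?;
                Sequence::new(wrapped).into()
            }
        })
    }
}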

More details at mistral.rs.

ArthurZucker commented 4 months ago

The builder is, I believe, mostly used for training.

polarathene commented 4 months ago

@ArthurZucker perhaps you could document that better? Because by its naming convention and the current doc comment it implies it is the builder pattern for the Tokenizer struct:

Builder for Tokenizer structs.

It provides an API that matches what you'd expect of a builder, and its build() method returns a type that is used to construct a Tokenizer struct (which also has a From impl for this type):

https://github.com/huggingface/tokenizers/blob/1ff56c0c70b045f0cd82da1af9ac08cd4c7a6f9f/tokenizers/src/tokenizer/mod.rs#L464-L484

https://github.com/huggingface/tokenizers/blob/1ff56c0c70b045f0cd82da1af9ac08cd4c7a6f9f/tokenizers/src/tokenizer/mod.rs#L419-L436

https://github.com/huggingface/tokenizers/blob/1ff56c0c70b045f0cd82da1af9ac08cd4c7a6f9f/tokenizers/src/tokenizer/mod.rs#L408-L417


As the issue reports though, that doesn't work very well; the builder API is awkward to use. You could probably adapt it to use buildstructor, similar to what I've shown above with my TokenizerX workaround type (which also applies a similar workaround to the Decoder / Normalizer inputs for a better DX, but that part isn't required).

Presently, due to the issue reported here, the builder offers little value over constructing the tokenizer without a fluent builder API.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.