Closed polarathene closed 3 months ago
In the meantime, I've implemented this alternative workaround (using buildstructor):
struct TokenizerX;

#[buildstructor::buildstructor]
impl TokenizerX {
    #[builder]
    fn try_new<'a>(
        with_model: ModelWrapper,
        with_decoder: Option<Decoder<'a>>,
        with_normalizer: Option<Normalizer<'a>>,
    ) -> Result<Tokenizer> {
        let mut tokenizer = Tokenizer::new(with_model);

        // Handle local enum to remote enum type:
        if let Some(decoder) = with_decoder {
            let d = DecoderWrapper::try_from(decoder)?;
            tokenizer.with_decoder(d);
        }
        if let Some(normalizer) = with_normalizer {
            let n = NormalizerWrapper::try_from(normalizer)?;
            tokenizer.with_normalizer(n);
        }

        Ok(tokenizer)
    }
}
Usage:
let mut tokenizer: Tokenizer = TokenizerX::try_builder()
    .with_model(model)
    .with_decoder(decoder)
    .with_normalizer(normalizer)
    .build()?;
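For readers who have not used buildstructor, the shape of the resulting API can be approximated by hand. The sketch below is only an illustration of a fluent builder with one required and two optional inputs. `Model`, `Tok`, and `TokBuilder` are hypothetical stand-ins, not types from tokenizers, and this is not what the buildstructor macro expands to (here the required field is checked at runtime in `build()`).

```rust
// Stand-in types: hypothetical, NOT the real `tokenizers` API.
#[derive(Debug, Clone, PartialEq)]
struct Model(String);

#[derive(Debug, PartialEq)]
struct Tok {
    model: Model,
    decoder: Option<String>,
    normalizer: Option<String>,
}

// Hand-written builder: one required field, two optional ones.
#[derive(Default)]
struct TokBuilder {
    model: Option<Model>,
    decoder: Option<String>,
    normalizer: Option<String>,
}

impl TokBuilder {
    fn with_model(mut self, model: Model) -> Self {
        self.model = Some(model);
        self
    }

    fn with_decoder(mut self, decoder: &str) -> Self {
        self.decoder = Some(decoder.to_string());
        self
    }

    fn with_normalizer(mut self, normalizer: &str) -> Self {
        self.normalizer = Some(normalizer.to_string());
        self
    }

    // Fallible build: optional fields stay `None` unless they were set,
    // while a missing model is reported as an error.
    fn build(self) -> Result<Tok, String> {
        let model = self.model.ok_or_else(|| "model is required".to_string())?;
        Ok(Tok {
            model,
            decoder: self.decoder,
            normalizer: self.normalizer,
        })
    }
}
```

The point of the pattern is that a caller only names the components it actually sets; anything left out defaults to `None`.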
The local to remote enum logic above is for the related DecoderWrapper + NormalizerWrapper enums, which were also a bit noisy to use / grok, so I have a similar workaround for those:
let decoder = Decoder::Sequence(vec![
    Decoder::Replace("_", " "),
    Decoder::ByteFallback,
    Decoder::Fuse,
    Decoder::Strip(' ', 1, 0),
]);

let normalizer = Normalizer::Sequence(vec![
    Normalizer::Prepend("▁"),
    Normalizer::Replace(" ", "▁"),
]);
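The conversion from such a terse local enum to the crate's wrapper enum is not shown in this thread. A minimal sketch of how it might look, using stand-in types only: `LocalDecoder` and `RemoteDecoder` are hypothetical names, the variants are simplified, and the empty-pattern check is an assumed failure mode chosen just to show why `TryFrom` (rather than `From`) could be needed.

```rust
use std::convert::TryFrom;

// Stand-in for a "remote" wrapper enum such as `DecoderWrapper`.
// Hypothetical and simplified; not the real `tokenizers` definition.
#[derive(Debug, PartialEq)]
enum RemoteDecoder {
    Replace { pattern: String, content: String },
    Fuse,
    Sequence(Vec<RemoteDecoder>),
}

// Terse local enum, in the spirit of the `Decoder<'a>` type above.
#[derive(Debug)]
enum LocalDecoder<'a> {
    Replace(&'a str, &'a str),
    Fuse,
    Sequence(Vec<LocalDecoder<'a>>),
}

impl<'a> TryFrom<LocalDecoder<'a>> for RemoteDecoder {
    type Error = String;

    fn try_from(local: LocalDecoder<'a>) -> Result<Self, Self::Error> {
        Ok(match local {
            LocalDecoder::Replace(pattern, content) => {
                // Assumed failure mode, for illustration only.
                if pattern.is_empty() {
                    return Err("replace pattern must not be empty".to_string());
                }
                RemoteDecoder::Replace {
                    pattern: pattern.to_string(),
                    content: content.to_string(),
                }
            }
            LocalDecoder::Fuse => RemoteDecoder::Fuse,
            // Convert nested decoders recursively, stopping at the first error.
            LocalDecoder::Sequence(items) => RemoteDecoder::Sequence(
                items
                    .into_iter()
                    .map(|d| RemoteDecoder::try_from(d))
                    .collect::<Result<Vec<_>, _>>()?,
            ),
        })
    }
}
```

With a conversion like this in place, call sites can stay as short as the `Decoder::Sequence(vec![...])` example, and the verbose construction lives in one spot.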
The builder is, I believe, mostly used for training
@ArthurZucker perhaps you could better document that? By naming convention and the current doc comment, it implies that it is the builder pattern for the Tokenizer struct:
Builder for Tokenizer structs.
It provides an API that matches what you'd expect of a builder API, and its build() method returns a type that is used to construct a Tokenizer struct (which also has a From impl for this type):
As the issue reports though, that doesn't seem to work very well, and the builder API is awkward to use. You could probably adapt it to use buildstructor, similar to how I have shown above with my TokenizerX workaround type (which also does a similar workaround for Decoder / Normalizer inputs to provide a better DX, but that is not required).
Presently, due to the reported issue here the builder offers little value vs creating the tokenizer without a fluent builder API.
I expected TokenizerBuilder to produce a Tokenizer from the build() result, but instead Tokenizer wraps TokenizerImpl.
No problem, I see that it impl From<TokenizerImpl> for Tokenizer, but it's attempting to do quite a bit more for some reason? Meanwhile I cannot use Tokenizer(unwrapped_build_result_here) as the struct is private 🤔 (while the Tokenizer::new() method won't take this in either)
Why is this an issue? Isn't the point of the builder so that you don't have to specify the optional types not explicitly set?
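One plausible explanation, sketched with stand-in types: if the builder is generic over every component type, a component that is never set still has a type parameter that the compiler must resolve. Everything below (`TokImpl`, `TokImplBuilder`, `Tok`, `Bpe`, `Lowercase`, `AnyModel`, `AnyNormalizer`) is hypothetical and only mirrors the general shape of a generic impl type wrapped by a non-generic newtype. It is not the tokenizers source.

```rust
// Concrete component stand-ins (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct Bpe;
#[derive(Debug, Clone, PartialEq)]
struct Lowercase;

// Type-erased wrapper enums, in the spirit of ModelWrapper / NormalizerWrapper.
#[derive(Debug, PartialEq)]
enum AnyModel {
    Bpe(Bpe),
}
#[derive(Debug, PartialEq)]
enum AnyNormalizer {
    Lowercase(Lowercase),
}

impl From<Bpe> for AnyModel {
    fn from(m: Bpe) -> Self {
        AnyModel::Bpe(m)
    }
}
impl From<Lowercase> for AnyNormalizer {
    fn from(n: Lowercase) -> Self {
        AnyNormalizer::Lowercase(n)
    }
}

// Generic implementation type, analogous in shape to a TokenizerImpl<M, N, ...>.
#[derive(Debug, PartialEq)]
struct TokImpl<M, N> {
    model: M,
    normalizer: Option<N>,
}

// Generic builder: `N` is a type parameter even when no normalizer is set.
struct TokImplBuilder<M, N> {
    model: Option<M>,
    normalizer: Option<N>,
}

impl<M, N> TokImplBuilder<M, N> {
    fn new() -> Self {
        TokImplBuilder { model: None, normalizer: None }
    }

    fn with_model(mut self, model: M) -> Self {
        self.model = Some(model);
        self
    }

    fn build(self) -> Result<TokImpl<M, N>, String> {
        let model = self.model.ok_or_else(|| "model is required".to_string())?;
        Ok(TokImpl { model, normalizer: self.normalizer })
    }
}

// Non-generic newtype with a private field, analogous to Tokenizer.
#[derive(Debug, PartialEq)]
struct Tok(TokImpl<AnyModel, AnyNormalizer>);

impl<M: Into<AnyModel>, N: Into<AnyNormalizer>> From<TokImpl<M, N>> for Tok {
    fn from(t: TokImpl<M, N>) -> Self {
        // The conversion erases the concrete component types.
        Tok(TokImpl {
            model: t.model.into(),
            normalizer: t.normalizer.map(Into::into),
        })
    }
}

fn build_tok() -> Result<Tok, String> {
    // Writing `TokImplBuilder::new().with_model(Bpe).build()?` without the
    // turbofish is rejected by rustc: nothing determines `N`.
    let built = TokImplBuilder::<Bpe, Lowercase>::new().with_model(Bpe).build()?;
    Ok(built.into())
}
```

Under these assumptions, the `From` impl has to do "more" than wrap the value because it converts each generic component into its wrapper enum, and the caller still has to name a type for every unset component.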
I had a glance over the source on GitHub but didn't see an example or test for using this API, and the docs don't really cover it either.
Meanwhile with Tokenizer instead of TokenizerBuilder this works: