Thank you for this wrapper!
I would like to propose the following changes to the API, and am contributing the implementation as well:

- Allow the HuggingFace tokenizer's `Encode` method to optionally accept an `add_special_tokens` argument. Many models require these special tokens, and prepending them by hand to the returned vector isn't optimal.
- Allow the HuggingFace tokenizer's `Decode` method to optionally accept a `skip_special_tokens` argument. Again, this saves time when using the string for downstream tasks, instead of slicing returned strings or trimming input vectors.

These changes would be backwards compatible. Users can opt in by explicitly initializing an `HFTokenizer` object, or by casting a `Tokenizer*` to `HFTokenizer*` when the underlying object is indeed an `HFTokenizer`.
These changes will leave the Tokenizer interface untouched.
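To make the proposal concrete, here is a minimal sketch of the shape the API could take. The class names mirror the wrapper's `Tokenizer` and `HFTokenizer`, but the bodies are toy stand-ins (a fake BOS token, byte-level "tokenization") purely to show the overload structure and the downcast usage, not the real implementation:

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Existing abstract interface: left untouched.
class Tokenizer {
 public:
  virtual ~Tokenizer() = default;
  virtual std::vector<int32_t> Encode(const std::string& text) = 0;
  virtual std::string Decode(const std::vector<int32_t>& ids) = 0;
};

class HFTokenizer : public Tokenizer {
 public:
  static constexpr int32_t kBosId = 1;  // toy special token for the sketch

  // Proposed overload: optionally add special tokens during encoding.
  std::vector<int32_t> Encode(const std::string& text,
                              bool add_special_tokens) {
    std::vector<int32_t> ids;
    if (add_special_tokens) ids.push_back(kBosId);
    for (char c : text) ids.push_back(static_cast<int32_t>(c));
    return ids;
  }

  // Proposed overload: optionally skip special tokens during decoding.
  std::string Decode(const std::vector<int32_t>& ids,
                     bool skip_special_tokens) {
    std::string out;
    for (int32_t id : ids) {
      if (skip_special_tokens && id == kBosId) continue;
      out.push_back(static_cast<char>(id));
    }
    return out;
  }

  // Base-interface methods keep their current behavior by delegating
  // with the old defaults, so existing call sites compile unchanged.
  std::vector<int32_t> Encode(const std::string& text) override {
    return Encode(text, /*add_special_tokens=*/false);
  }
  std::string Decode(const std::vector<int32_t>& ids) override {
    return Decode(ids, /*skip_special_tokens=*/false);
  }
};
```

Callers holding a `Tokenizer*` see no change; callers who know they have an `HFTokenizer` can cast and use the new arguments directly.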