Open n1t0 opened 3 years ago
Discussed offline with @n1t0: our current decision is to wait for https://github.com/huggingface/tokenizers/issues/659 to be resolved before moving on with this issue.
This is the better tradeoff, as the alternative would imply duplicating a lot of logic in transformers that's already present but not exposed by tokenizers.
🚀 Feature request
Extract the do_lower_case option to make it available for any tokenizer, not just those that initially supported it, like the BERT tokenizers.
Motivation
Sometimes we want to specify do_lower_case=True in the tokenizer_config.json of a custom tokenizer to activate lowercasing. The problem is that this only works for tokenizers based on one that originally supported this option.
I think we should extract this feature into a shared one that can be used with any tokenizer.
Example of a model that would need this described here: https://github.com/huggingface/transformers/issues/9518
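To make the request concrete, here is a minimal sketch of what a shared lowercasing option could look like. It is not how transformers or tokenizers implement it: the class LowercasingTokenizer, its from_config helper, and whitespace_tokenize are hypothetical names used only for illustration, and the only thing taken from this issue is the do_lower_case key read from a tokenizer_config.json-style JSON string.

```python
import json
from typing import Callable, List


class LowercasingTokenizer:
    """Hypothetical wrapper: applies do_lower_case in front of any tokenize function."""

    def __init__(self, tokenize_fn: Callable[[str], List[str]], do_lower_case: bool = False):
        self.tokenize_fn = tokenize_fn
        self.do_lower_case = do_lower_case

    @classmethod
    def from_config(cls, tokenize_fn: Callable[[str], List[str]], config_json: str) -> "LowercasingTokenizer":
        # Read the option from a tokenizer_config.json-style string.
        # A missing key means "leave the text untouched".
        config = json.loads(config_json)
        return cls(tokenize_fn, do_lower_case=bool(config.get("do_lower_case", False)))

    def tokenize(self, text: str) -> List[str]:
        # Lowercasing happens before the underlying tokenizer runs,
        # so it does not depend on which tokenizer is wrapped.
        if self.do_lower_case:
            text = text.lower()
        return self.tokenize_fn(text)


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in for a tokenizer that has no lowercasing support of its own.
    return text.split()


tok = LowercasingTokenizer.from_config(whitespace_tokenize, '{"do_lower_case": true}')
print(tok.tokenize("Hello World"))
```

The point of the sketch is only that the option lives outside the specific tokenizer, so a tokenizer that never supported lowercasing picks it up from its config without changes to its own logic.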
Special care points
Make sure the convert_slow_tokenizer script also handles this, so that the option is activated in the resulting fast tokenizer.
cc @LysandreJik @sgugger