🚀 Feature request

Extract the do_lower_case option so that it is available for any tokenizer, not just those that originally supported it, such as the BERT tokenizers.

Motivation

Sometimes we want to specify do_lower_case=True in the tokenizer_config.json of a custom tokenizer to activate lowercasing. The problem is that this only works for tokenizers based on a class that originally implemented this option; for any other tokenizer the setting has no effect.

I think we should extract this feature and make it a shared one that can be used with any tokenizer.
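
To illustrate the idea, here is a minimal sketch of what a shared option could look like. It does not use the transformers API: `LowercasingTokenizer` and its `tokenize_fn` argument are hypothetical names made up for this example, and the inline JSON string stands in for the contents of a tokenizer_config.json.

```python
import json
from typing import Callable, List


class LowercasingTokenizer:
    """Hypothetical wrapper: applies do_lower_case in front of any tokenize function."""

    def __init__(self, tokenize_fn: Callable[[str], List[str]], do_lower_case: bool = False):
        self.tokenize_fn = tokenize_fn
        self.do_lower_case = do_lower_case

    def tokenize(self, text: str) -> List[str]:
        # The lowercasing step lives in the shared wrapper,
        # so the underlying tokenizer does not need to know about it.
        if self.do_lower_case:
            text = text.lower()
        return self.tokenize_fn(text)


# Stand-in for the contents of a tokenizer_config.json file.
config = json.loads('{"do_lower_case": true}')

# str.split plays the role of an arbitrary tokenizer without lowercasing support.
tokenizer = LowercasingTokenizer(str.split, **config)
tokens = tokenizer.tokenize("Hello World")
print(tokens)
```

A real implementation would probably also need to keep special tokens out of the lowercasing step, which this sketch ignores.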

An example of a model that would need this is described here: https://github.com/huggingface/transformers/issues/9518

Special care points

Make sure the convert_slow_tokenizer script also handles this, to activate the option in the resulting fast tokenizer.
Maybe some other options could have the same treatment?
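
For the fast tokenizer side, one possible approach is to append a `Lowercase` normalizer from the tokenizers library to whatever normalizer the converted tokenizer already has. The sketch below is only an assumption about how this could work, not what convert_slow_tokenizer does today: `enable_lower_case` is a hypothetical helper, the toy `WordLevel` vocabulary is made up, and placing the lowercasing after the existing normalizers is an arbitrary choice.

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace


def enable_lower_case(tokenizer: Tokenizer) -> Tokenizer:
    """Hypothetical helper: add a Lowercase step to the tokenizer's normalizer."""
    lowercase = normalizers.Lowercase()
    if tokenizer.normalizer is None:
        # No normalizer configured yet: lowercasing becomes the only step.
        tokenizer.normalizer = lowercase
    else:
        # Keep the existing normalization and lowercase afterwards (assumed order).
        tokenizer.normalizer = normalizers.Sequence([tokenizer.normalizer, lowercase])
    return tokenizer


# Toy fast tokenizer with a lowercase-only vocabulary (made up for this example).
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
fast_tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
fast_tokenizer.pre_tokenizer = Whitespace()

enable_lower_case(fast_tokenizer)
tokens = fast_tokenizer.encode("Hello WORLD").tokens
print(tokens)
```

Without the added normalizer, both words would fall outside the vocabulary and map to `[UNK]`.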

cc @LysandreJik @sgugger

huggingface / transformers

Allow `do_lower_case=True` for any tokenizer #10121