huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Allow `do_lower_case=True` for any tokenizer #10121

Open n1t0 opened 3 years ago

n1t0 commented 3 years ago

🚀 Feature request

Extract the `do_lower_case` option so that it is available for any tokenizer, not just those that initially supported it, such as the BERT tokenizers.

Motivation

Sometimes we want to specify `do_lower_case=True` in the `tokenizer_config.json` of a custom tokenizer to activate lowercasing. The problem is that this only works for tokenizers derived from a tokenizer class that already supported this option; for any other tokenizer the setting has no effect.

I think we should extract this feature and make it a shared one that can be used with any tokenizer.
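
To make the request concrete, here is a minimal sketch of the idea in plain Python. It is not how transformers or tokenizers implement lowercasing: `load_do_lower_case`, `with_optional_lowercasing`, and `whitespace_tokenize` are hypothetical names made up for illustration. The only thing taken from this issue is the `do_lower_case` key in a `tokenizer_config.json`-style file.

```python
import json
from typing import Callable, List

Tokenize = Callable[[str], List[str]]


def load_do_lower_case(config_text: str) -> bool:
    """Read the do_lower_case flag from tokenizer_config.json-style text.

    Hypothetical helper: a missing key is treated as False.
    """
    return bool(json.loads(config_text).get("do_lower_case", False))


def with_optional_lowercasing(tokenize: Tokenize, do_lower_case: bool) -> Tokenize:
    """Wrap any tokenize function so lowercasing is applied before it runs.

    Hypothetical helper: the wrapped tokenizer needs no lowercasing support.
    """
    def wrapped(text: str) -> List[str]:
        if do_lower_case:
            text = text.lower()
        return tokenize(text)

    return wrapped


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in for a tokenizer that has no do_lower_case option of its own."""
    return text.split()


config_text = '{"do_lower_case": true}'
tokenize = with_optional_lowercasing(
    whitespace_tokenize, load_do_lower_case(config_text)
)
print(tokenize("Hello World"))  # ['hello', 'world']
```

The point of the sketch is that the flag is handled once, outside the tokenizer, so it behaves the same way regardless of which tokenizer sits underneath.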

Example of a model that would need this described here: https://github.com/huggingface/transformers/issues/9518

Special care points

cc @LysandreJik @sgugger

theo-m commented 3 years ago

Discussed offline with @n1t0: our current decision is to wait for https://github.com/huggingface/tokenizers/issues/659 to be resolved before moving on with this issue. This is the better tradeoff, as the alternative would mean duplicating in transformers a lot of logic that already exists in tokenizers but is not exposed.